DevOps and Platform Engineer with 4+ years designing and operating Kubernetes-native systems across AWS and on-prem environments. Strong focus on reliability engineering, failure handling, and infrastructure consistency. Known for diagnosing complex infra failures, minimizing blast radius, and maintaining control plane stability under stress.
Each project is documented as an engineering breakdown: problem context, design decisions, trade-offs, and what I'd change.
Managed Kubernetes services (EKS, GKE) abstract too much. When production incidents happen, I need to understand what the control plane is actually doing — not just what the cloud vendor reports. TITAN exists to study cluster behavior under real failure conditions: node loss, network partition, etcd degradation, and resource exhaustion.
Three KVM guests on a single physical host. One control plane node, two workers. Libvirt manages VM lifecycle; networking is intentionally constrained to simulate real inter-node latency. The point is not to replicate production — it's to create a controlled environment where I can inject failures and observe cascading behavior without paging anyone at 2am.
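To make the failure injection concrete, here is a minimal sketch of the latency-injection step, assuming a libvirt bridge named virbr1 and illustrative delay values; TITAN's actual tooling and interface names may differ.

```python
# fault_inject.py - minimal latency-injection sketch for a libvirt testbed.
# Assumptions (illustrative, not TITAN's real config): virbr1 is the bridge
# carrying inter-node traffic, and this runs as root on the host.
import subprocess

def tc(*args: str) -> None:
    """Run a tc(8) command, raising on failure."""
    subprocess.run(["tc", *args], check=True)

def add_latency(iface: str = "virbr1", delay_ms: int = 40, jitter_ms: int = 10) -> None:
    # netem delays egress traffic; applied on the bridge, it degrades every
    # inter-node path at once, which is enough to stress etcd heartbeats
    # and leader election.
    tc("qdisc", "add", "dev", iface, "root", "netem",
       "delay", f"{delay_ms}ms", f"{jitter_ms}ms")

def clear(iface: str = "virbr1") -> None:
    tc("qdisc", "del", "dev", iface, "root", "netem")

if __name__ == "__main__":
    add_latency()  # run the experiment, observe, then call clear()
```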
The brief was to ingest high-velocity telemetry events, fan them out across three storage backends (hot, cold, and search), and do it without managing servers. The real constraint: the cost model had to stay predictable, with no provisioned-capacity bills surfacing at the end of the month.
Kinesis Data Streams as the ingest buffer gives replay capability and decouples producers from consumers. Lambda consumers read off the stream's shards, one function per concern. DynamoDB serves hot reads (single-digit-millisecond latency SLA), S3 handles cold archival, and OpenSearch provides full-text query. The fan-out happens inside Lambda rather than via SNS, which reduces hop count and keeps the critical path tight.
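As a sketch, one per-concern consumer could look like the following; the table name and payload shape are placeholders, and the cold and search paths would be separate functions subscribed to the same stream.

```python
# hot_path_consumer.py - sketch of one per-concern consumer (the hot path).
# Table name and payload shape are assumptions, not the production values.
import base64
import json
import boto3

table = boto3.resource("dynamodb").Table("telemetry-hot")  # assumed name

def handler(event, context):
    with table.batch_writer() as batch:
        for record in event["Records"]:
            # Kinesis delivers payloads base64-encoded inside the event.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            batch.put_item(Item={
                "event_id": record["kinesis"]["sequenceNumber"],
                **payload,  # assumes JSON events with DynamoDB-safe types
            })
```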
Client had 14 microservices communicating over plain HTTP inside a Kubernetes cluster. Compliance requirements mandated in-transit encryption with mutual authentication — not just TLS, but provable service identity. The team had no PKI experience and needed an operational solution, not a proof of concept.
Internal CA provisioned via cert-manager on the cluster. Certificates issued per-service using workload identity (SPIFFE). mTLS enforced at the sidecar proxy layer — no application code changes required. Certificate rotation automated with a 24-hour TTL; rotation failures trigger PagerDuty before expiry. Root CA stored offline; intermediate CA rotated annually.
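A minimal sketch of the expiry guard behind that alert, assuming a 4-hour safety margin; the real threshold and the PagerDuty wiring are omitted.

```python
# cert_expiry_check.py - sketch of the rotation-failure guard. With a 24h
# TTL, cert-manager should renew well before expiry; if remaining lifetime
# drops below the margin, rotation has silently failed and we page.
# The 4-hour margin is an assumed value, not the client's real threshold.
import datetime
from cryptography import x509  # requires cryptography >= 42 for *_utc

ALERT_MARGIN = datetime.timedelta(hours=4)

def rotation_failed(pem_bytes: bytes) -> bool:
    cert = x509.load_pem_x509_certificate(pem_bytes)
    remaining = cert.not_valid_after_utc - datetime.datetime.now(datetime.timezone.utc)
    return remaining < ALERT_MARGIN
```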
A 21-chapter practitioner-focused manual covering infrastructure resilience from first principles. Five main parts plus a philosophy chapter and a bridge chapter, written from operational experience, not theory. Covers failure modes, recovery design, observability, runbook discipline, and the human factors in incident response.
Most resilience content is either too abstract (academic papers) or too narrow (vendor documentation). SRIM is the resource I wish had existed when I started dealing with production incidents: opinionated, practitioner-grade, and grounded in systems that actually failed. Writing it forced me to articulate assumptions I'd been operating on implicitly for years.
Invoice processing is a high-volume, error-prone manual operation in most finance workflows. The goal was to build a production-ready serverless pipeline that handles raw document uploads, extracts structured fields using real AI services, and persists results — all with pay-per-use cost characteristics and zero server management.
Two Lambda functions — one for upload handling (API Gateway → S3), one for inference — decoupled via S3 event trigger. Textract handles OCR and form extraction; Bedrock (Claude 3 Haiku) runs intelligent field normalization for vendor names, amounts, and dates that rule-based extraction misses. Results land in DynamoDB for downstream consumption. All infra is modular Terraform — six discrete modules covering compute, storage, IAM, API Gateway, and both Lambda functions.
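A sketch of the inference function's control flow follows; the table name and prompt are placeholders, and the production prompt, schema validation, and error handling are more involved.

```python
# inference_handler.py - sketch of the second Lambda (S3 event -> Textract ->
# Bedrock -> DynamoDB). Table name and prompt are illustrative placeholders.
import json
import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")
table = boto3.resource("dynamodb").Table("invoices")  # assumed name

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def handler(event, context):
    for rec in event["Records"]:  # S3 put events from the upload Lambda
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]

        # Step 1: OCR and key-value extraction from the raw document.
        ocr = textract.analyze_expense(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )

        # Step 2: have the model normalize the vendor/amount/date fields
        # that rule-based extraction tends to miss.
        resp = bedrock.invoke_model(
            modelId=MODEL_ID,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 512,
                "messages": [{
                    "role": "user",
                    "content": "Normalize vendor, amount, and date from this "
                               "Textract output; reply with bare JSON: "
                               + json.dumps(ocr["ExpenseDocuments"])[:8000],
                }],
            }),
        )
        # Assumes the model returns bare JSON; production would validate.
        fields = json.loads(json.loads(resp["body"].read())["content"][0]["text"])

        # Step 3: persist the structured result for downstream consumers.
        table.put_item(Item={"invoice_id": key, **fields})
```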
Among the documented v2 items: a confidence_breakdown field per extracted field, which would make result quality auditable downstream.
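For illustration, a hypothetical shape for that field; none of these names or values are from the pipeline.

```python
# Hypothetical shape for the proposed v2 confidence_breakdown field:
# per-field provenance plus a score, so consumers can gate on quality.
record = {
    "invoice_id": "inv-2024-0031",  # illustrative value
    "vendor": "Acme Ltd",
    "confidence_breakdown": {
        "vendor": {"source": "bedrock",  "confidence": 0.93},
        "amount": {"source": "textract", "confidence": 0.99},
        "date":   {"source": "bedrock",  "confidence": 0.81},
    },
}
```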
Opinionated takes from real operational experience. Not definitions — how I actually approach problems.
Grouped by function. Everything listed has been used in production systems.
I'm a DevOps and Platform Engineer based in Lagos, Nigeria, with 4+ years of production experience across AWS, Kubernetes, and distributed infrastructure. My work sits at the intersection of systems reliability, platform automation, and operational engineering.
I'm not a tool operator — I'm an engineer who understands why systems fail and how to build them so that when they do, the damage is contained and the recovery is fast. I've designed mTLS identity architectures, built serverless AI pipelines, operated EKS and bare-metal Kubernetes, and written the runbooks that others follow.
Currently building toward AI Platform Engineering — specifically the infrastructure layer for inference systems: GPU-aware scheduling, serving stacks (vLLM, Triton), and AI-specific observability. Also actively building TITAN, a bare-metal Kubernetes research testbed focused on distributed system behavior and fault tolerance semantics.
BSc Computer Science, Crawford University Nigeria (CGPA 4.3/5.0). Open to senior roles with relocation to Europe or Canada.
Available for senior DevOps, Platform Engineering, and MLOps roles. Open to remote positions and relocation to Europe or Canada. Response time: usually within 24 hours.