AVAILABLE for senior roles
Lagos, Nigeria → Remote / Relocation
CKA · CKAD · AWS · Terraform
DevOps · MLOps · Platform Engineering

Amanyi Daniel


DevOps and Platform Engineer with 4+ years designing and operating Kubernetes-native systems across AWS and on-prem environments. Strong focus on reliability engineering, failure handling, and infrastructure consistency. Known for diagnosing complex infra failures, minimizing blast radius, and maintaining control plane stability under stress.

4+ Years Production
7 Certifications
5 Projects Shipped
Incidents Survived

Project Case Studies

Each project documented as an engineering breakdown — problem context, design decisions, trade-offs, and what I'd change.

PROJ-001 · INFRASTRUCTURE RESEARCH
TITAN — Bare-Metal Kubernetes Testbed
Fault tolerance semantics and controlled failure injection on a 3-node KVM/libvirt cluster
In Progress · Kubernetes · Bare-Metal

Managed Kubernetes services (EKS, GKE) abstract too much. When production incidents happen, I need to understand what the control plane is actually doing — not just what the cloud vendor reports. TITAN exists to study cluster behavior under real failure conditions: node loss, network partition, etcd degradation, and resource exhaustion.

Three KVM guests on a single physical host. One control plane node, two workers. Libvirt manages VM lifecycle; networking is intentionally constrained to simulate real inter-node latency. The point is not to replicate production — it's to create a controlled environment where I can inject failures and observe cascading behavior without paging anyone at 2am.
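
A single injection run is deliberately boring. A minimal sketch of what one looks like at the hypervisor layer, assuming virsh access on the host; the domain name and hold window are illustrative, and observation happens out-of-band against the runbook:

    import datetime
    import subprocess
    import time

    DOMAIN = "titan-worker-2"   # hypothetical libvirt domain name
    HOLD_SECONDS = 300          # long enough for NotReady detection and pod eviction

    def virsh(*args: str) -> None:
        # Thin wrapper over the host's virsh CLI; raises if the command fails.
        subprocess.run(["virsh", *args], check=True, capture_output=True, text=True)

    print(datetime.datetime.now().isoformat(), "suspending", DOMAIN)
    virsh("suspend", DOMAIN)    # freezes vCPUs; to the control plane this looks like abrupt node loss
    time.sleep(HOLD_SECONDS)
    virsh("resume", DOMAIN)
    print(datetime.datetime.now().isoformat(), "resumed", DOMAIN)
    # Observation happens elsewhere: kubectl get nodes -w, etcd metrics, Grafana dashboards.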

KVM / libvirt · kubeadm · Cilium CNI · Prometheus · Grafana · Chaos Mesh · etcd
  • Single-host colocation — failure domains overlap (intentional for this phase)
  • No cloud networking — must simulate latency and partition manually
  • Runbook-first discipline: every failure injection has a pre-defined runbook before execution
  • Phase (-1) is purely a fluency gate — no research claims until baseline competency is verified
⚠ What I'd Do Differently
Three nodes on one host gives you operational control but loses true failure domain isolation. The next evolution requires physical separation — either dedicated mini-PCs or rented bare-metal. I'm documenting this limitation explicitly so that any research output from TITAN carries the correct caveats.
PROJ-002 · DATA PIPELINE
Telemetry Streamline — Serverless Event Pipeline
Real-time telemetry ingestion on AWS: Kinesis → Lambda → DynamoDB / S3 / OpenSearch
Shipped · AWS · Serverless

The brief was to ingest high-velocity telemetry events, fan them out across three storage backends (hot, cold, and search), and do it without managing servers. The real constraint: the cost model had to stay predictable — no provisioned capacity bills appearing at the end of the month.

Kinesis Data Streams as the ingest buffer gives replay capability and decouples producers from consumers. Lambda consumers read off the stream — one function per concern. DynamoDB for hot read access (single-digit ms latency SLA), S3 for cold archival, OpenSearch for full-text query. The fan-out happens at the Lambda layer, not via SNS — this reduces hop count and keeps the critical path tight.
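
A rough sketch of what one of those consumers looks like (the hot-path writer; the archiver and indexer are sibling functions on the same stream), with placeholder table and field names rather than the project's actual identifiers:

    import base64
    import json
    import boto3

    table = boto3.resource("dynamodb").Table("telemetry-hot")   # placeholder table name

    def handler(event, context):
        # Kinesis delivers records base64-encoded; decode, then batch-write the hot copy.
        with table.batch_writer() as batch:
            for record in event["Records"]:
                payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
                batch.put_item(Item={
                    "pk": payload["device_id"],
                    "sk": payload["ts"],
                    **payload,
                })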

AWS Kinesis · Lambda (Python) · DynamoDB · S3 · OpenSearch · IAM · CloudWatch
  • Lambda cold start latency on the OpenSearch write path needed a keep-warm strategy
  • DynamoDB write amplification from hot partition keys — resolved with a composite key design (sketched after this list)
  • IAM scope creep: the initial role was too permissive; hardened to resource-level least privilege
  • OpenSearch cluster sizing: started too small; an index rotation policy had to be introduced mid-flight
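
The composite key fix, sketched with hypothetical names: the partition key carries a hashed write-shard suffix so one chatty device spreads its writes across several partitions instead of hammering a single one.

    import hashlib

    WRITE_SHARDS = 10   # synthetic partitions per hot device; tuned to write volume

    def composite_key(device_id: str, event_id: str, ts: str) -> dict:
        # Stable hash of the event ID picks the shard, so the same event always
        # lands in the same partition but a device's traffic is spread evenly.
        shard = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % WRITE_SHARDS
        return {
            "pk": f"{device_id}#{shard}",   # composite partition key
            "sk": f"{ts}#{event_id}",       # sort key keeps items time-ordered within a shard
        }

Reads for one device then fan out across the shard suffixes, which is acceptable here because the hot path is keyed lookups over a recent time window, not scans.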
⚠ Failure Log
First deployment had no dead-letter queue. An upstream schema change caused Lambda to throw silently — events dropped with no alert. Lesson: every event pipeline needs a DLQ and an alarm on it before anything else ships. This is now a checklist item I apply to every async architecture.
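
What that checklist item looks like in practice is roughly this: an on-failure destination on the stream's event source mapping, plus an alarm on the queue. The ARNs and names below are placeholders, not the pipeline's real resources.

    import boto3

    lambda_client = boto3.client("lambda")

    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:eu-west-1:123456789012:stream/telemetry",   # placeholder
        FunctionName="telemetry-opensearch-writer",                                 # placeholder
        StartingPosition="LATEST",
        BisectBatchOnFunctionError=True,    # split the batch to isolate the poison record
        MaximumRetryAttempts=3,             # stop retrying forever; hand the record to the DLQ
        DestinationConfig={
            "OnFailure": {"Destination": "arn:aws:sqs:eu-west-1:123456789012:telemetry-dlq"}
        },
    )
    # The other half of the lesson: a CloudWatch alarm on the DLQ's
    # ApproximateNumberOfMessagesVisible, otherwise the queue is just a quieter
    # place for events to disappear into.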
PROJ-003 · SECURITY ENGINEERING
PKI / mTLS Service Mesh Engagement
Zero-trust service identity for a client microservices environment using mutual TLS and internal CA
Consulting · Security · Kubernetes

The client had 14 microservices communicating over plain HTTP inside a Kubernetes cluster. Compliance requirements mandated in-transit encryption with mutual authentication — not just TLS, but provable service identity. The team had no PKI experience and needed an operational solution, not a proof of concept.

Internal CA provisioned via cert-manager on the cluster. Certificates issued per-service using workload identity (SPIFFE). mTLS enforced at the sidecar proxy layer — no application code changes required. Certificate rotation automated with a 24-hour TTL; rotation failures trigger PagerDuty before expiry. Root CA stored offline; intermediate CA rotated annually.
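
The sidecars do the handshake in the mesh, but it helps to see what mutual authentication means at the socket level. A standalone sketch with hypothetical certificate paths and hostname; the connection only succeeds if both ends present certificates chaining to the internal CA.

    import socket
    import ssl

    CA_BUNDLE   = "/etc/certs/ca.crt"          # internal CA chain (hypothetical paths)
    CLIENT_CERT = "/etc/certs/svc-client.crt"
    CLIENT_KEY  = "/etc/certs/svc-client.key"

    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=CA_BUNDLE)
    ctx.load_cert_chain(certfile=CLIENT_CERT, keyfile=CLIENT_KEY)   # present a client identity

    with socket.create_connection(("payments.internal", 8443), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname="payments.internal") as tls:
            # If the server demands a client certificate and ours chains to the
            # internal CA, the handshake succeeds; otherwise it fails loudly.
            print("negotiated", tls.version(), "with", tls.getpeercert()["subject"])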

cert-manager · Istio · SPIFFE/SPIRE · Vault PKI · Kubernetes · OpenSSL
  • Sidecar proxy adds ~15ms p99 latency per hop — acceptable given compliance requirement
  • Vault PKI backend chosen over self-managed CA for audit trail and secrets governance
  • Short-lived certs (24h) increase rotation ops burden — mitigated by full automation
  • SPIFFE over manual CN — harder to implement, but eliminates human error in cert naming
⚠ What Almost Went Wrong
During rollout, a misconfigured NetworkPolicy blocked the cert-manager webhook from reaching the API server. Services that failed cert renewal silently degraded instead of erroring loudly. Root cause: no integration test for cert issuance path in staging. Added a synthetic cert-request probe to the monitoring stack immediately after.
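
The rough shape of that probe, assuming the Python Kubernetes client and a ClusterIssuer named internal-intermediate (both hypothetical here): create a throwaway Certificate, wait for cert-manager to mark it Ready, clean up, and alert if it never does.

    import time
    import uuid
    from kubernetes import client, config

    config.load_incluster_config()
    api = client.CustomObjectsApi()
    NS, NAME = "cert-probe", f"probe-{uuid.uuid4().hex[:8]}"   # hypothetical namespace

    cert = {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": NAME, "namespace": NS},
        "spec": {
            "secretName": f"{NAME}-tls",
            "duration": "1h",
            "dnsNames": [f"{NAME}.{NS}.svc"],
            "issuerRef": {"name": "internal-intermediate", "kind": "ClusterIssuer"},
        },
    }
    api.create_namespaced_custom_object("cert-manager.io", "v1", NS, "certificates", cert)

    ready = False
    for _ in range(30):   # roughly a five-minute budget before the probe is declared failed
        obj = api.get_namespaced_custom_object("cert-manager.io", "v1", NS, "certificates", NAME)
        conds = obj.get("status", {}).get("conditions", [])
        ready = any(c["type"] == "Ready" and c["status"] == "True" for c in conds)
        if ready:
            break
        time.sleep(10)

    api.delete_namespaced_custom_object("cert-manager.io", "v1", NS, "certificates", NAME)
    print("issuance path healthy" if ready else "ALERT: cert issuance probe failed")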
PROJ-004 · PRACTITIONER WRITING
SRIM — Systems Resilience Infrastructure Manual
21-chapter practitioner manuscript on infrastructure resilience engineering
Writing · Resilience · Architecture

A 21-chapter practitioner-focused manual covering infrastructure resilience from first principles. Five main parts plus a philosophy chapter and bridge chapter — written from operational experience, not theory. Covers failure modes, recovery design, observability, runbook discipline, and the human factors in incident response.

Most resilience content is either too abstract (academic papers) or too narrow (vendor documentation). SRIM is the resource I wished existed when I started dealing with production incidents — opinionated, practitioner-grade, and grounded in systems that actually failed. Writing it forced me to articulate assumptions I'd been operating on implicitly for years.

PROJ-005 · AI/ML INFRASTRUCTURE
Smart Invoice AI Pipeline
Cloud-native ML document processing: S3 → Textract → Bedrock → DynamoDB, fully Terraform-managed
Shipped · AWS · AI/ML

Invoice processing is a high-volume, error-prone manual operation in most finance workflows. The goal was to build a production-ready serverless pipeline that handles raw document uploads, extracts structured fields using real AI services, and persists results — all with pay-per-use cost characteristics and zero server management.

Two Lambda functions — one for upload handling (API Gateway → S3), one for inference — decoupled via S3 event trigger. Textract handles OCR and form extraction; Bedrock (Claude 3 Haiku) runs intelligent field normalization for vendor names, amounts, and dates that rule-based extraction misses. Results land in DynamoDB for downstream consumption. All infra is modular Terraform — six discrete modules covering compute, storage, IAM, API Gateway, and both Lambda functions.
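
A condensed sketch of the inference handler's happy path; the bucket, table, prompt, and model ID are illustrative rather than the shipped values, and the real function wraps the Bedrock parse in validation since the model is not guaranteed to return bare JSON.

    import json
    import boto3

    textract = boto3.client("textract")
    bedrock = boto3.client("bedrock-runtime")
    table = boto3.resource("dynamodb").Table("invoices")   # placeholder table name

    def handler(event, context):
        rec = event["Records"][0]["s3"]
        bucket, key = rec["bucket"]["name"], rec["object"]["key"]

        # OCR and form extraction on the uploaded document
        ocr = textract.analyze_document(
            Document={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["FORMS"],
        )
        raw_text = " ".join(b["Text"] for b in ocr["Blocks"] if b["BlockType"] == "LINE")

        # AI normalization for the fields rule-based extraction misses
        resp = bedrock.invoke_model(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",   # illustrative model ID
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 512,
                "messages": [{"role": "user", "content":
                    f"Extract vendor, amount, currency and date as JSON from:\n{raw_text}"}],
            }),
        )
        fields = json.loads(json.loads(resp["body"].read())["content"][0]["text"])

        table.put_item(Item={"invoice_id": key, **fields})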

AWS Lambda (Python) · AWS Textract · Amazon Bedrock (Claude 3 Haiku) · S3 · DynamoDB · API Gateway · Terraform · IAM
  • Hybrid extraction strategy — rule-based for common patterns, AI for edge cases — keeps Bedrock costs to ~$0.0005/invoice
  • S3 event trigger over SNS: fewer hops, tighter coupling acceptable here since upload and inference are always paired
  • DynamoDB as output store: optimized for downstream single-key reads, not analytics — a warehouse would be added for reporting use
  • Modular Terraform over a monolith: each AWS resource category is independently versioned and reusable across projects
  • Total per-invoice cost: ~$0.002 — Textract dominates at $0.0015/page
⚠ What I'd Do Differently
No dead-letter queue on the inference Lambda in v1 — a Bedrock throttle or Textract failure would silently drop the event. Also, confidence scoring from Textract isn't surfaced in the final output schema yet; adding a confidence_breakdown field per extracted field would make result quality auditable downstream. Both are documented as v2 items.
↗ View on GitHub

How I Think About Systems

Opinionated takes from real operational experience. Not definitions — how I actually approach problems.

01
Observability is a design decision, not a retrofit
If you're adding metrics after an incident, you've already lost. Observability needs to be specified at design time — what questions will operators ask at 3am? Those questions determine your instrumentation. Logs that don't answer operational questions are just storage costs.
Observability
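A concrete version of that question-driven approach, sketched with prometheus_client and made-up metric names: instrument the answers to "are we dropping events?" and "how slow is the write path right now?" rather than whatever the framework emits by default.

    from prometheus_client import Counter, Histogram, start_http_server

    EVENTS_DROPPED = Counter("pipeline_events_dropped_total",
                             "Events discarded before reaching a backend", ["reason"])
    WRITE_LATENCY = Histogram("pipeline_backend_write_seconds",
                              "Backend write latency", ["backend"])

    def write_hot(item):
        # Timing wraps the real write so the p99 an operator asks about exists.
        with WRITE_LATENCY.labels(backend="dynamodb").time():
            ...  # actual write goes here

    start_http_server(9100)   # scrape target for Prometheus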
02
Runbooks that aren't tested aren't runbooks
A runbook written during a postmortem and never executed again is documentation theater. Every runbook in my environments has an associated drill schedule. If the steps are wrong when it's calm, they're catastrophically wrong when it's not. This is the TITAN doctrine applied operationally.
Operations
03
Toil is a capacity problem disguised as a process problem
Teams that do a lot of manual operational work usually frame it as "we just need better runbooks." The real problem is that every hour spent on toil is an hour not spent on reliability improvements. Automation isn't about laziness — it's about preserving engineering capacity for work that actually requires thinking.
Platform Engineering
04
Blast radius matters more than failure probability
I care less about how likely a failure is than about how much damage it causes. A 0.01% probability event that takes down your entire data plane is more dangerous than a 10% probability event affecting one service. Architecture decisions should be driven by consequence, not likelihood.
Reliability
05
CI/CD pipelines are the first indicator of team health
Show me a team's deployment pipeline and I can tell you how they work. Flaky tests that "everyone knows to ignore," multi-hour build times, no staging parity — these aren't CI/CD problems. They're signals about code ownership, test discipline, and whether engineering leadership is measuring the right things.
CI/CD
06
Kubernetes is not a deployment target — it's an operating model
Organizations that treat Kubernetes as "just a better way to deploy containers" consistently struggle with it. The operational model — declarative state, controller reconciliation, pod lifecycle — requires a different mental model than traditional server-based thinking. You're not managing machines anymore; you're managing desired state.
Infrastructure

Technical Stack

Grouped by function. Everything listed has been used in production systems.

☁ Cloud Platforms
AWS · GCP · Azure
⎈ Container Orchestration
Kubernetes (EKS) · Rancher · Bare-Metal K8s · Helm · ArgoCD · Docker
⚙ Infrastructure as Code
Terraform · Ansible · CloudFormation · Pulumi
📡 Observability
Prometheus · Grafana · OpenTelemetry · Loki · CloudWatch · PagerDuty
🔄 CI/CD
Jenkins · GitHub Actions · GitLab CI · ArgoCD · Tekton
🔐 Security & Networking
cert-manager · Vault · Istio · Cilium · SPIFFE · OPA
🤖 AI / ML Platform (Emerging)
vLLM · Triton · Kubeflow · MLflow · Ray
💻 Languages & Scripting
Python · Bash · Go · YAML · HCL

Certifications

CKA
Certified Kubernetes Administrator
CNCF / Linux Foundation
CKAD
Certified Kubernetes Application Developer
CNCF / Linux Foundation
SAA
AWS Solutions Architect Associate
Amazon Web Services
DVA
AWS Developer Associate
Amazon Web Services
SOA
AWS SysOps Administrator Associate
Amazon Web Services
TF
HashiCorp Terraform Associate
HashiCorp
N+
CompTIA Network+
CompTIA

Experience

Jan 2026 – Present
DevOps / Platform Engineer — Consultant
Networksx (Australia)
  • Designed mTLS-based identity architecture for distributed AI agent communication — evaluated CRL vs OCSP trade-offs under failure and latency conditions
  • Built certificate lifecycle and key rotation strategies for large-scale deployments; identified trust-chain and control-plane failure risks
  • Advised on resilience patterns and blast radius containment strategies across distributed systems
May 2024 – May 2025
DevOps Engineer
Capshall (UK)
  • Designed and operated AWS infrastructure with Terraform across multiple environments — resolved state drift, enforced IaC consistency
  • Built CI/CD pipelines (Jenkins, CodePipeline) improving deployment reliability and eliminating manual intervention
  • Managed Kubernetes workloads on EKS: resolved scheduling issues, resource contention, and networking misconfigurations
  • Deployed observability stack (Prometheus, Grafana, ELK) for real-time monitoring and incident debugging
  • Hardened IAM boundaries and improved reliability across production workloads
Oct 2022 – Apr 2024
Cloud Engineer
Cloudboosta (UK)
  • Provisioned multi-AZ AWS infrastructure with Terraform and CloudFormation, reducing manual provisioning overhead
  • Supported serverless and microservices systems across Lambda, API Gateway, and ECS
  • Designed secure VPC architectures, IAM policies, and network isolation strategies
  • Optimized cloud cost and performance through right-sizing and resource planning
Jun 2020 – Oct 2022
Systems Engineer
Urban Cloudworks
  • Managed Linux-based infrastructure and web hosting environments (Nginx, Apache)
  • Automated operational workflows using Bash and Python
  • Maintained system uptime and resolved production issues across client environments

About

I'm a DevOps and Platform Engineer based in Lagos, Nigeria, with 4+ years of production experience across AWS, Kubernetes, and distributed infrastructure. My work sits at the intersection of systems reliability, platform automation, and operational engineering.

I'm not a tool operator — I'm an engineer who understands why systems fail and how to build them so that when they do, the damage is contained and the recovery is fast. I've designed mTLS identity architectures, built serverless AI pipelines, operated EKS and bare-metal Kubernetes, and written the runbooks that others follow.

Currently building toward AI Platform Engineering — specifically the infrastructure layer for inference systems: GPU-aware scheduling, serving stacks (vLLM, Triton), and AI-specific observability. Also actively building TITAN, a bare-metal Kubernetes research testbed focused on distributed system behavior and fault tolerance semantics.

BSc Computer Science, Crawford University Nigeria (CGPA 4.3/5.0). Open to senior roles with relocation to Europe or Canada.

Ownership over process
I take end-to-end responsibility for systems I build. If something I deployed caused an incident, I'm the first call — not the last.
Depth over breadth
I'd rather understand five tools deeply than have surface familiarity with twenty. Real operational judgment comes from seeing the failure modes, not the happy path.
Production is the only ground truth
Staging environments lie. The real behavior of a distributed system only emerges under real load, real failures, and real operators making real decisions under pressure.
Documentation is engineering output
Runbooks, architecture decisions, postmortems — these are as much a part of the system as the code. A system that only one person understands is a liability.

If you're building serious infrastructure, let's talk.

Available for senior DevOps, Platform Engineering, and MLOps roles. Open to remote positions and relocation to Europe or Canada. Response time: usually within 24 hours.