AVAILABLE for senior roles
Lagos, Nigeria → Remote / Relocation
CKA · CKAD · AWS · Terraform
DevOps · MLOps · Platform Engineering

Amanyi Daniel


DevOps and Platform Engineer with 4+ years designing and operating Kubernetes-native systems across AWS and on-prem environments. Strong focus on reliability engineering, failure handling, and infrastructure consistency. Known for diagnosing complex infra failures, minimizing blast radius, and maintaining control plane stability under stress.

4+ Years Production
7 Certifications
5 Projects Shipped
Incidents Survived

Project Case Studies

Each project documented as an engineering breakdown — problem context, design decisions, trade-offs, and what I'd change.

PROJ-001 · INFRASTRUCTURE RESEARCH
TITAN — Bare-Metal Kubernetes Testbed
Fault tolerance semantics and controlled failure injection on a 3-node KVM/libvirt cluster
In Progress · Kubernetes · Bare-Metal

Managed Kubernetes services (EKS, GKE) abstract too much. When production incidents happen, I need to understand what the control plane is actually doing — not just what the cloud vendor reports. TITAN exists to study cluster behavior under real failure conditions: node loss, network partition, etcd degradation, and resource exhaustion.

Three KVM guests on a single physical host. One control plane node, two workers. Libvirt manages VM lifecycle; networking is intentionally constrained to simulate real inter-node latency. The point is not to replicate production — it's to create a controlled environment where I can inject failures and observe cascading behavior without paging anyone at 2am.
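
A single injection run is deliberately boring. A minimal sketch of what one looks like at the hypervisor layer, assuming virsh access on the host; the domain name and hold window are illustrative, and observation happens out-of-band against the runbook:

    import datetime
    import subprocess
    import time

    DOMAIN = "titan-worker-2"   # hypothetical libvirt domain name
    HOLD_SECONDS = 300          # long enough for NotReady detection and pod eviction

    def virsh(*args: str) -> None:
        # Thin wrapper over the host's virsh CLI; raises if the command fails.
        subprocess.run(["virsh", *args], check=True, capture_output=True, text=True)

    print(datetime.datetime.now().isoformat(), "suspending", DOMAIN)
    virsh("suspend", DOMAIN)    # freezes vCPUs; to the control plane this looks like abrupt node loss
    time.sleep(HOLD_SECONDS)
    virsh("resume", DOMAIN)
    print(datetime.datetime.now().isoformat(), "resumed", DOMAIN)
    # Observation happens elsewhere: kubectl get nodes -w, etcd metrics, Grafana dashboards.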

KVM / libvirt · kubeadm · Cilium CNI · Prometheus · Grafana · Chaos Mesh · etcd
  • Single-host colocation — failure domains overlap (intentional for this phase)
  • No cloud networking — must simulate latency and partition manually
  • Runbook-first discipline: every failure injection has a pre-defined runbook before execution
  • Phase (-1) is purely a fluency gate — no research claims until baseline competency is verified
⚠ What I'd Do Differently
Three nodes on one host gives you operational control but loses true failure domain isolation. The next evolution requires physical separation — either dedicated mini-PCs or rented bare-metal. I'm documenting this limitation explicitly so that any research output from TITAN carries the correct caveats.
PROJ-002 · DATA PIPELINE
Telemetry Streamline — Serverless Event Pipeline
Real-time telemetry ingestion on AWS: Kinesis → Lambda → DynamoDB / S3 / OpenSearch
Shipped · AWS · Serverless

The brief was to ingest high-velocity telemetry events, fan them out across three storage backends (hot, cold, and search), and do it without managing servers. The real constraint: the cost model had to stay predictable — no provisioned capacity bills appearing at the end of the month.

Kinesis Data Streams as the ingest buffer gives replay capability and decouples producers from consumers. Lambda consumers read off the stream — one function per concern. DynamoDB for hot read access (single-digit ms latency SLA), S3 for cold archival, OpenSearch for full-text query. The fan-out happens at the Lambda layer, not via SNS — this reduces hop count and keeps the critical path tight.
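
A rough sketch of what one of those consumers looks like (the hot-path writer; the archiver and indexer are sibling functions on the same stream), with placeholder table and field names rather than the project's actual identifiers:

    import base64
    import json
    import boto3

    table = boto3.resource("dynamodb").Table("telemetry-hot")   # placeholder table name

    def handler(event, context):
        # Kinesis delivers records base64-encoded; decode, then batch-write the hot copy.
        with table.batch_writer() as batch:
            for record in event["Records"]:
                payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
                batch.put_item(Item={
                    "pk": payload["device_id"],
                    "sk": payload["ts"],
                    **payload,
                })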

AWS Kinesis · Lambda (Python) · DynamoDB · S3 · OpenSearch · IAM · CloudWatch
  • Lambda cold start latency on the OpenSearch write path needed a keep-warm strategy
  • DynamoDB write amplification from hot partition keys — resolved with a composite key design (sketched after this list)
  • IAM scope creep: the initial role was too permissive; hardened to resource-level least privilege
  • OpenSearch cluster sizing: started too small; an index rotation policy had to be introduced mid-flight
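
The composite key fix, sketched with hypothetical names: the partition key carries a hashed write-shard suffix so one chatty device spreads its writes across several partitions instead of hammering a single one.

    import hashlib

    WRITE_SHARDS = 10   # synthetic partitions per hot device; tuned to write volume

    def composite_key(device_id: str, event_id: str, ts: str) -> dict:
        # Stable hash of the event ID picks the shard, so the same event always
        # lands in the same partition but a device's traffic is spread evenly.
        shard = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % WRITE_SHARDS
        return {
            "pk": f"{device_id}#{shard}",   # composite partition key
            "sk": f"{ts}#{event_id}",       # sort key keeps items time-ordered within a shard
        }

Reads for one device then fan out across the shard suffixes, which is acceptable here because the hot path is keyed lookups over a recent time window, not scans.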
⚠ Failure Log
First deployment had no dead-letter queue. An upstream schema change caused Lambda to throw silently — events dropped with no alert. Lesson: every event pipeline needs a DLQ and an alarm on it before anything else ships. This is now a checklist item I apply to every async architecture.
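
What that checklist item looks like in practice is roughly this: an on-failure destination on the stream's event source mapping, plus an alarm on the queue. The ARNs and names below are placeholders, not the pipeline's real resources.

    import boto3

    lambda_client = boto3.client("lambda")

    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:eu-west-1:123456789012:stream/telemetry",   # placeholder
        FunctionName="telemetry-opensearch-writer",                                 # placeholder
        StartingPosition="LATEST",
        BisectBatchOnFunctionError=True,    # split the batch to isolate the poison record
        MaximumRetryAttempts=3,             # stop retrying forever; hand the record to the DLQ
        DestinationConfig={
            "OnFailure": {"Destination": "arn:aws:sqs:eu-west-1:123456789012:telemetry-dlq"}
        },
    )
    # The other half of the lesson: a CloudWatch alarm on the DLQ's
    # ApproximateNumberOfMessagesVisible, otherwise the queue is just a quieter
    # place for events to disappear into.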
PROJ-003 · SECURITY ENGINEERING
PKI / mTLS Service Mesh Engagement
Zero-trust service identity for a client microservices environment using mutual TLS and internal CA
Consulting · Security · Kubernetes

The client had 14 microservices communicating over plain HTTP inside a Kubernetes cluster. Compliance requirements mandated in-transit encryption with mutual authentication — not just TLS, but provable service identity. The team had no PKI experience and needed an operational solution, not a proof of concept.

Internal CA provisioned via cert-manager on the cluster. Certificates issued per-service using workload identity (SPIFFE). mTLS enforced at the sidecar proxy layer — no application code changes required. Certificate rotation automated with a 24-hour TTL; rotation failures trigger PagerDuty before expiry. Root CA stored offline; intermediate CA rotated annually.
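
The sidecars do the handshake in the mesh, but it helps to see what mutual authentication means at the socket level. A standalone sketch with hypothetical certificate paths and hostname; the connection only succeeds if both ends present certificates chaining to the internal CA.

    import socket
    import ssl

    CA_BUNDLE   = "/etc/certs/ca.crt"          # internal CA chain (hypothetical paths)
    CLIENT_CERT = "/etc/certs/svc-client.crt"
    CLIENT_KEY  = "/etc/certs/svc-client.key"

    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=CA_BUNDLE)
    ctx.load_cert_chain(certfile=CLIENT_CERT, keyfile=CLIENT_KEY)   # present a client identity

    with socket.create_connection(("payments.internal", 8443), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname="payments.internal") as tls:
            # If the server demands a client certificate and ours chains to the
            # internal CA, the handshake succeeds; otherwise it fails loudly.
            print("negotiated", tls.version(), "with", tls.getpeercert()["subject"])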

cert-manager · Istio · SPIFFE/SPIRE · Vault PKI · Kubernetes · OpenSSL
  • Sidecar proxy adds ~15ms p99 latency per hop — acceptable given compliance requirement
  • Vault PKI backend chosen over self-managed CA for audit trail and secrets governance
  • Short-lived certs (24h) increase rotation ops burden — mitigated by full automation
  • SPIFFE over manual CN — harder to implement, but eliminates human error in cert naming
⚠ What Almost Went Wrong
During rollout, a misconfigured NetworkPolicy blocked the cert-manager webhook from reaching the API server. Services that failed cert renewal silently degraded instead of erroring loudly. Root cause: no integration test for cert issuance path in staging. Added a synthetic cert-request probe to the monitoring stack immediately after.
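
The rough shape of that probe, assuming the Python Kubernetes client and a ClusterIssuer named internal-intermediate (both hypothetical here): create a throwaway Certificate, wait for cert-manager to mark it Ready, clean up, and alert if it never does.

    import time
    import uuid
    from kubernetes import client, config

    config.load_incluster_config()
    api = client.CustomObjectsApi()
    NS, NAME = "cert-probe", f"probe-{uuid.uuid4().hex[:8]}"   # hypothetical namespace

    cert = {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": NAME, "namespace": NS},
        "spec": {
            "secretName": f"{NAME}-tls",
            "duration": "1h",
            "dnsNames": [f"{NAME}.{NS}.svc"],
            "issuerRef": {"name": "internal-intermediate", "kind": "ClusterIssuer"},
        },
    }
    api.create_namespaced_custom_object("cert-manager.io", "v1", NS, "certificates", cert)

    ready = False
    for _ in range(30):   # roughly a five-minute budget before the probe is declared failed
        obj = api.get_namespaced_custom_object("cert-manager.io", "v1", NS, "certificates", NAME)
        conds = obj.get("status", {}).get("conditions", [])
        ready = any(c["type"] == "Ready" and c["status"] == "True" for c in conds)
        if ready:
            break
        time.sleep(10)

    api.delete_namespaced_custom_object("cert-manager.io", "v1", NS, "certificates", NAME)
    print("issuance path healthy" if ready else "ALERT: cert issuance probe failed")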
PROJ-004 · PRACTITIONER WRITING
SRIM — Systems Resilience Infrastructure Manual
21-chapter practitioner manuscript on infrastructure resilience engineering
Writing · Resilience · Architecture

A 21-chapter practitioner-focused manual covering infrastructure resilience from first principles. Five main parts plus a philosophy chapter and bridge chapter — written from operational experience, not theory. Covers failure modes, recovery design, observability, runbook discipline, and the human factors in incident response.

Most resilience content is either too abstract (academic papers) or too narrow (vendor documentation). SRIM is the resource I wished existed when I started dealing with production incidents — opinionated, practitioner-grade, and grounded in systems that actually failed. Writing it forced me to articulate assumptions I'd been operating on implicitly for years.

PROJ-005 · AI/ML INFRASTRUCTURE
Smart Invoice AI Pipeline
Cloud-native ML document processing: S3 → Textract → Bedrock → DynamoDB, fully Terraform-managed
Shipped · AWS · AI/ML

Invoice processing is a high-volume, error-prone manual operation in most finance workflows. The goal was to build a production-ready serverless pipeline that handles raw document uploads, extracts structured fields using real AI services, and persists results — all with pay-per-use cost characteristics and zero server management.

Two Lambda functions — one for upload handling (API Gateway → S3), one for inference — decoupled via S3 event trigger. Textract handles OCR and form extraction; Bedrock (Claude 3 Haiku) runs intelligent field normalization for vendor names, amounts, and dates that rule-based extraction misses. Results land in DynamoDB for downstream consumption. All infra is modular Terraform — six discrete modules covering compute, storage, IAM, API Gateway, and both Lambda functions.
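
A condensed sketch of the inference handler's happy path; the bucket, table, prompt, and model ID are illustrative rather than the shipped values, and the real function wraps the Bedrock parse in validation since the model is not guaranteed to return bare JSON.

    import json
    import boto3

    textract = boto3.client("textract")
    bedrock = boto3.client("bedrock-runtime")
    table = boto3.resource("dynamodb").Table("invoices")   # placeholder table name

    def handler(event, context):
        rec = event["Records"][0]["s3"]
        bucket, key = rec["bucket"]["name"], rec["object"]["key"]

        # OCR and form extraction on the uploaded document
        ocr = textract.analyze_document(
            Document={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["FORMS"],
        )
        raw_text = " ".join(b["Text"] for b in ocr["Blocks"] if b["BlockType"] == "LINE")

        # AI normalization for the fields rule-based extraction misses
        resp = bedrock.invoke_model(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",   # illustrative model ID
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 512,
                "messages": [{"role": "user", "content":
                    f"Extract vendor, amount, currency and date as JSON from:\n{raw_text}"}],
            }),
        )
        fields = json.loads(json.loads(resp["body"].read())["content"][0]["text"])

        table.put_item(Item={"invoice_id": key, **fields})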

AWS Lambda (Python) · AWS Textract · Amazon Bedrock (Claude 3 Haiku) · S3 · DynamoDB · API Gateway · Terraform · IAM
  • Hybrid extraction strategy — rule-based for common patterns, AI for edge cases — keeps Bedrock costs to ~$0.0005/invoice
  • S3 event trigger over SNS: fewer hops, tighter coupling acceptable here since upload and inference are always paired
  • DynamoDB as output store: optimized for downstream single-key reads, not analytics — a warehouse would be added for reporting use
  • Modular Terraform over a monolith: each AWS resource category is independently versioned and reusable across projects
  • Total per-invoice cost: ~$0.002 — Textract dominates at $0.0015/page
⚠ What I'd Do Differently
No dead-letter queue on the inference Lambda in v1 — a Bedrock throttle or Textract failure would silently drop the event. Also, confidence scoring from Textract isn't surfaced in the final output schema yet; adding a confidence_breakdown field per extracted field would make result quality auditable downstream. Both are documented as v2 items.
↗ View on GitHub

How I Think About Systems

Opinionated takes from real operational experience. Not definitions — how I actually approach problems.

01
Observability is a design decision, not a retrofit
If you're adding metrics after an incident, you've already lost. Observability needs to be specified at design time — what questions will operators ask at 3am? Those questions determine your instrumentation. Logs that don't answer operational questions are just storage costs.
Observability
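A concrete version of that question-driven approach, sketched with prometheus_client and made-up metric names: instrument the answers to "are we dropping events?" and "how slow is the write path right now?" rather than whatever the framework emits by default.

    from prometheus_client import Counter, Histogram, start_http_server

    EVENTS_DROPPED = Counter("pipeline_events_dropped_total",
                             "Events discarded before reaching a backend", ["reason"])
    WRITE_LATENCY = Histogram("pipeline_backend_write_seconds",
                              "Backend write latency", ["backend"])

    def write_hot(item):
        # Timing wraps the real write so the p99 an operator asks about exists.
        with WRITE_LATENCY.labels(backend="dynamodb").time():
            ...  # actual write goes here

    start_http_server(9100)   # scrape target for Prometheus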
02
Runbooks that aren't tested aren't runbooks
A runbook written during a postmortem and never executed again is documentation theater. Every runbook in my environments has an associated drill schedule. If the steps are wrong when it's calm, they're catastrophically wrong when it's not. This is the TITAN doctrine applied operationally.
Operations
03
Toil is a capacity problem disguised as a process problem
Teams that do a lot of manual operational work usually frame it as "we just need better runbooks." The real problem is that every hour spent on toil is an hour not spent on reliability improvements. Automation isn't about laziness — it's about preserving engineering capacity for work that actually requires thinking.
Platform Engineering
04
Blast radius matters more than failure probability
I care less about how likely a failure is than about how much damage it causes. A 0.01% probability event that takes down your entire data plane is more dangerous than a 10% probability event affecting one service. Architecture decisions should be driven by consequence, not likelihood.
Reliability
05
CI/CD pipelines are the first indicator of team health
Show me a team's deployment pipeline and I can tell you how they work. Flaky tests that "everyone knows to ignore," multi-hour build times, no staging parity — these aren't CI/CD problems. They're signals about code ownership, test discipline, and whether engineering leadership is measuring the right things.
CI/CD
06
Kubernetes is not a deployment target — it's an operating model
Organizations that treat Kubernetes as "just a better way to deploy containers" consistently struggle with it. The operational model — declarative state, controller reconciliation, pod lifecycle — requires a different mental model than traditional server-based thinking. You're not managing machines anymore; you're managing desired state.
Infrastructure

Technical Stack

Grouped by function. Everything listed has been used in production systems.

☁ Cloud Platforms
AWS · GCP · Azure
⎈ Container Orchestration
Kubernetes (EKS) · Rancher · Bare-Metal K8s · Helm · ArgoCD · Docker
⚙ Infrastructure as Code
Terraform · Ansible · CloudFormation · Pulumi
📡 Observability
Prometheus · Grafana · OpenTelemetry · Loki · CloudWatch · PagerDuty
🔄 CI/CD
Jenkins · GitHub Actions · GitLab CI · ArgoCD · Tekton
🔐 Security & Networking
cert-manager · Vault · Istio · Cilium · SPIFFE · OPA
🤖 AI / ML Platform (Emerging)
vLLM · Triton · Kubeflow · MLflow · Ray
💻 Languages & Scripting
Python · Bash · Go · YAML · HCL

Certifications

CKA
Certified Kubernetes Administrator
CNCF / Linux Foundation
CKAD
Certified Kubernetes Application Developer
CNCF / Linux Foundation
SAA
AWS Solutions Architect Associate
Amazon Web Services
DVA
AWS Developer Associate
Amazon Web Services
SOA
AWS SysOps Administrator Associate
Amazon Web Services
TF
HashiCorp Terraform Associate
HashiCorp
N+
CompTIA Network+
CompTIA

Experience

Jan 2026 – Present
DevOps / Platform Engineer — Consultant
Networksx (Australia)
  • Designed mTLS-based identity architecture for distributed AI agent communication — evaluated CRL vs OCSP trade-offs under failure and latency conditions
  • Built certificate lifecycle and key rotation strategies for large-scale deployments; identified trust-chain and control-plane failure risks
  • Advised on resilience patterns and blast radius containment strategies across distributed systems
May 2024 – May 2025
DevOps Engineer
Capshall (UK)
  • Designed and operated AWS infrastructure with Terraform across multiple environments — resolved state drift, enforced IaC consistency
  • Built CI/CD pipelines (Jenkins, CodePipeline) improving deployment reliability and eliminating manual intervention
  • Managed Kubernetes workloads on EKS: resolved scheduling issues, resource contention, and networking misconfigurations
  • Deployed observability stack (Prometheus, Grafana, ELK) for real-time monitoring and incident debugging
  • Hardened IAM boundaries and improved reliability across production workloads
Oct 2022 – Apr 2024
Cloud Engineer
Cloudboosta (UK)
  • Provisioned multi-AZ AWS infrastructure with Terraform and CloudFormation, reducing manual provisioning overhead
  • Supported serverless and microservices systems across Lambda, API Gateway, and ECS
  • Designed secure VPC architectures, IAM policies, and network isolation strategies
  • Optimized cloud cost and performance through right-sizing and resource planning
Jun 2020 – Oct 2022
Systems Engineer
Urban Cloudworks
  • Managed Linux-based infrastructure and web hosting environments (Nginx, Apache)
  • Automated operational workflows using Bash and Python
  • Maintained system uptime and resolved production issues across client environments

About

I'm a DevOps and Platform Engineer based in Lagos, Nigeria, with 4+ years of production experience across AWS, Kubernetes, and distributed infrastructure. My work sits at the intersection of systems reliability, platform automation, and operational engineering.

I'm not a tool operator — I'm an engineer who understands why systems fail and how to build them so that when they do, the damage is contained and the recovery is fast. I've designed mTLS identity architectures, built serverless AI pipelines, operated EKS and bare-metal Kubernetes, and written the runbooks that others follow.

Currently building toward AI Platform Engineering — specifically the infrastructure layer for inference systems: GPU-aware scheduling, serving stacks (vLLM, Triton), and AI-specific observability. Also actively building TITAN, a bare-metal Kubernetes research testbed focused on distributed system behavior and fault tolerance semantics.

BSc Computer Science, Crawford University Nigeria (CGPA 4.3/5.0). Open to senior roles with relocation to Europe or Canada.

Ownership over process
I take end-to-end responsibility for systems I build. If something I deployed caused an incident, I'm the first call — not the last.
Depth over breadth
I'd rather understand five tools deeply than have surface familiarity with twenty. Real operational judgment comes from seeing the failure modes, not the happy path.
Production is the only ground truth
Staging environments lie. The real behavior of a distributed system only emerges under real load, real failures, and real operators making real decisions under pressure.
Documentation is engineering output
Runbooks, architecture decisions, postmortems — these are as much a part of the system as the code. A system that only one person understands is a liability.

If you're building serious infrastructure, let's talk.

Available for senior DevOps, Platform Engineering, and MLOps roles. Open to remote positions and relocation to Europe or Canada. Response time: usually within 24 hours.