Senior DevOps & SRE · Available

Infrastructure that doesn't break.

Senior DevOps & SRE for Series A–C startups.
End-to-end infrastructure ownership — no agencies, no junior handoffs.


Series: A–C

Clouds: AWS · GCP · Azure

deployment.log

✓ build 2m 12s
✓ tests (412/412) 3m 48s
✓ security scan 0m 44s
↑ rolling update 4/5
3/3 pods ready
p99 142ms
uptime 99.98%

Services

What I Build

Four disciplines. One point of contact.

01

CI/CD & Automation

Slow, brittle pipelines are a tax on every engineer on your team. I design build systems that fail fast, cache aggressively, and ship reliably — from trunk-based development to full GitOps rollout pipelines with automated rollback. Whether you are running a monorepo or dozens of microservices, I will make your pipeline something engineers trust instead of fear.

GitHub Actions · GitLab CI · ArgoCD · Terraform · BuildKit · Kaniko · Nexus · SonarQube


Teams go from 40-minute, flaky pipelines to sub-5-minute builds with zero manual intervention on deployment day.

02

Kubernetes & Container Orchestration

Running containers on a single machine is easy — running them reliably at scale with zero downtime is not. I set up production-grade Kubernetes clusters from scratch, harden them for multi-tenant workloads, and design Helm chart libraries that make deploying a new service a 10-minute job for any engineer. I also untangle existing clusters that have grown organically into operational nightmares.

Kubernetes · Helm · ArgoCD · Kustomize · Istio · cert-manager · KEDA · Velero


Zero-downtime deployments become the default, with resource costs cut by 30–50% through right-sizing and autoscaling.

03

Cloud Infrastructure

Infrastructure that was provisioned by hand accumulates invisible risk — nobody knows what exists, what it costs, or whether it is secure. I replace that with fully codified IaC across AWS, GCP, or Azure: versioned, reviewed, and deployed the same way as application code. I also run cost optimization reviews that routinely surface five-figure annual savings without sacrificing reliability.

Terraform · Pulumi · AWS · Google Cloud · Azure · Terragrunt · AWS Control Tower · Open Policy Agent


Cloud costs drop 20–40% within the first quarter, and every resource is auditable, reproducible, and compliant.

04

Platform Engineering

When infrastructure knowledge is siloed in one person, scaling the team means scaling the bottleneck. I build internal developer platforms that give engineers self-service access to infrastructure — production-like environments on demand, standardized service templates, and integrated observability from day one. The result is a platform where the right way is also the easy way.

Backstage · Crossplane · Prometheus · Grafana · OpenTelemetry · Loki · Terraform · GitHub Actions


Onboarding a new service drops from a week of back-and-forth tickets to a self-service workflow completed in under an hour.

Engagement

How I Work

Project

A scoped engagement with a defined start, finish, and deliverable. We agree on the outcome up front — whether that is a production-ready Kubernetes platform, a rebuilt CI/CD pipeline, or a cloud infrastructure migration — and I work full-time until it ships. You get focused execution without the distraction of competing priorities, and a handover that includes documentation and knowledge transfer so your team can own it afterwards.

Duration: 4–12 weeks

Commitment: Full-time


Migration projects, platform builds, CI/CD overhauls, or any initiative with a clear before-and-after state.

Retainer

Ongoing senior infrastructure capacity on a part-time basis. I embed with your team 2–3 days per week — attending planning, reviewing PRs, responding to incidents, and working through the infrastructure backlog alongside your engineers. This model works well when you need someone who knows your systems deeply over time, not a consultant who needs to re-learn your stack every quarter.

Duration: 3–12 months

Commitment: 2–3 days per week


Teams that need senior infra capacity on a sustained basis but are not ready to hire a full-time senior engineer.

Advisory

A lightweight engagement focused on architecture review, incident retrospectives, and technical direction. I review your infrastructure, read your incident reports, and join a weekly or bi-weekly call to give a senior perspective on what your team is building and where the risks are. This model is async-first — most of my input comes through written reviews and comments rather than synchronous meetings.

Duration: Ongoing

Commitment: 1 day per week or less


Teams with junior or mid-level infra engineers who need a senior lens on architecture decisions, incident reviews, or engineering hiring.

Case Studies

Proven Results

Real problems. Real solutions. Measurable outcomes.

Series B Fintech · 35 engineers

CI Pipeline Overhaul: 47 Minutes to Under 4

The Problem

A Series B fintech startup was running a monorepo with a single GitHub Actions workflow that tested and built every service on every commit. The pipeline took 47 minutes on average, used no caching, and ran all test suites sequentially. Engineers were pushing changes and going for lunch before getting feedback. The slow feedback loop was causing developers to batch unrelated changes into single PRs to avoid multiple long waits, making code review harder and rollbacks more dangerous.

What I Did

I audited the workflow and identified three root causes: no layer caching for Docker builds, no test parallelization, and unnecessary re-runs of tests for services unaffected by a given change. I introduced affected-service detection using git diff against the merge base, so only the services touched by a PR were built and tested. I enabled BuildKit layer caching backed by GitHub Actions cache, cutting Docker build times from 12 minutes to under 90 seconds per service. The test suite was split into parallel matrix jobs. I also separated the lint, unit test, and integration test stages so engineers got lint failures in under 60 seconds rather than waiting 47 minutes.
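
The affected-service detection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the `services/<name>/` directory layout, the `shared/` prefix that forces a full rebuild, and the function name are all hypothetical. In CI, the list of changed paths would typically come from something like `git diff --name-only "$(git merge-base origin/main HEAD)" HEAD`, where `origin/main` is also an assumed base branch.

```python
from pathlib import PurePosixPath

# Hypothetical layout: each service lives in services/<name>/,
# and anything under shared/ is assumed to affect every service.
SERVICES_ROOT = "services"
SHARED_PREFIXES = ("shared/",)


def affected_services(changed_paths, all_services):
    """Return the sorted list of services touched by a set of changed files."""
    affected = set()
    for path in changed_paths:
        if path.startswith(SHARED_PREFIXES):
            # A change to shared code invalidates every service.
            return sorted(all_services)
        parts = PurePosixPath(path).parts
        if len(parts) >= 2 and parts[0] == SERVICES_ROOT and parts[1] in all_services:
            affected.add(parts[1])
    return sorted(affected)


if __name__ == "__main__":
    changed = [
        "services/payments/app.py",
        "services/payments/tests/test_app.py",
        "README.md",
    ]
    # Only the payments service needs to be built and tested for this change.
    print(affected_services(changed, {"payments", "ledger", "auth"}))
```

The resulting list can then drive a job matrix, so that a documentation-only change builds nothing and a change to shared code builds everything.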

Pipeline wall time dropped from 47 minutes to 3 minutes 40 seconds for the median PR. Full end-to-end pipeline for a service touching all integration layers runs in under 8 minutes. Engineers pushed 22% more PRs in the month after the change, and average PR size dropped, which the engineering manager attributed directly to the faster feedback loop.

GitHub Actions · Docker · BuildKit · Monorepo · CI Optimization · Test Parallelization

Series A SaaS · B2B · 18 engineers

EC2 to Kubernetes: 40% Cost Reduction and Zero-Downtime Deploys

The Problem

A B2B SaaS company was running 14 microservices on manually provisioned EC2 instances, managed through a mix of Ansible playbooks and hand-edited security groups. Deployments required a maintenance window because the process involved stopping the old instance, replacing it, and restarting. The team had experienced two incidents in the past quarter caused by configuration drift between environments — production had manual changes that staging did not. Monthly AWS costs were EUR 11,400, with most instances overprovisioned at 8–16 vCPUs despite average CPU usage under 10%.

What I Did

I planned and executed a three-month migration to EKS. Phase one was containerization: I wrote production-grade Dockerfiles for each service, introduced multi-stage builds, and ran the containers locally to surface any environment-dependency issues before touching production. Phase two was the Kubernetes layer: I set up an EKS cluster using Terraform, defined a standardized Helm chart library shared across all services, and configured Horizontal Pod Autoscaler for the three latency-sensitive services. Phase three was the cutover: I ran both environments in parallel for two weeks, using weighted routing in Route 53 to shift 10% of traffic to EKS initially, monitoring error rates and p99 latency before shifting the remainder. Deployments moved to a rolling update strategy managed by ArgoCD, with automatic rollback triggered by Prometheus alerts on error rate.
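
The traffic-shifting decision during the cutover can be pictured with a small sketch. Only the initial 10% step comes from the description above; the later steps, the thresholds, and all names here are hypothetical, and the code makes no calls to Route 53 or Prometheus. It only shows the gating logic: advance the weight when error rate and p99 latency look healthy, otherwise send all traffic back to the old environment.

```python
from dataclasses import dataclass

# Hypothetical thresholds and step schedule; only the initial 10% step
# is taken from the case study text.
MAX_ERROR_RATE = 0.01             # 1% of requests
MAX_P99_MS = 300.0
WEIGHT_STEPS = [10, 25, 50, 100]  # percent of traffic sent to the new cluster


@dataclass
class CanaryMetrics:
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    p99_ms: float      # p99 latency in milliseconds


def next_weight(current_weight: int, metrics: CanaryMetrics) -> int:
    """Return the next traffic weight: advance if healthy, else roll back to 0."""
    healthy = metrics.error_rate <= MAX_ERROR_RATE and metrics.p99_ms <= MAX_P99_MS
    if not healthy:
        return 0
    for step in WEIGHT_STEPS:
        if step > current_weight:
            return step
    return current_weight  # already at 100%


if __name__ == "__main__":
    print(next_weight(10, CanaryMetrics(error_rate=0.002, p99_ms=142.0)))
    print(next_weight(10, CanaryMetrics(error_rate=0.050, p99_ms=142.0)))
```

Keeping the decision in one pure function makes the rollout policy easy to review and test separately from whatever applies the weights.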

Monthly AWS infrastructure costs dropped from EUR 11,400 to EUR 6,700 — a 41% reduction — through right-sizing pods and using Spot instances for stateless workloads. Zero-downtime deployments became the default; the team has not had a deployment-related incident since the migration. The configuration drift issues were eliminated because all environment configuration is now defined in Helm values files and versioned in Git.
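
For reference, the 41% figure follows directly from the two monthly totals quoted above. The helper name is illustrative only.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


monthly_before_eur = 11_400
monthly_after_eur = 6_700

print(f"Monthly saving: EUR {monthly_before_eur - monthly_after_eur:,}")              # EUR 4,700
print(f"Reduction: {percent_reduction(monthly_before_eur, monthly_after_eur):.1f}%")  # 41.2%
```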

Kubernetes · EKS · Terraform · Helm · ArgoCD · AWS · Cost Optimization · Zero-Downtime Deploys

Growth-Stage E-Commerce · 25 engineers

From Zero Visibility to Full-Stack Observability in 6 Weeks

The Problem

A growth-stage e-commerce company had no centralized logging, no distributed tracing, and alerts that consisted of a single CloudWatch CPU alarm that fired too late to be useful. When incidents occurred, engineers SSH-ed into individual containers to read logs. During a Black Friday preparation review, it became clear that the team had no way to identify which service was causing latency spikes in their checkout flow — they only found out something was wrong when customers complained or conversion rates dropped in their analytics dashboard 30 minutes later.

What I Did

I implemented a full observability stack using open-source tooling to avoid vendor lock-in. For metrics, I deployed Prometheus with service discovery configured to automatically scrape all Kubernetes workloads, and built Grafana dashboards covering the four golden signals for each service. For logs, I deployed the Grafana Loki stack with Promtail as the log collector, which replaced the team's previous habit of SSH-ing into pods. For distributed tracing, I instrumented the four core backend services with OpenTelemetry SDKs and shipped traces to Tempo. The final piece was alerting: I defined alert rules in Prometheus for error rate, p99 latency, and queue depth, routing them through Alertmanager to a dedicated Slack channel with clear runbook links in each alert.
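
To make the alerting step concrete, here is a rough sketch of what an error-rate rule and a p99 latency rule check over one evaluation window. In the real setup these would be PromQL expressions evaluated by Prometheus; this Python version only illustrates the logic. The thresholds and alert names are hypothetical, and the queue-depth rule is left out.

```python
import math

# Hypothetical thresholds; the real values are not given in the text.
ERROR_RATE_THRESHOLD = 0.05  # alert above 5% failed requests
P99_THRESHOLD_MS = 500.0     # alert above 500 ms


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(len(ordered) * 99 / 100)  # 1-based rank
    return ordered[rank - 1]


def firing_alerts(latencies_ms, error_count):
    """Return the names of alerts that would fire for one evaluation window."""
    alerts = []
    total = len(latencies_ms)
    if total == 0:
        return alerts  # no traffic in the window, nothing to evaluate
    if error_count / total > ERROR_RATE_THRESHOLD:
        alerts.append("HighErrorRate")
    if p99(latencies_ms) > P99_THRESHOLD_MS:
        alerts.append("HighP99Latency")
    return alerts


if __name__ == "__main__":
    window = [100.0] * 98 + [900.0] * 2  # two slow requests out of 100
    print(firing_alerts(window, error_count=1))
```

Alerting on the tail rather than the mean is what lets a small number of slow checkout requests surface before they show up in averages.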

Mean time to detect (MTTD) dropped from approximately 25 minutes (customer-reported) to under 90 seconds for the alert cases covered by the new rules. During the next peak traffic event, the team identified and resolved a database connection pool exhaustion issue in 12 minutes — an issue they estimate would have caused a 2–3 hour outage under the previous setup. The engineering team now resolves P1 incidents without any SSH access to production containers.

Prometheus · Grafana · Loki · OpenTelemetry · Tempo · Kubernetes · Observability · Alertmanager

Dominic Cardellino, Senior DevOps & SRE Engineer

About

Who I Am

I've spent 8+ years building and operating production infrastructure across startups at every stage — from seed-stage teams of 5 to Series C companies with 200 engineers. The same pattern repeats: product grows, infrastructure becomes an afterthought, and eventually the shortcuts compound into something that stops the team from shipping. I've been the person called in to clean that up, and I've been the person who built systems that quietly held up under pressure for years without incident.

I went freelance because the direct path works better. Working with one company at a time, as a senior engineer with full ownership rather than a consultant managing a relationship, gets better outcomes for everyone. There's no account manager, no junior doing the actual work, no PowerPoint strategy deck. There's me, your codebase, and a defined problem to solve.

What I care about is building infrastructure that the next engineer can understand — systems that are documented, opinionated for good reasons, and designed to hold up at 3am when something goes wrong. If you're at the point where infrastructure is slowing your team down or keeping you up at night, that's exactly the kind of problem I'm here to fix.

Stack