Dodo Payments
Jan 19, 2026 12 min read

How We Built Our Deployment Pipeline: GitOps, ArgoCD, and Kubernetes at Dodo Payments

Ayush Agarwal
Co-founder & CPTO

It started with a question from our security compliance team: “Can you show me exactly what changed in production last Tuesday at 3 PM?”

We couldn’t. Not with confidence, anyway.

We’re Dodo Payments, a payment and billing infrastructure company. We process real money for businesses across multiple countries. That means PCI DSS compliance isn’t optional  -  it’s existential. And PCI requires something we didn’t have: a complete, auditable trail of every change to our production environment.

But compliance was just the trigger. The underlying problems ran deeper.

Setting the Scene: Why We Needed to Change

When you’re building payment infrastructure, five things keep you up at night:

  • Scalability. Traffic doesn’t grow linearly. A merchant runs a flash sale, and suddenly you’re handling 10x normal volume. Your infrastructure needs to scale automatically, predictably, and without human intervention at 2 AM.
  • Reliability. Every failed request is a failed payment. A failed payment is lost revenue for your customer. Lost revenue means lost trust. In payments, reliability isn’t a feature  -  it’s the product.
  • Audit trails and compliance. PCI DSS, SOC 2, and various regional regulations all require the same thing: prove that you know exactly what’s running in production, who changed it, when, and why. “I think we deployed something last week” doesn’t cut it.
  • Infrastructure monitoring. You can’t fix what you can’t see. When latency spikes or error rates climb, you need to know immediately  -  and you need enough context to diagnose the problem fast.
  • Security and vulnerability management. Payment systems are high-value targets. Every dependency, every container image, every configuration choice is a potential attack surface. You need to know what’s running and whether it’s vulnerable.

Our early infrastructure handled none of this well.

The Problems We Were Facing

In the early days, our deployment process was… let’s call it “creative”:

  • Engineers had kubectl access to production
  • Deployments happened via CI pipelines that pushed directly to clusters
  • Configuration lived partly in Git, partly in engineers’ heads
  • Rollbacks meant “find the last working commit and pray”

We had the usual suspects: CI jobs, bash scripts, and a shared Slack channel where someone would post “deploying to prod” before running commands.

It worked. Until it didn’t.

  • Configuration drift was everywhere. Production rarely matched what was in Git. Someone would make a “quick fix” directly in the cluster, forget to commit it, and then overwrite it on the next deployment. When the compliance team asked what was running in production, we had to manually inspect the cluster - and hope nothing had changed since we last looked.
  • No proper audit trail. When something broke, we’d spend hours figuring out what changed. Was it a code change? A config change? A scaling event? Who approved it? PCI auditors don’t accept “we’re pretty sure it was fine.”
  • Scaling was manual and reactive. We’d notice high CPU usage on a dashboard and manually bump replica counts. By the time we reacted, customers had already experienced slowdowns.
  • Monitoring was fragmented. Logs went to one place, metrics to another, and traces to nowhere. Correlating an incident across services meant opening five different tools and hoping timestamps lined up.
  • Vulnerability tracking was nonexistent. We had no systematic way to know which container images were running, what versions of dependencies they contained, or whether any had known CVEs. Security reviews were manual and infrequent.
  • Deployments were scary. Engineers started batching changes together because deployments felt risky. That made deployments even riskier, because they contained more changes. Fear bred more fear.

We needed infrastructure that was auditable by default, scalable without intervention, observable end-to-end, and secure by design. We needed to rebuild our deployment story from the ground up.

Enter GitOps

The core idea behind GitOps is simple: Git is the single source of truth for your infrastructure. If you want to change something, you change it in Git. A system watches Git and makes reality match what’s declared there.

This is fundamentally different from traditional CI/CD where a pipeline pushes changes to your infrastructure. In GitOps, an operator running inside your cluster pulls the desired state and reconciles.

Why does this matter?

  • Audit trail for free. Every change is a Git commit. Want to know who changed what and when? git log. Want to roll back? git revert.
  • Drift detection built-in. The operator continuously compares actual state with desired state. If someone makes a manual change to the cluster, the operator reverts it. No more “kubectl apply and forget.”
  • Declarative everything. You describe what you want, not how to get there. The operator figures out the delta and applies it. For a deep dive on GitOps principles, the gitops.tech site is a great starting point.

Why ArgoCD

We evaluated three main options for our GitOps operator: Flux, ArgoCD, and running something custom.

ArgoCD won for us because of a few factors:

  • The UI is actually useful. ArgoCD has a web interface that shows you application state, sync status, resource trees, and logs. When something goes wrong, you can see exactly which resources are out of sync and why. This isn’t just nice-to-have  -  it dramatically reduces debugging time.
  • Multi-source applications. We needed to combine Helm charts from external repos with our own value overrides and additional manifests. ArgoCD’s multi-source feature handles this cleanly.
  • Mature and battle-tested. ArgoCD is a CNCF graduated project used by thousands of organizations. The community is active, the documentation is solid, and when we hit edge cases, someone has usually hit them before.
  • Sync policies that make sense. Automated sync with self-healing and pruning gives us the behavior we want: Git is always truth, and the cluster always converges to match.

Our Repository Structure

We landed on a dedicated infrastructure repository that contains all deployment configuration. Application code lives in separate repos; the infra repo only contains Kubernetes manifests and deployment configuration.

infra/
├── apps/
│   ├── root/                    # ArgoCD Application definitions
│   ├── dev/                     # Development environment
│   │   └── services/
│   │       ├── service-a/
│   │       ├── service-b/
│   │       └── ...
│   └── prod/                    # Production environment
│       └── services/
│           ├── service-a/
│           ├── service-b/
│           └── ...

Each application follows a consistent structure:

service-a/
├── namespace.yaml
├── deployment.yaml
├── service.yaml
├── ingress.yaml
├── serviceaccount.yaml
├── secrets-sync.yml
└── hpa.yaml (prod only)

This consistency matters. When someone needs to understand how a service is deployed, they know exactly where to look. When we need to add a new service, we copy an existing one and modify it.

Separating Dev and Production

We run two separate Kubernetes clusters: one for development, one for production. They share the same repository structure but have different configurations.

The differences are deliberate: dev runs fixed replica counts on cheaper nodes and can pull latest tags for services we actively develop, while prod gets HPAs, pinned image versions, and appropriately sized infrastructure.

Why separate clusters instead of namespaces?

We considered using namespaces within a single cluster, but decided against it for a few reasons:

  • Blast radius. A misconfiguration in dev shouldn’t be able to affect production. Separate clusters provide hard isolation.
  • Cost management. Dev cluster can use cheaper node types and scale down aggressively. Prod needs reliable, appropriately-sized infrastructure.
  • Testing infrastructure changes. We can test cluster-level changes (service mesh upgrades, node pool modifications) in dev before applying to prod.
  • Compliance. Payment processing has strict requirements. Separating environments makes audits cleaner.

How ArgoCD Syncs Our Applications

Each application is defined as an ArgoCD Application resource. Here’s the conceptual structure:

Application: service-a
├── Source: git repository
│   └── Path: apps/prod/services/service-a
├── Destination: production cluster
├── Sync Policy:
│   ├── Automated: true
│   ├── Self-heal: true (revert manual changes)
│   └── Prune: true (delete removed resources)
└── Ignore Differences:
    └── Secrets (managed externally)

The automated sync policy means ArgoCD applies changes as soon as they hit the main branch. No manual intervention needed. If someone modifies something directly in the cluster, self-heal reverts it within minutes.
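Rendered as an actual manifest, the Application above might look like the following sketch (the repo URL, names, and namespaces are placeholders, not our real values):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: service-a
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infra.git   # placeholder
    targetRevision: main
    path: apps/prod/services/service-a
  destination:
    server: https://kubernetes.default.svc
    namespace: service-a
  syncPolicy:
    automated:
      selfHeal: true   # revert manual changes made directly in the cluster
      prune: true      # delete resources removed from Git
  ignoreDifferences:
    - kind: Secret
      jsonPointers:
        - /data        # secret data is managed externally
```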

For complex applications like databases or third-party tools, we use ArgoCD’s multi-source feature:

Application: complex-app
├── Source 1: Helm chart from upstream repo
├── Source 2: Values file from infra repo
└── Source 3: Additional manifests (ingress, service mesh config)

This lets us use external Helm charts while maintaining our own customizations and additional resources.
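A multi-source Application of this shape might be sketched as follows (available in ArgoCD 2.6+; chart names, versions, and URLs are placeholders). The `ref: values` source lets the Helm source reference value files from the infra repo via `$values`:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: complex-app
  namespace: argocd
spec:
  project: default
  sources:
    - repoURL: https://charts.example.com            # upstream Helm repo (placeholder)
      chart: complex-app
      targetRevision: 4.2.0                          # placeholder version
      helm:
        valueFiles:
          - $values/apps/prod/services/complex-app/values.yaml
    - repoURL: https://github.com/example/infra.git  # placeholder
      targetRevision: main
      ref: values                                    # referenced as $values above
    - repoURL: https://github.com/example/infra.git  # placeholder
      targetRevision: main
      path: apps/prod/services/complex-app/extras    # ingress, mesh config
  destination:
    server: https://kubernetes.default.svc
    namespace: complex-app
```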

The Database Proxy Sidecar Pattern

Every service that needs database access runs with a database proxy sidecar. This is a pattern we adopted early and it’s served us well.

Pod
├── Application container (port 8080)
├── Database proxy sidecar (port 5432)
└── Service mesh sidecar (injected)

The database proxy handles:

  • Secure connections to the database without exposing it publicly
  • IAM-based authentication via Workload Identity
  • Connection pooling and keepalives

We use Workload Identity to authenticate the proxy. Each service has a Kubernetes ServiceAccount bound to a cloud IAM identity:
ServiceAccount: service-sa
└── Annotation: bound to cloud IAM service account

This means no service account keys floating around. The proxy authenticates using the pod’s identity, which the cloud provider manages automatically.
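As a concrete sketch, on GKE (one possible provider; the post doesn’t name one) the binding is a single annotation on the ServiceAccount. The annotation key is GKE-specific and the names are placeholders; other clouds use different mechanisms, such as IRSA annotations on EKS:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: service-sa
  namespace: service-a
  annotations:
    # GKE Workload Identity binding (hypothetical project and account names)
    iam.gke.io/gcp-service-account: service-sa@my-project.iam.gserviceaccount.com
```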

For CronJobs, we use the native sidecar pattern introduced in Kubernetes 1.28, where the database proxy runs as an init container with restartPolicy: Always. This ensures the proxy stays running for the duration of the job.
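A minimal sketch of that CronJob shape (Kubernetes 1.28+; image names and the schedule are placeholders): setting `restartPolicy: Always` on an init container marks it as a native sidecar, so it runs alongside the main container and is shut down once the job’s containers complete:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-job
spec:
  schedule: "0 2 * * *"                    # placeholder schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          initContainers:
            - name: db-proxy
              image: example/db-proxy:1.0  # placeholder
              restartPolicy: Always        # native sidecar: stays up for the job's duration
          containers:
            - name: job
              image: example/nightly-job:1.0  # placeholder
```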

Secret Management

Secrets are the hardest part of any deployment pipeline. You can’t store them in Git (obviously), but you need them in your pods.

We use an external secret management solution. Here’s how it works:

  • Secrets live in the secret manager, organized by project and environment
  • A Kubernetes operator runs in each cluster
  • Custom resources tell the operator which secrets to sync
  • The operator creates and updates Kubernetes Secrets automatically

The beauty of this approach:

  • Secrets are versioned and audited in the secret manager
  • Changes propagate automatically (no redeploy needed for secret rotation)
  • ArgoCD ignores secret data (via ignoreDifferences) so it doesn’t fight with the operator
  • Engineers never need to handle raw secrets
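The post deliberately doesn’t name the operator, but as a hypothetical illustration: with the open-source External Secrets Operator, the custom resource that tells the operator which secrets to sync could look like this (store names and secret paths are placeholders):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: service-a-secrets
  namespace: service-a
spec:
  refreshInterval: 1h                 # periodic re-sync so rotations propagate
  secretStoreRef:
    name: prod-secret-store           # placeholder
    kind: ClusterSecretStore
  target:
    name: service-a-secrets           # Kubernetes Secret the operator manages
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/service-a/database-url   # placeholder path in the secret manager
```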

Horizontal Pod Autoscaling

Production services run with HPAs that scale based on CPU and memory:

HPA: service-hpa
├── Min replicas: 2
├── Max replicas: 10
├── Target CPU: 70%
├── Target Memory: 70%
└── Scale-down stabilization: 150 seconds

The stabilization window prevents flapping during traffic spikes. We’d rather have slightly over-provisioned capacity than rapid scale-up/scale-down cycles.

Dev environments don’t use HPAs  -  we run fixed replica counts to keep costs predictable.
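The production HPA sketched as an `autoscaling/v2` manifest (service names are placeholders); the stabilization window lives under the `behavior` block:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-a                    # placeholder
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 150  # prevent flapping during traffic spikes
```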

Graceful Shutdown Handling

Payment services can’t just die. A pod termination might interrupt an in-flight transaction, which is unacceptable.

Every deployment includes graceful shutdown configuration:

Lifecycle:
  preStop: sleep 5 seconds
TerminationGracePeriod: 60 seconds

The preStop sleep gives the service mesh sidecar time to drain connections. The 60-second grace period gives the application time to complete in-flight requests before forced termination.
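Inside a Deployment’s pod template, those settings can be sketched like this (the image name is a placeholder):

```yaml
spec:
  terminationGracePeriodSeconds: 60    # time to complete in-flight requests
  containers:
    - name: app
      image: example/service-a:1.2.3   # placeholder
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]    # let the mesh sidecar drain connections first
```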

The Deployment Flow

Here’s what happens when an engineer wants to deploy a change:

  1. Code change: Engineer pushes to the application repo
  2. CI builds: CI pipeline builds a new Docker image, tags it, pushes to container registry
  3. Manifest update: Engineer (or automation) updates the image tag in the infra repo
  4. ArgoCD detects: Within 3 minutes, ArgoCD notices the Git change
  5. Sync: ArgoCD applies the new manifests to the cluster
  6. Rolling update: Kubernetes performs a rolling deployment
  7. Health check: ArgoCD verifies the new pods are healthy

For dev, step 3 is often automated - we update to latest and rely on the Always pull policy. For prod, we use explicit version tags and require a PR to change them.

Rollbacks are just as simple: revert the commit in Git, and ArgoCD syncs back to the previous state.

What We Got Wrong

Not Locking Image Tags Sooner

For months, our dev environment ran latest tags across the board. This worked fine until it didn’t  -  an upstream dependency pushed a breaking change, and suddenly dev was broken in a way that prod wasn’t.

Now we version-pin everything, even in dev. The latest tag is only used for services we control and actively develop. Third-party images always use specific versions.

Underestimating Resource Requests

Kubernetes scheduling is only as good as your resource requests. We initially set conservative requests to maximize node utilization, which led to noisy neighbor problems and unpredictable performance.

After several incidents, we switched to more generous resource requests, especially for memory. Better to have predictable performance with some waste than to hit OOM kills during traffic spikes.

The Results

Six months into our GitOps journey:

  • Deployment frequency increased 4x. Engineers deploy multiple times per day because it feels safe.
  • Rollback time dropped from hours to minutes. git revert + wait for sync.
  • Zero “what changed?” incidents. Every production change has a Git commit with context.
  • Onboarding time halved. New engineers understand the deployment process by reading the infra repo.
  • Audit compliance simplified. Auditors love Git logs.

The operational overhead is real - ArgoCD needs monitoring, the infra repo needs maintenance, and there’s a learning curve. But for our scale and requirements, it’s absolutely worth it.

Should You Adopt GitOps?

GitOps makes sense if:

  • You’re running multiple services in Kubernetes

  • You have a team where multiple people deploy

  • Compliance and audit trails matter

  • You want to reduce deployment anxiety

  • You’re tired of “it works in staging” surprises

GitOps might be overkill if:

  • You have a single monolith with infrequent deployments

  • You’re a solo developer who can keep state in your head

  • You’re not yet on Kubernetes

  • Your team is very small and deployment coordination is easy

The honest answer: the investment in GitOps pays off at a certain scale. Below that scale, simpler solutions work fine. For us, running a payment platform with strict requirements around security, auditability, and reliability - GitOps isn’t optional. It’s infrastructure.

Key Takeaways

  • Git as the source of truth eliminates configuration drift and provides automatic audit trails. No more “what changed?” investigations.
  • Separate clusters for dev and prod provide hard isolation that namespaces can’t match. The extra cost is worth the peace of mind.
  • Consistent directory structure across services reduces cognitive load. When everything looks the same, troubleshooting is faster.
  • Workload Identity beats service account keys. Let the cloud provider handle authentication instead of managing credentials yourself.
  • External secret management keeps secrets out of Git while maintaining automation. Don’t roll your own solution here.
  • Graceful shutdown configuration is non-negotiable for payment services. Plan for pod termination from day one.
  • Version-pin your images. Even in dev. latest will burn you eventually.

Build with us

We're building the payments and billing platform for SaaS, AI, and digital products. Come help us ship.

View Open Positions
