Dodo Payments
Jan 14, 2026 · 14 min read

Why We Moved from Nginx to Istio: Dodo Payments’ Journey to Service Mesh

Ayush Agarwal
Co-founder & CPTO

It started with a Slack message at 11 PM on a Thursday.

“API is returning 429s for legitimate users. Big customer affected.”

Our rate limiter was doing exactly what we told it to do - and that was the problem.

Setting the Scene: Our Stack Before Istio

We’re Dodo Payments, a payment infrastructure company. Our platform processes transactions for businesses across multiple countries, handling everything from payment links to subscriptions to payouts.

Our architecture looked like what you’d expect from a modern Kubernetes-based platform:

Internet → Cloudflare → Nginx Ingress → Services → Databases

Cloudflare handled DDoS protection and SSL termination at the edge. Nginx Ingress Controller routed traffic into our Kubernetes cluster. Individual services handled business logic.

This setup served us well through our early growth. Nginx is battle-tested, well-documented, and has a massive community. We could configure routing rules, add basic rate limiting, and handle SSL termination without much friction.

But payment platforms have unique challenges. We’re not serving static content or running a typical SaaS application. Every request potentially moves money. The security requirements are strict. The abuse patterns are sophisticated. And the cost of getting it wrong is measured in real dollars - both ours and our customers’.

As we scaled, three problems kept getting worse.

Problem #1: The Rate Limiting Paradox

Payment APIs are magnets for abuse. Card testing attacks, credential stuffing, enumeration attempts - we see it all. Rate limiting isn’t optional; it’s a survival requirement.

With Nginx, we had one primary lever: rate limit by IP address.

limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

Simple enough. But here’s where it breaks down.
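For context, the zone directive above only defines a shared-memory bucket; it does nothing until it is applied in a `location` block with `limit_req`. A minimal sketch of the wiring (the paths, upstream name, and burst values here are illustrative, not our production config):

```nginx
# Define the bucket: keyed by client IP, 10 MB of state, 10 requests/sec.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 443 ssl;

    location /api/ {
        # Enforce the zone: absorb short bursts, reject the rest.
        limit_req zone=api_limit burst=20 nodelay;
        limit_req_status 429;   # what our users saw that Thursday night
        proxy_pass http://api_backend;
    }
}
```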

The Corporate Network Problem

One of our larger customers runs their integration from a corporate office. Fifty developers, all hitting our sandbox API, all sharing the same public IP. They’d hit our rate limits constantly during normal development.

We’d bump up the limits for their IP range. Then a month later, another customer would have the same problem.

The Distributed Attack Problem

Meanwhile, attackers were rotating through thousands of IPs using residential proxy networks. Our IP-based limits were useless against them. By the time we blocked one IP, they’d moved to the next.

The Legitimate Burst Problem

Some customers have legitimately spiky traffic patterns. A SaaS company running end-of-month billing might send thousands of requests in a short window. That’s not abuse - that’s their business model. But our rate limiter couldn’t tell the difference.

What we actually needed was rate limiting based on identity, not network topology:

  • Authenticated users should get their own quota
  • Different customer tiers should have different limits
  • Specific high-volume endpoints need separate treatment
  • Unknown/unauthenticated requests should be treated conservatively

Nginx can do this with Lua scripting and external data stores. We tried. The code became fragile, hard to test, and scary to modify. Every change to rate limiting logic felt like defusing a bomb.

Problem #2: Authentication Sprawl

Our authentication requirements evolved faster than our infrastructure could keep up.

We had:

  • API keys for server-to-server integrations
  • JWTs for our dashboard and admin interfaces
  • Webhook signatures for incoming payment provider callbacks
  • Internal service tokens for service-to-service communication

Each authentication method had different validation logic, different header extraction patterns, and different failure modes.

With Nginx, we were handling some auth at the ingress layer and some in application code. The split was inconsistent. Some services validated JWTs themselves; others relied on the ingress to do it. When we needed to change auth logic, we’d have to touch multiple places and hope we didn’t miss anything.

Worse, there was no single place to answer the question: “What authentication is required for this endpoint?”

The configuration was scattered across Nginx configs, application code, and middleware. Auditing was painful. Mistakes were easy to make and hard to catch.

Problem #3: The Observability Blindspot

When something went wrong, we struggled to answer basic questions:

  • What’s the p99 latency for authenticated vs. unauthenticated requests?
  • Which customer accounts are generating the most errors?
  • How does latency break down by geographic region?
  • What’s the request volume for this specific endpoint over the last hour?

Nginx gives you access logs. That’s a start. But turning access logs into actionable insights requires a pipeline: parsing, enrichment, aggregation, visualization.

We had Prometheus scraping Nginx metrics, but the metrics were too coarse. We knew total request counts, but we couldn’t slice by the dimensions that mattered to our business.

We wanted dashboards that showed request patterns by customer tier, error rates by endpoint category, latency distributions by authentication method. Building this on top of Nginx logs was possible but required significant custom tooling.

The Turning Point

The Thursday night incident forced a conversation we’d been avoiding.

A legitimate customer was getting rate-limited because they shared IP ranges with another customer who was running aggressive integration tests. Both customers were paying us. Neither was doing anything wrong. Our infrastructure couldn’t tell them apart.

We patched it by adding IP allowlists and customer-specific overrides. But we knew this was unsustainable. Every edge case added another conditional. The configuration was becoming a maintenance nightmare.

We needed a fundamentally different approach.

Enter Service Mesh

A service mesh is a dedicated infrastructure layer for handling service-to-service communication. Instead of each service implementing its own networking logic (retries, timeouts, circuit breakers, auth), you offload that responsibility to a proxy that runs alongside each service.

The dominant pattern uses sidecar proxies - each pod gets an additional container (typically Envoy) that intercepts all network traffic. These proxies are coordinated by a control plane that distributes configuration.

┌─────────────────────────────────────────┐
│                Control Plane            │
│     (Configuration, Certificates)       │
└─────────────────────┬───────────────────┘

     ┌────────────────┼────────────────┐
     ▼                ▼                ▼
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Service │     │ Service │     │ Service │
│    A    │────▶│    B    │────▶│    C    │
│ [Envoy] │     │ [Envoy] │     │ [Envoy] │
└─────────┘     └─────────┘     └─────────┘

The key insight: networking becomes programmable infrastructure, not application code.

We evaluated Istio, Linkerd, and Consul Connect. Istio won for us because:

  • Envoy as the data plane: Envoy is battle-tested, extensible, and has excellent rate limiting support
  • Rich traffic management: VirtualServices and DestinationRules gave us the routing flexibility we needed
  • Security model: Built-in mTLS, JWT validation, and authorization policies
  • Ecosystem: Strong Kubernetes integration and active community

Deep Dive: How Istio Solved Our Problems

Multi-Dimensional Rate Limiting

This was the killer feature for us.

Istio doesn’t handle rate limiting directly - it delegates to Envoy’s external rate limiting service. But this turns out to be incredibly powerful.

The architecture works like this:

Request ──▶ Envoy ──▶ Rate Limit Service ──▶ Redis
              ▲              │
              └──────────────┘
           Allow/Deny Decision

The magic is in descriptors. Instead of just “limit by IP,” you define a hierarchy of attributes to extract from each request:

actions:
- request_headers:
    header_name: "x-user-id"
    descriptor_key: "user_id"
- request_headers:
    header_name: "x-customer-tier"
    descriptor_key: "tier"
- remote_address: {}

This extracts user ID, customer tier, and IP address from each request. The rate limit service then applies rules based on these descriptors:

descriptors:
- key: user_id
  rate_limit:
    requests_per_unit: 100
    unit: second
- key: tier
  value: "enterprise"
  rate_limit:
    requests_per_unit: 500
    unit: second
- key: tier
  value: "starter"
  rate_limit:
    requests_per_unit: 50
    unit: second
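Both snippets above live in the rate limit service’s configuration file, keyed by a domain that the Envoy filter references. A hedged sketch of the assembled file (the domain name and the concrete limits for the IP fallback are illustrative):

```yaml
# config for the Envoy rate limit service (sketch; domain name is an assumption)
domain: api_ratelimit
descriptors:
  # Per-user bucket for any authenticated user
  - key: user_id
    rate_limit:
      requests_per_unit: 100
      unit: second
  # Tier-level buckets
  - key: tier
    value: "enterprise"
    rate_limit:
      requests_per_unit: 500
      unit: second
  - key: tier
    value: "starter"
    rate_limit:
      requests_per_unit: 50
      unit: second
  # Conservative per-IP fallback for unauthenticated traffic
  - key: remote_address
    rate_limit:
      requests_per_unit: 10
      unit: second
```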

Now our rate limiting actually makes sense:

  • Each authenticated user gets their own bucket
  • Enterprise customers get higher limits than starter customers
  • Unknown requests fall back to IP-based limiting with conservative defaults

The corporate network problem? Solved. Each developer has their own user ID; they don’t compete for quota.

The distributed attack problem? Much harder to abuse. Attackers would need valid credentials, not just rotating IPs.

The legitimate burst problem? We can assign customers to appropriate tiers based on their needs.

JWT Validation at the Edge

Istio can validate JWTs before requests reach your services. This seems simple but has profound implications.

jwtRules:
- issuer: "https://auth.example.com"
  jwksUri: "https://auth.example.com/.well-known/jwks.json"
  outputClaimToHeaders:
  - header: "x-user-id"
    claim: "sub"
  - header: "x-user-role"
    claim: "role"
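The jwtRules block above lives inside a RequestAuthentication resource. A minimal sketch, assuming the policy is applied at the ingress gateway (the resource name, namespace, and selector labels are illustrative):

```yaml
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: jwt-auth
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway   # validate at the edge, before app code
  jwtRules:
  - issuer: "https://auth.example.com"
    jwksUri: "https://auth.example.com/.well-known/jwks.json"
    outputClaimToHeaders:
    - header: "x-user-id"
      claim: "sub"
    - header: "x-user-role"
      claim: "role"
```

One subtlety: RequestAuthentication only rejects invalid tokens; requiring a token at all takes a companion AuthorizationPolicy with a requestPrincipals rule.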

When a request arrives with a valid JWT:

  • Envoy fetches the signing keys from the JWKS endpoint (cached)
  • Validates the token signature and expiration
  • Extracts claims into headers
  • Forwards the request with enriched headers

Your backend services receive x-user-id: user_123 and x-user-role: admin as headers they can trust implicitly. No JWT library needed. No signature verification code. The ingress layer guarantees these headers are authentic.

Invalid tokens get rejected at the edge - they never touch your application code.

Authorization Policies as Code

This is where security becomes auditable.

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: admin-access
spec:
  rules:
  - to:
    - operation:
        paths: ["/admin/*"]
    when:
    - key: request.headers[x-user-role]
      values: ["admin", "superadmin"]
  - to:
    - operation:
        methods: ["GET"]
        paths: ["/api/public/*"]

This policy says:

  • /admin/* paths require x-user-role header with value admin or superadmin
  • /api/public/* paths allow GET requests without restrictions

When someone asks “who can access the admin endpoints?”, you point them to this YAML file. It’s version controlled, reviewed in PRs, and enforced consistently across the cluster.

Compare this to the previous world: auth logic scattered across Nginx configs, application middleware, and service code. The AuthorizationPolicy resource gives you a single source of truth.

Observability That Actually Helps

Envoy proxies emit detailed telemetry by default:

  • Request metrics: Count, latency histograms, response codes - broken down by source, destination, and method
  • TCP metrics: Connection counts, bytes transferred, connection duration
  • Control plane metrics: Configuration sync status, certificate expiration

But here’s what made the difference for us: Envoy metrics include request headers as dimensions.

We tag requests with customer tier, authentication method, and business logic categories. Prometheus scrapes these metrics. Grafana dashboards slice by dimensions that matter:

  • “Show me p99 latency for enterprise customers hitting the payments endpoint”
  • “What’s the error rate for webhook callbacks vs. API requests?”
  • “How does traffic volume compare across customer tiers over the last week?”

These questions went from “build a custom log analysis pipeline” to “write a PromQL query.”
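For instance, the first question above is roughly one query against Istio’s standard request-duration histogram. A sketch (the service name is illustrative, and customer_tier assumes a custom dimension added through Istio’s telemetry configuration - it is not a default label):

```promql
# p99 latency for enterprise traffic hitting the payments service.
# customer_tier is a custom dimension (assumption), not a stock Istio label.
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{
    destination_service="payments.prod.svc.cluster.local",
    customer_tier="enterprise"
  }[5m])) by (le)
)
```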

Distributed tracing came almost for free. Envoy propagates trace headers automatically. Connect Jaeger or Zipkin, and you get end-to-end request traces across service boundaries.

The Implementation Journey

We didn’t flip a switch and migrate overnight. Here’s how we approached it.

Phase 1: Shadow Mode (2 weeks)

We deployed Istio alongside our existing Nginx ingress. Both received traffic, but Nginx remained the source of truth. Istio was in observation mode - we watched metrics, tested configurations, and learned the operational model.

Key learnings:

  • Envoy’s configuration model is different from Nginx. Things that were simple in Nginx (like regex-based routing) required different patterns in Istio.
  • Resource consumption was higher than expected. Envoy sidecars are lightweight individually, but they add up.
  • istioctl analyze became our best friend for catching configuration errors before deployment.

Phase 2: Non-Critical Paths (3 weeks)

We migrated internal tools and non-critical endpoints first. Our admin dashboard, internal APIs, development environments  -  traffic that wouldn’t cause customer impact if something went wrong.

This phase uncovered edge cases:

  • Some services expected specific header formats that Envoy normalized differently
  • Timeouts needed tuning - Envoy’s defaults were more aggressive than Nginx’s
  • Health check endpoints needed special handling to avoid authentication requirements
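For the health check case, one common pattern is an explicit allow rule so kubelet probes don’t need credentials. A sketch, not our exact policy (the paths and policy name are illustrative):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-health-checks
spec:
  rules:
  # No 'when' clause: probes carry no user identity, so match on path only
  - to:
    - operation:
        paths: ["/healthz", "/readyz"]
```

Keep in mind that once any ALLOW policy matches a workload, traffic matching no rule is denied, so carve-outs like this have to coexist with the broader rules.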

Phase 3: Production Traffic (4 weeks)

We migrated production traffic incrementally using traffic splitting. Start with 1%, watch metrics, increase to 10%, watch again, and so on.

route:
- destination:
    host: api-service
  weight: 90
- destination:
    host: api-service-istio
  weight: 10
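The weighted route above is a fragment of a VirtualService; the full resource looks roughly like this (the hostname and gateway name are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-split
spec:
  hosts:
  - "api.example.com"
  gateways:
  - api-gateway
  http:
  - route:
    - destination:
        host: api-service        # legacy path: 90% of traffic
      weight: 90
    - destination:
        host: api-service-istio  # new path: 10% of traffic
      weight: 10
```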

The gradual rollout caught issues that testing hadn’t:

  • Certain clients sent malformed headers that Nginx tolerated but Envoy rejected
  • Some webhook providers had unusual retry patterns that triggered rate limits
  • Geographic latency varied - Envoy added a few milliseconds that mattered for some use cases

Phase 4: Full Migration (2 weeks)

Once we hit 100% traffic through Istio, we kept Nginx running as a fallback for two more weeks. Then we decommissioned it.

Total timeline: about 11 weeks from first deployment to full migration.

What We Got Wrong

Over-Engineering Rate Limits

Our first rate limiting configuration was too complex. We had different limits for every combination of endpoint, customer tier, authentication method, and time of day. It was impossible to reason about.

We simplified aggressively. Three tiers (starter, growth, enterprise), a few endpoint categories (standard, high-volume, sensitive), and clear fallback behavior. Complexity went down, predictability went up.

Ignoring Failure Modes

Early on, we configured rate limiting to fail closed - if Redis was unavailable, reject all requests. This seemed like the safe choice.

Then Redis had a network blip during a deployment. For 30 seconds, we rejected every request to our API. Including payment confirmations. Not great.

Now we fail open for most endpoints (allow traffic if rate limiting is unavailable) and fail closed only for truly sensitive operations. The threat model matters: an attacker exploiting a brief Redis outage is less damaging than us blocking legitimate payments.
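In Envoy terms, this knob is failure_mode_deny on the HTTP rate limit filter. A sketch of the fail-open setting as it would appear inside an EnvoyFilter patch (the domain and cluster name are illustrative, and the surrounding resource is trimmed):

```yaml
# Trimmed EnvoyFilter patch; only the filter's typed_config is shown
patch:
  operation: INSERT_BEFORE
  value:
    name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: api_ratelimit
      # false = fail open: if the rate limit service is unreachable,
      # let the request through instead of rejecting it
      failure_mode_deny: false
      rate_limit_service:
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit_cluster
        transport_api_version: V3
```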

Underestimating the Learning Curve

Istio’s abstraction layers take time to internalize:

  • VirtualService: Routing rules for HTTP traffic
  • DestinationRule: Load balancing, connection pool settings, outlier detection
  • Gateway: Ingress/egress configuration
  • ServiceEntry: External service definitions
  • EnvoyFilter: Direct Envoy configuration patches
  • AuthorizationPolicy: Access control rules
  • RequestAuthentication: JWT validation rules

Each has its own semantics, its own edge cases, and its own failure modes. We made mistakes that would have been obvious with more experience.

Budget more time for learning than you think you need.

The Results

Six months post-migration, here’s where we stand:

  • Rate limiting that makes sense: Customer complaints about rate limits dropped by 80%. We’re blocking more abuse while affecting fewer legitimate users.
  • Security posture improved: All authentication happens at the edge. Authorization policies are auditable and version-controlled.
  • Observability transformed: We have dashboards that actually answer business questions. MTTR (mean time to resolution) for production incidents dropped significantly because we can pinpoint issues faster.
  • Developer experience improved: New services get rate limiting, authentication, and observability automatically. Teams don’t need to implement networking concerns; they inherit them from the platform.
  • Operational overhead increased: This is the tradeoff. Istio is more complex to operate than Nginx. We’ve invested in training, runbooks, and tooling. The control plane needs monitoring. Upgrades require careful planning.

Should You Make This Move?

Service meshes aren’t for everyone. Here’s how to think about it:

Istio makes sense if:

  • You need sophisticated traffic management (rate limiting by identity, traffic splitting, circuit breakers)

  • Security and compliance are critical to your business

  • You have multiple teams deploying services and want consistent networking policies

  • Observability gaps are causing real operational pain

  • You’re willing to invest in learning and operating additional infrastructure

Stick with Nginx if:

  • Your routing requirements are straightforward

  • IP-based rate limiting is sufficient

  • You have a small team and can’t absorb the operational overhead

  • Your services are relatively homogeneous and don’t need differentiated policies

The honest answer: if Nginx is working for you today and you don’t recognize the problems we described, you probably don’t need a service mesh yet. Add it when the pain is real, not when it seems like the “modern” choice.

For us, running a payment platform with strict security requirements, complex rate limiting needs, and high observability demands, Istio was the right call. The investment paid off.

Key Takeaways

  • Rate limiting by identity beats rate limiting by IP for any API where users authenticate. The implementation is more complex, but the behavior is dramatically better.
  • Authentication at the edge simplifies everything downstream. When your gateway guarantees header authenticity, backend services can trust without verifying.
  • Authorization policies as code makes security auditable. “Who can access what” should be answerable by reading a YAML file, not tracing code paths.
  • Fail open vs. fail closed is a critical design decision. Think through your threat model before deciding.
  • Gradual migration is essential. Shadow mode, non-critical paths, incremental traffic shifts - each phase catches different issues.
  • Operational overhead is real. Service meshes add complexity. Make sure the benefits justify the cost for your specific situation.

We’re building payment infrastructure at Dodo Payments. If distributed systems, fintech, and platform engineering sound interesting, we’re hiring.

Build with us

We're building the payments and billing platform for SaaS, AI, and digital products. Come help us ship.

View Open Positions