Dodo Payments
Jan 21, 2026 · 16 min read

Building Webhooks That Never Fail: Our Journey to 99.99%+ Delivery Reliability

Ayush Agarwal
Co-founder & CPTO

At Dodo Payments, webhooks are how our merchants know when money moves. A customer pays for a subscription, we fire a payment.succeeded webhook, and the merchant’s system provisions access. Simple enough - until you realize that a missed webhook means a customer paid but got nothing in return.

We were losing about 0.3% of our webhooks. That doesn’t sound like much until you do the math: at 100,000 payments per day, that’s 300 customers with failed experiences. 300 support tickets. 300 reasons for merchants to question our reliability.

This is the story of how we rebuilt our webhook infrastructure from the ground up, achieving 99.99%+ delivery reliability with sub-500ms latency from event to delivery. We used PostgreSQL triggers, Sequin for change data capture, Kafka for durable messaging, Restate for durable execution, and Svix for last-mile delivery.

Why Traditional Webhook Delivery Fails

Most webhook implementations follow a deceptively simple pattern:

Payment succeeds → Application fires HTTP request → Hope merchant receives it

The problem isn’t the happy path; it’s everything else. Deployments that kill retry loops mid-flight. OOM events that wipe in-memory queues. Network partitions that time out requests while the merchant’s server is still processing.

The core issue: in-process webhook delivery has no durability guarantees.

Your application holds the webhook in memory. If that process dies, for any reason, the webhook dies with it. No log entry. No retry. No trace it ever existed. The payment succeeded, the database committed, but the notification vanished.

This isn’t an edge case. This is the normal operating condition of any system that deploys regularly, scales dynamically, or occasionally runs out of memory.

The Failure Mode That Haunted Us

Here’s the exact scenario that was happening dozens of times per day:

Timeline of a Lost Webhook
══════════════════════════════════════════════════════════════════════════
14:32:07.000  Payment processor confirms charge


14:32:07.050  API updates payment status in PostgreSQL ──► COMMITTED ✓


14:32:07.100  Async task spawned to fire webhook


14:32:08.000  HTTP request starts to merchant
      │       (merchant server is slow...)

      ⋮       ← Request in flight, waiting for response

14:32:15.000  Kubernetes initiates deployment rollout


14:32:15.001  Pod receives SIGTERM
      │       Grace period: 10 seconds

      ⋮       ← Webhook task still running...

14:32:25.000  Pod killed (SIGKILL)

      ╳       ← Webhook task terminated mid-request

              No error logged.
              No retry scheduled.
              Merchant never receives webhook.
              Payment shows "succeeded" in database.

The failure was invisible by design: there was simply nothing to observe.

Why Retries Aren’t Enough

“But we have retries!” Yes, so did we. Here’s why they don’t solve the problem:

Problem 1: Retries Die With The Process
═══════════════════════════════════════════════════════════════════

  Process Memory                          
  ┌─────────────────────────────────┐     
  │  Retry Queue (in-memory)        │     
  │  ┌───────────────────────────┐  │     
  │  │ webhook_1: retry at 14:35 │  │     
  │  │ webhook_2: retry at 14:40 │  │◄──── All retry state lives here
  │  │ webhook_3: retry at 14:45 │  │     
  │  └───────────────────────────┘  │     
  └─────────────────────────────────┘     

                 │  Process crashes / restarts

  ┌─────────────────────────────────┐     
  │  Retry Queue (in-memory)        │     
  │  ┌───────────────────────────┐  │     
  │  │         (empty)           │  │◄──── All retries lost
  │  └───────────────────────────┘  │     
  └─────────────────────────────────┘     

Problem 2: Retries Break Ordering
═══════════════════════════════════════════════════════════════════

Event Order (actual):     A ──► B ──► C

What happens:

    Event A: attempt 1 fails    ───────────► retry scheduled
    Event B: attempt 1 succeeds ───────────► delivered
    Event C: attempt 1 succeeds ───────────► delivered  
    Event A: attempt 2 succeeds ───────────► delivered (late!)
  
Merchant receives:        B ──► C ──► A   ← Wrong order!

Retries handle transient network failures. They don’t handle process failures, and that’s where the real data loss happens.
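The ordering problem (Problem 2 above) is easy to reproduce in a few lines. This is a toy simulation, with the queue and failure pattern invented for illustration, not our production code:

```python
from collections import deque

def deliver_with_retries(events, fails_first_attempt):
    """Toy delivery loop: failed attempts are re-enqueued at the back."""
    queue = deque((e, 1) for e in events)
    delivered = []
    while queue:
        event, attempt = queue.popleft()
        if event in fails_first_attempt and attempt == 1:
            # Naive retry: re-enqueue behind everything already waiting
            queue.append((event, attempt + 1))
        else:
            delivered.append(event)
    return delivered

# Event A fails once; B and C succeed on their first attempt
print(deliver_with_retries(["A", "B", "C"], fails_first_attempt={"A"}))
# The merchant observes B, C, A -- not the original order
```

One slow or flaky delivery is enough to reorder everything behind it.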

The Mental Shift: Webhooks as Derived Data

The breakthrough came when we stopped thinking about webhooks as “things we send” and started thinking about them as “derived data.”

Old Mental Model (Flawed)
═══════════════════════════════════════════════════════════════════

Payment Code                              Webhook
  ┌──────────────────┐                    ┌──────────────┐
  │ update_payment() │ ──── fires ────►   │ HTTP POST    │
  │                  │                    │ to merchant  │
  └──────────────────┘                    └──────────────┘
         │                                       │
         ▼                                       ▼
    Durable (DB)                          Ephemeral (memory)

  Problem: Webhook existence depends on code execution

New Mental Model (Reliable)
═══════════════════════════════════════════════════════════════════

  Payment Code              Domain Event              Webhook
  ┌──────────────────┐     ┌──────────────┐     ┌──────────────┐
  │ update_payment() │ ──► │ domain_events│ ──► │ HTTP POST    │
  │                  │     │ table (DB)   │     │ to merchant  │
  └──────────────────┘     └──────────────┘     └──────────────┘
         │                        │                    │
         ▼                        ▼                    ▼
    Durable (DB)             Durable (DB)         Durable (Restate)

Key: Every step is backed by durable storage

When a payment succeeds, that fact is durably stored in PostgreSQL. The webhook is just a notification about that durable fact. So why was its existence dependent on the ephemeral state of an application process?

The answer: capture events durably first, then process them with a system designed for durable execution.

This insight led us to a five-layer architecture where each component solves a specific reliability problem without sacrificing speed. Our goal was ambitious: near-real-time delivery (sub-500ms) with 99.99%+ reliability.

The Architecture

┌────────────────────────────────────────────────────────────────┐
│                      POSTGRESQL                                │
│                                                                │
│   payments table                    domain_events table        │
│   ┌─────────────┐                  ┌─────────────────────┐     │
│   │ status:     │   DB TRIGGER     │ id: uuid            │     │
│   │ 'succeeded' │ ───────────────► │ event_type:         │     │
│   │             │   (same tx)      │   'payment.succeeded│     │
│   └─────────────┘                  │ object_id: pay_123  │     │
│                                    └─────────────────────┘     │
└────────────────────────────────────────────────────────────────┘

                              PostgreSQL WAL (durable, ordered)


                                    ┌─────────────────┐
                                    │     Sequin      │
                                    │   (CDC tool)    │
                                    │                 │
                                    │ Reads WAL,      │
                                    │ pushes to Kafka │
                                    └─────────────────┘


                                    ┌─────────────────┐
                                    │     Kafka       │
                                    │                 │
                                    │ Durable buffer  │
                                    │ with replay     │
                                    └─────────────────┘


                                    ┌─────────────────┐
                                    │    Restate      │
                                    │                 │
                                    │ Durable         │
                                    │ execution       │
                                    └─────────────────┘


                                    ┌─────────────────┐
                                    │  Data Service   │
                                    │                 │
                                    │ Payload         │
                                    │ enrichment      │
                                    └─────────────────┘


                                    ┌─────────────────┐
                                    │      Svix       │
                                    │                 │
                                    │ Last-mile       │
                                    │ delivery        │
                                    └─────────────────┘


                                    ┌─────────────────┐
                                    │    Merchant     │
                                    │    Endpoint     │
                                    └─────────────────┘

Each layer solves a specific problem. Let’s dive into each one.

Layer 1: Atomic Event Capture with PostgreSQL Triggers

The first problem: ensuring an event exists when something happens. Our old system created events in application code, meaning the event’s existence was tied to that code executing successfully.

The fix: PostgreSQL triggers.

Atomic Event Capture
═══════════════════════════════════════════════════════════════════

Single Database Transaction
  ┌─────────────────────────────────────────────────────────────┐
  │                                                             │
  │   UPDATE payments                  INSERT domain_events     │
  │   SET status = 'succeeded'         (id, event_type,         │
  │   WHERE id = 'pay_123'             object_id, ...)          │
  │         │                                 ▲                 │
  │         │          DB TRIGGER             │                 │
  │         └─────────────────────────────────┘                 │
  │                                                             │
  └─────────────────────────────────────────────────────────────┘


              ┌─────────────────────────┐
              │   COMMIT or ROLLBACK    │
              │                         │
              │  Both succeed together  │
              │  or both fail together  │
              └─────────────────────────┘
CREATE TRIGGER on_payment_status_change
AFTER UPDATE ON payments
FOR EACH ROW
WHEN (OLD.status IS DISTINCT FROM NEW.status 
  AND NEW.status IN ('succeeded', 'failed', 'cancelled', 'processing'))
EXECUTE FUNCTION create_domain_event();
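For completeness, here is a minimal sketch of what a `create_domain_event()` function could look like. The post doesn’t show the actual function body, so the `business_id` column on `payments` and the `'payment.' || NEW.status` naming are assumptions:

```sql
-- Sketch only: the real function body isn't shown in this post
CREATE FUNCTION create_domain_event() RETURNS trigger AS $$
BEGIN
  INSERT INTO domain_events (id, business_id, event_type, object_id)
  VALUES (
    gen_random_uuid(),            -- assumes PostgreSQL 13+
    NEW.business_id,              -- assumes payments carries business_id
    'payment.' || NEW.status,     -- e.g. 'payment.succeeded'
    NEW.id
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```

Because the function runs inside the same transaction as the UPDATE, the INSERT commits or rolls back with it.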

When a payment status changes, the trigger fires within the same transaction. If the transaction commits, both the payment update and the domain event are persisted atomically. If it rolls back, neither exists.

The critical guarantee: if the payment succeeded, the event exists. No race condition. No “payment committed but event creation failed.” The database enforces the invariant.

Our domain_events table is intentionally minimal:

CREATE TABLE domain_events (
  id           UUID PRIMARY KEY,
  business_id  UUID NOT NULL,
  event_type   TEXT NOT NULL,  -- e.g., "payment.succeeded"
  object_id    TEXT NOT NULL,  -- the payment/subscription/refund ID
  created_at   TIMESTAMPTZ DEFAULT NOW()
);

We store what happened and to what, not the full payload. The payload gets fetched later at delivery time. This keeps the table small and the triggers fast.

Layer 2: Change Data Capture with Sequin and Kafka

Events exist in the database. Now we need to get them out reliably.

The naive approach is polling: query domain_events every few seconds, process new rows, mark them handled. But polling has problems:

Polling vs CDC: Why CDC Wins
═══════════════════════════════════════════════════════════════════

POLLING (Traditional)
─────────────────────
  Worker          Database              
    │                │                  
    │── SELECT ────► │    Poll every 5s  
    │◄─── rows ───── │                   
    │                │                  
    │    (5 sec)     │    ← Latency!    
    │                │                  
    │── SELECT ────► │                   
    │◄─── rows ───── │                   
    │                │                  
  Problems:
  • 5 second latency minimum
  • Wasted queries when no new events
  • Complex "processed" flag management
  • Race conditions with multiple workers

CDC via WAL (Our Approach)
──────────────────────────
  PostgreSQL        Sequin          Kafka           
    │                 │               │             
    │── WAL ────────► │               │  Event pushed
    │   (realtime)    │── publish ──► │  instantly on
    │                 │               │  commit
    │                 │               │             
    │── WAL ────────► │── publish ──► │             
    │                 │               │             
  Benefits:
  • Sub-millisecond latency
  • No polling overhead
  • Guaranteed ordering
  • Exactly-once delivery

Change Data Capture (CDC) via WAL replication solves all of these.

What is Sequin?

Sequin connects to PostgreSQL’s logical replication slot and reads the Write-Ahead Log (WAL) in real time. When a row is inserted, Sequin captures that change and forwards it downstream. This is fundamentally different from polling: changes are pushed by the database itself.

Sequin provides:

  • Exactly-once delivery  -  Position only advances after downstream acknowledges
  • Ordering guarantees  -  Events arrive in transaction commit order
  • Sub-millisecond latency  -  Events pushed immediately on commit, no polling delay
  • Zero database load  -  Reading WAL doesn’t add queries
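The first property, advancing the replication position only after the downstream acknowledges, is what makes crash recovery safe. A toy model of that pattern (the `WalReader` class is invented for illustration; it is not Sequin’s API):

```python
class WalReader:
    """Toy WAL reader: confirmed position advances only after downstream ack."""
    def __init__(self, entries):
        self.entries = entries      # ordered, durable log
        self.confirmed = 0          # last acknowledged position

    def pending(self):
        # Everything after the confirmed position is re-read on restart
        return self.entries[self.confirmed:]

    def ack(self, count):
        self.confirmed += count

wal = WalReader(["evt_1", "evt_2", "evt_3"])

delivered = []
for event in wal.pending():
    delivered.append(event)         # push downstream (e.g. to Kafka)
    wal.ack(1)                      # advance only after the push succeeds

# Simulate a crash after evt_1 was acked but before evt_2 was:
wal2 = WalReader(["evt_1", "evt_2", "evt_3"])
wal2.ack(1)                         # evt_1 already confirmed before the crash
assert wal2.pending() == ["evt_2", "evt_3"]   # restart resumes here, in order
```

Nothing is skipped and nothing is delivered out of order; at worst, an unacknowledged event is re-read after a restart.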

Why Kafka as a Buffer?

We don’t push events directly from Sequin to Restate. Instead, Sequin publishes to Kafka, and Restate subscribes.

Kafka as a Durable Buffer
═══════════════════════════════════════════════════════════════════

Normal Operation:
  ─────────────────
  Sequin ──► Kafka ──► Restate ──► Webhook delivered
             2-5ms     

  When Restate is Slow/Down:
  ──────────────────────────
  Sequin ──► Kafka ──────────────────────────► Restate (recovered)
               │                                    │
               │     Events buffered safely         │
               │     ┌─────────────────────┐        │
               └────►│ msg 1               │────────┘
                     │ msg 2               │
                     │ msg 3               │  ← Can hold millions
                     │ ...                 │     of messages
                     │ msg N               │
                     └─────────────────────┘
  Replay Capability:
  ──────────────────
  Found a bug? Replay from any offset in the last 7 days.

Kafka acts as a durable buffer providing:

  • Backpressure handling  -  If Restate is slow, events accumulate in Kafka rather than being dropped. Kafka can buffer millions of messages without issue.
  • Replay capability  -  Kafka retains messages for 7 days. If we discover a bug in webhook processing, we can replay from a specific offset.
  • Independent scaling  -  Multiple Restate workers can consume from the same topic via consumer groups.
  • Operational isolation  -  A Restate problem doesn’t affect Sequin’s ability to capture events.

# Conceptual configuration
sequin:
  source: domain_events table (WAL)
  destination: kafka://webhook-events

restate:
  source: kafka://webhook-events
  handler: CdcConsumer.handle_event

If Restate fails, Kafka retains the message. If the entire Restate cluster goes down, events accumulate until it recovers. Events don’t get lost.

Layer 3: Durable Execution with Restate

This is where the durability magic happens. Restate is a “durable execution” platform that fundamentally changes how we think about code reliability.

The Problem with Ephemeral Code

Traditional application code is ephemeral: if the process dies, any in-progress work is lost. Schedule a retry for 5 minutes from now, the pod gets killed at minute 3, and that retry never happens.

How Restate Works

Restate inverts this model: every function invocation is logged to a persistent journal before execution. If the process crashes mid-execution, Restate replays the journal and resumes exactly where it left off.

Restate's Durable Execution Model
═══════════════════════════════════════════════════════════════════

Traditional Code (Ephemeral):
  ─────────────────────────────
  Process Memory                 
  ┌────────────────────────────┐ 
  │ 1. Generate event_id       │──► event_id = "evt_abc123"
  │ 2. Call webhook service    │──► Crash here!
  │ 3. Mark complete           │    
  └────────────────────────────┘    


  Process restarts...               
  ┌────────────────────────────┐    
  │ 1. Generate event_id       │──► event_id = "evt_xyz789"  ← NEW ID!
  │ 2. Call webhook service    │──► Duplicate webhook sent
  │ 3. Mark complete           │    
  └────────────────────────────┘    

  Restate (Durable):
  ──────────────────
  Restate Journal (persistent)         Process
  ┌────────────────────────────┐      ┌──────────────────┐
  │ ✓ Input: event data        │      │                  │
  │ ✓ ctx.run: "evt_abc123"    │◄────►│ Execute step 1   │
  │ ○ (pending: webhook call)  │      │ Crash here!      │
  └────────────────────────────┘      └──────────────────┘

           ▼  Process restarts, journal replayed
  ┌────────────────────────────┐      ┌──────────────────┐
  │ ✓ Input: event data        │      │                  │
  │ ✓ ctx.run: "evt_abc123"    │─────►│ Skip (cached)    │
  │ ○ (pending: webhook call)  │◄────►│ Resume step 2    │
  └────────────────────────────┘      └──────────────────┘

  Same event_id used! No duplicate webhooks.

Think of it like a database transaction, but for arbitrary code. The journal records:

  • Function inputs
  • Results of side effects (HTTP calls, random number generation, etc.)
  • Retry state and timing

When replayed after a crash, Restate doesn’t re-execute side effects; it reads recorded results from the journal. Your code executes identically whether fresh or replayed. This is called deterministic execution.
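The journal-replay idea can be sketched with a toy `ctx.run()`. This is a deliberately simplified model, not Restate’s actual implementation:

```python
import uuid

class Journal:
    """Toy durable journal: side-effect results are recorded exactly once."""
    def __init__(self):
        self.entries = []   # persisted results, in execution order
        self.cursor = 0

    def run(self, side_effect):
        if self.cursor < len(self.entries):
            result = self.entries[self.cursor]   # replay: reuse recorded result
        else:
            result = side_effect()               # first run: execute and record
            self.entries.append(result)
        self.cursor += 1
        return result

journal = Journal()

# First execution: generates a fresh ID, journals it, then "crashes"
first_id = journal.run(lambda: f"evt_{uuid.uuid4().hex[:8]}")

# Replay after the crash: rewind the cursor and run the same code again
journal.cursor = 0
replayed_id = journal.run(lambda: f"evt_{uuid.uuid4().hex[:8]}")

assert replayed_id == first_id   # same event_id, so no duplicate webhook
```

The second execution never calls `uuid.uuid4()` again; it reads the journaled result, which is exactly why replayed code stays deterministic.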

Our Webhook Handler

# Pseudocode - actual implementation in Rust
async def handle_cdc_event(ctx, event, is_live):
    # Generate event ID - journaled via ctx.run()
    # On replay, returns the SAME ID, not a new one
    event_id = await ctx.run(lambda: generate_unique_id())

    # Fan out to webhook sender (async, durable)
    await ctx.service_call(WebhookSender.send, {
        "event_id": event_id,
        "event": event,
        "is_live": is_live
    })

The ctx.run() wrapper is crucial. Anything inside ctx.run() gets journaled. If we crash after generating the event ID but before calling the webhook sender, replay uses the same event ID. This is how we achieve stable idempotency keys across crashes.

Durable Retries

Retry State: In-Memory vs Durable
═══════════════════════════════════════════════════════════════════

In-Memory Retries (Traditional):
  ────────────────────────────────
  Attempt 1 fails ──► Schedule retry in 6s ──► Process dies

                            ╳  Retry lost forever

  Durable Retries (Restate):
  ──────────────────────────
  Attempt 1 fails ──► Schedule retry in 6s ──► Process dies


                    ┌─────────────────┐
                    │ Restate Journal │
                    │ ─────────────── │
                    │ retry_at: 14:35 │ ← Persisted!
                    │ attempt: 2      │
                    └─────────────────┘


                    Process restarts


                    Retry fires at 14:35 ✓

# Pseudocode - actual implementation in Rust
@restate.service
class WebhookSender:
    @restate.handler(
        retry_policy=RetryPolicy(
            initial_delay=timedelta(seconds=6),
            max_delay=timedelta(hours=6),
            backoff_multiplier=10,
            max_attempts=10
        )
    )
    async def send(ctx, request):
        response = await ctx.run(
            lambda: http_post(DATA_SERVICE_URL, request)
        )
        if not response.ok:
            raise RetryableError("Delivery failed")

If the HTTP call fails, Restate schedules a retry using exponential backoff. The retry schedule is written to the journal immediately. Crash while waiting? Retry still fires. Deploy a new version? Picks up where it left off.

This is the key difference: Restate’s retry state is durable, not in-memory.
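As a sanity check on the policy above (initial delay 6s, multiplier 10, capped at 6 hours, 10 attempts), the retry schedule can be computed directly. `retry_delays` is a hypothetical helper written for this post, not part of Restate:

```python
from datetime import timedelta

def retry_delays(initial, multiplier, max_delay, max_attempts):
    """Exponential backoff: delay before each retry, capped at max_delay."""
    delays = []
    delay = initial
    for _ in range(max_attempts - 1):   # the first attempt has no delay
        delays.append(min(delay, max_delay))
        delay *= multiplier
    return delays

delays = retry_delays(
    initial=timedelta(seconds=6),
    multiplier=10,
    max_delay=timedelta(hours=6),
    max_attempts=10,
)
print([str(d) for d in delays])
# 6s, 60s, 10min, 100min, then the 6-hour cap for the remaining retries
```

With this policy a failing endpoint gets roughly a day and a half of retries before the webhook is given up on.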

Layer 4: Data Enrichment and Last-Mile Delivery

The webhook sender doesn’t deliver directly to merchants. It calls our Data Service for payload enrichment, then hands off to Svix for final delivery.

The Data Service

Why fetch data at delivery time instead of storing it with the event?

Because merchants want current state, not historical state. If a payment gets refunded between event creation and delivery, the merchant wants to know. Fetching fresh data ensures the payload reflects reality at delivery time.

# Pseudocode - actual implementation in Rust
def create_webhook_payload(business_id, event_type, object_id):
    # Fetch fresh data from the database at delivery time
    match event_type.split(".")[0]:
        case "payment":      data = fetch_payment(business_id, object_id)
        case "subscription": data = fetch_subscription(business_id, object_id)
        case "refund":       data = fetch_refund(business_id, object_id)
        case "dispute":      data = fetch_dispute(business_id, object_id)

    return WebhookPayload(
        business_id=business_id,
        event_type=event_type,
        timestamp=now(),
        data=data
    )

Svix for Last-Mile Delivery

Svix handles everything after payload construction: signing the payload, delivering it, and retrying until the merchant’s endpoint acknowledges it.

The idempotency key we pass to Svix is the event_id generated in Restate’s ctx.run() block - stable across retries and replays. If Restate retries a failed Svix call, Svix recognizes the duplicate and returns success without re-delivering.
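The effect of a stable idempotency key can be shown with a toy receiver. `IdempotentReceiver` is invented for illustration; Svix’s actual deduplication happens server-side:

```python
class IdempotentReceiver:
    """Toy stand-in for Svix: at most one delivery per idempotency key."""
    def __init__(self):
        self.seen = {}

    def send(self, idempotency_key, payload):
        if idempotency_key in self.seen:
            return self.seen[idempotency_key]    # duplicate: replay stored result
        self.seen[idempotency_key] = {"status": "delivered", "payload": payload}
        return self.seen[idempotency_key]

svix = IdempotentReceiver()

# Restate retries the same call with the same journaled event_id
first = svix.send("evt_abc123", {"type": "payment.succeeded"})
retry = svix.send("evt_abc123", {"type": "payment.succeeded"})

assert first == retry           # the retry is a no-op, not a second webhook
assert len(svix.seen) == 1
```

The key detail is that the ID was journaled once in `ctx.run()`; a per-attempt ID would defeat the dedup entirely.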

We’re not in the webhook delivery business. Svix is. They’ve seen more webhook failure modes than we ever will.

Results

After deploying this architecture:

Latency Breakdown: How We Achieve Sub-500ms Delivery
═══════════════════════════════════════════════════════════════════

Event Timeline (P50):
  0ms        100ms      200ms      300ms      400ms      500ms
  │          │          │          │          │          │
  ▼          ▼          ▼          ▼          ▼          ▼
  ├──────────┼──────────┼──────────┼──────────┼──────────┤
  │          │          │          │          │          │
  │ Trigger  │ Sequin   │ Kafka    │ Restate  │ Svix     │
  │ <1ms     │ ~50ms    │ ~5ms     │ ~150ms   │ ~200ms   │
  │          │          │          │          │          │
  └──────────┴──────────┴──────────┴──────────┴──────────┘
  DB commit  WAL read   Publish/   Journal +  Sign +
  + trigger  + publish  Subscribe  enrich     deliver

  Reliability Comparison:
                          Before                 After
                     ┌─────────────┐        ┌─────────────┐
  Delivered          │▓▓▓▓▓▓▓▓▓▓▓  │        │▓▓▓▓▓▓▓▓▓▓▓▓ │
  (99.7%)            │▓▓▓▓▓▓▓▓▓▓▓  │        │▓▓▓▓▓▓▓▓▓▓▓▓ │
                     │▓▓▓▓▓▓▓▓▓▓▓  │        │▓▓▓▓▓▓▓▓▓▓▓▓ │ (99.99%+)
                     │▓▓▓▓▓▓▓▓▓▓▓░ │        │▓▓▓▓▓▓▓▓▓▓▓▓ │
  Lost (0.3%)        │           ░ │        │             │ (<0.01%)
                     └─────────────┘        └─────────────┘
  At 100K webhooks/day:
  • Before: ~300 lost/day
  • After:  <10 lost/day (merchant endpoint issues)

The latency improvement surprised us. Despite adding more components to the pipeline, the P50 latency dropped to under 500ms. The key factors:

  • CDC is faster than polling  -  Sequin pushes events immediately on WAL commit, no polling interval
  • Kafka adds negligible latency  -  For small messages, Kafka’s latency is typically 2–5ms
  • Restate is optimized for low-latency  -  The journaling overhead is minimal for simple workflows
  • Parallel execution  -  Events flow through the pipeline without blocking on each other

For merchants building real-time experiences, like instantly provisioning access after payment, this near-real-time delivery is critical. A webhook that arrives in 500ms feels instantaneous. One that arrives in 2+ seconds creates noticeable lag in the user experience.

The reliability improvement sounds small in percentage terms. In absolute terms: we went from losing ~300 webhooks per 100,000 payments to losing essentially none. The remaining failures are merchant endpoint issues, not ours.

More importantly, we can now prove webhooks were delivered. Every event has an audit trail through Sequin, Kafka, Restate, and Svix. When a merchant says “I didn’t get the webhook,” we can show them exactly what happened.

Should You Build This?

This architecture adds operational complexity. You’re now running CDC infrastructure, a message queue, a durable execution platform, and a webhook delivery service. That’s a lot of moving parts.

Build this if:

  • Webhook reliability is business-critical. For payments, a missed payment.succeeded webhook means customers paid but got nothing. The cost of failure is high.

  • You’re at scale where failures are statistically inevitable. At 1,000 webhooks/day, 0.3% loss is 3 webhooks  -  annoying but manageable. At 100,000/day, it’s 300 daily failures and a support nightmare.

  • Your events derive from durable state. This pattern assumes events are notifications about database changes. If your events are fire-and-forget commands, CDC won’t help.

Don’t build this if:

  • You’re early stage. A simple retry queue is fine. Ship features first.

  • Your merchants tolerate occasional failures. Some integrations don’t need four nines of reliability.

  • You can’t operate the additional systems. Each component adds failure modes. Ensure you have the expertise.

Key Takeaways

The Reliability Stack: What Each Layer Guarantees
═══════════════════════════════════════════════════════════════════

  ┌─────────────────────────────────────────────────────────────────┐
  │                                                                 │
  │   PostgreSQL Trigger                                            │
  │   └─► "If payment committed, event exists"                      │
  │                                                                 │
  ├─────────────────────────────────────────────────────────────────┤
  │                                                                 │
  │   Sequin (CDC)                                                  │
  │   └─► "Every committed event reaches Kafka, in order"           │
  │                                                                 │
  ├─────────────────────────────────────────────────────────────────┤
  │                                                                 │
  │   Kafka                                                         │
  │   └─► "Events are buffered durably, replayable for 7 days"      │
  │                                                                 │
  ├─────────────────────────────────────────────────────────────────┤
  │                                                                 │
  │   Restate                                                       │
  │   └─► "Processing will complete, even across crashes/deploys"   │
  │                                                                 │
  ├─────────────────────────────────────────────────────────────────┤
  │                                                                 │
  │   Svix                                                          │
  │   └─► "Delivery will be retried until merchant acknowledges"    │
  │                                                                 │
  └─────────────────────────────────────────────────────────────────┘
  Combined guarantee: Events flow from database to merchant 
                      with exactly-once semantics and sub-500ms latency

  • In-process webhook delivery is fundamentally unreliable. If your process can die, your webhooks can be lost. Design around this limitation.
  • Capture events in the database, not application code. Triggers ensure events exist atomically with state changes.
  • Use CDC for exactly-once streaming. WAL replication gives durability and ordering without complex checkpointing.
  • Durable execution solves “crashed mid-retry.” Restate, Temporal, or similar platforms journal execution state so retries survive restarts.
  • Stable idempotency keys require careful attention. One key per event, not per attempt. Generate once, journal it, reuse on replay.
  • Fetch payload data at delivery time. Merchants want current state. Store minimal event metadata, enrich at delivery.

Webhooks don’t have to be unreliable. The tools exist to build systems that genuinely never lose events. It just requires acknowledging that “fire and forget” was never acceptable for critical notifications.

Build with us

We're building the payments and billing platform for SaaS, AI, and digital products. Come help us ship.

View Open Positions
