# How Replicate Handles Billing: A Complete Breakdown

> A detailed analysis of Replicate's pure usage-based billing model - per-second compute pricing across hardware tiers, cold start costs, and how to build the same pay-per-second infrastructure billing for your own AI platform.
- **Author**: Ayush Agarwal
- **Published**: 2026-04-09
- **Category**: AI, Billing
- **URL**: https://dodopayments.com/blogs/replicate-billing-model

---

Most AI platforms hedge their bets. They bundle compute into subscription tiers, sell token packs, or hide GPU costs behind per-request pricing. Replicate does none of that. It charges per second of compute time, with rates that change based on which hardware your model runs on. No monthly fee. No minimum commitment. No per-model pricing.

This makes Replicate one of the purest examples of [usage-based billing](https://dodopayments.com/blogs/usage-based-billing-vs-flat-fees-ai-saas) in production today. Where [Midjourney wraps GPU time in subscription tiers](https://dodopayments.com/blogs/midjourney-billing-model) and [Cursor bundles tokens into dollar-denominated pools](https://dodopayments.com/blogs/cursor-billing-model), Replicate strips everything back to a single billing primitive: hardware multiplied by seconds.

This post breaks down how Replicate's billing model works, what makes it effective for AI infrastructure, where it creates friction, and how to build the same per-second compute billing using [Dodo Payments](https://dodopayments.com). For a full technical implementation guide with code samples, see the [Replicate Billing Deconstruction](https://docs.dodopayments.com/developer-resources/billing-deconstructions/replicate) in Dodo's documentation.

## Replicate's Pricing by Hardware Tier

Replicate's pricing page is refreshingly simple. There are no plans to compare and no feature gates to navigate. The entire pricing model fits in one table: the cost per second for each hardware option.

**Standard Hardware**

| Hardware               | Price/sec | Price/hr | GPU RAM | System RAM |
| :--------------------- | :-------- | :------- | :------ | :--------- |
| CPU (Small)            | $0.000025 | $0.09    | -       | 2 GB       |
| CPU                    | $0.000100 | $0.36    | -       | 8 GB       |
| Nvidia T4 GPU          | $0.000225 | $0.81    | 16 GB   | 16 GB      |
| Nvidia L40S GPU        | $0.000975 | $3.51    | 48 GB   | 65 GB      |
| Nvidia A100 (80GB) GPU | $0.001400 | $5.04    | 80 GB   | 144 GB     |
| Nvidia H100 GPU        | $0.001525 | $5.49    | 80 GB   | 72 GB      |

**Multi-GPU Configurations**

| Hardware              | Price/sec | Price/hr |
| :-------------------- | :-------- | :------- |
| 2x Nvidia L40S        | $0.001950 | $7.02    |
| 2x Nvidia A100 (80GB) | $0.002800 | $10.08   |
| 2x Nvidia H100        | $0.003050 | $10.98   |
| 4x Nvidia A100 (80GB) | $0.005600 | $20.16   |
| 4x Nvidia H100        | $0.006100 | $21.96   |
| 8x Nvidia A100 (80GB) | $0.011200 | $40.32   |
| 8x Nvidia H100        | $0.012200 | $43.92   |

The multi-GPU configurations (4x and 8x) require committed spend contracts. Everything else is available on demand to any user with a credit card on file.

Notice how the pricing scales linearly with GPU count. Two A100s cost exactly twice what one A100 costs. There is no volume discount baked into the hardware tiers themselves - the discount comes from Replicate's enterprise team for high-volume customers.
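That linearity is easy to sanity-check in code. A minimal sketch, using the published per-second rates from the tables above:

```typescript
// Per-second rates for single GPUs, taken from the standard-hardware table.
const SINGLE_GPU_RATE: Record<string, number> = {
  l40s: 0.000975,
  a100_80: 0.0014,
  h100: 0.001525,
};

// Multi-GPU pricing is simply the single-GPU rate times the GPU count.
function multiGpuRate(gpu: string, count: number): number {
  return SINGLE_GPU_RATE[gpu] * count;
}

console.log(multiGpuRate("a100_80", 8).toFixed(6)); // "0.011200" — matches the 8x A100 row
```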

## How Replicate's Billing Works

The billing mechanics are straightforward in concept but have several nuances that affect real-world costs.

### Hardware-Specific, Model-Agnostic Rates

Replicate hosts thousands of open-source models contributed by the community. Rather than pricing each model individually, they decouple billing from the model entirely. Whether you run Stable Diffusion, Llama 3, or Whisper, the cost depends only on which hardware the model runs on and how long it runs.

This lets Replicate add new models without touching the pricing system. A contributor can upload a new model tomorrow, and billing already works because it tracks the hardware layer, not the model layer.

```mermaid
flowchart LR
    A["API Request"] --> B["Model Loaded<br/>on Hardware"]
    B --> C{"Hardware Tier?"}
    C -->|"CPU"| D["$0.000100/sec"]
    C -->|"T4 GPU"| E["$0.000225/sec"]
    C -->|"L40S GPU"| F["$0.000975/sec"]
    C -->|"A100 80GB"| G["$0.001400/sec"]
    C -->|"H100"| H["$0.001525/sec"]
    D --> I["Seconds x Rate<br/>= Total Cost"]
    E --> I
    F --> I
    G --> I
    H --> I
```

Some proprietary models use input/output-based pricing instead - Claude 3.7 Sonnet charges per token, Flux 1.1 Pro charges per image. But for the vast majority of open-source models, it is hardware-times-seconds.

### Per-Second Granularity

Traditional cloud providers bill by the hour or by the minute. Replicate bills by the second. For AI workloads where a typical inference takes 3-30 seconds, this granularity eliminates massive billing waste.

Consider a simple comparison. An image generation that takes 8 seconds on an A100 costs $0.0112 on Replicate. On a platform that rounds up to the nearest minute, that same 8-second job costs $0.084 - over 7x more. For [pay-as-you-go AI platforms](https://dodopayments.com/blogs/pay-as-you-go-ai-saas) where users run hundreds of small inferences per day, per-second billing is the difference between a viable business model and one that bleeds margin.
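A quick sketch of that arithmetic, contrasting per-second billing with a hypothetical minute-rounding provider:

```typescript
const A100_RATE_PER_SEC = 0.0014; // $/sec for an A100 (80GB), from the table above

// Per-second billing: pay for exactly the seconds consumed.
function perSecondCost(seconds: number, ratePerSec: number): number {
  return seconds * ratePerSec;
}

// Minute-rounded billing: every job is rounded up to the next full minute.
function perMinuteCost(seconds: number, ratePerSec: number): number {
  return Math.ceil(seconds / 60) * 60 * ratePerSec;
}

const job = 8; // an 8-second image generation
console.log(perSecondCost(job, A100_RATE_PER_SEC)); // ≈ 0.0112
console.log(perMinuteCost(job, A100_RATE_PER_SEC)); // ≈ 0.084, 7.5x more
```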

### Cold Starts Are Billable

This is the detail that catches most new users off guard. When a public model has not been used recently, Replicate needs to load it into GPU memory before it can process requests. This cold start typically takes 10-30 seconds, and that loading time is billed at the same per-second rate as actual execution.

For a model running on an A100, a 20-second cold start adds $0.028 to the first request. Subsequent requests (while the model stays warm) avoid this cost entirely. If your application sends sporadic requests to many different models, cold start costs can actually exceed execution costs.

Replicate mitigates this with a warm-up period: models stay loaded for a short window after the last request. But if traffic is bursty or spread across many models, cold starts become a meaningful line item.
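The cold-start overhead is straightforward to model. A sketch, using the A100 rate and the 20-second cold start from the example above:

```typescript
const A100_RATE_PER_SEC = 0.0014; // $/sec

// Total request cost is (cold start + execution) seconds times the rate.
function requestCost(execSeconds: number, coldStartSeconds: number): number {
  return (execSeconds + coldStartSeconds) * A100_RATE_PER_SEC;
}

const firstRequest = requestCost(8, 20); // ≈ $0.0392 — the cold start is ~71% of it
const warmRequest = requestCost(8, 0);   // ≈ $0.0112 while the model stays loaded
console.log(firstRequest, warmRequest);
```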

### Public Models vs. Private Models

Replicate distinguishes between two deployment modes, and the billing rules change significantly.

**Public models** share infrastructure with other users. You only pay for the time your request is actively being processed (plus cold starts). When your request finishes, the hardware serves someone else.

**Private models** run on dedicated hardware. You pay for all time the instance is online: setup time, idle time, and active processing time. If your private model sits idle for an hour, you pay the full hourly rate. The tradeoff is dedicated capacity with no queue contention and no cold starts.

The exception is "fast booting fine-tunes" - private fine-tuned models that use shared infrastructure and only bill for active processing time.

> When you bill for infrastructure usage directly - compute time, hardware tier, seconds consumed - you create a billing model that scales honestly with the value you deliver. There is no scenario where a customer overpays for resources they did not use.
>
> - Ayush Agarwal, Co-founder & CPTO at Dodo Payments

## What Makes Replicate's Model Effective

Several properties of this billing model make it well-suited for AI infrastructure platforms.

### Zero Commitment Lowers the Entry Barrier

There is no monthly fee, no annual contract, and no minimum spend. A developer can sign up, run one prediction for $0.01, and walk away. This is why Replicate has become a go-to prototyping platform for AI developers. The absence of a subscription removes the friction that kills experimentation.

For comparison, [subscription-based billing](https://dodopayments.com/blogs/subscription-model-best-practices) forces users to commit before they know whether the product fits their workflow. Replicate sidesteps this entirely. Users self-select into higher spending as they find models that work for their use case.

### Cost Transparency Builds Trust

Users can calculate exactly what any job will cost before running it. Replicate shows cost estimates on each model's page. An 8-second SDXL generation on an A100 costs approximately $0.0112. A 3-second Llama 3 inference on the same hardware costs approximately $0.0042. There are no hidden fees, no platform surcharges, and no mystery line items.

By tying cost directly to the observable variable (seconds on hardware), Replicate makes costs predictable even when workloads are not.

### Hardware Choice Gives Users Cost Control

Users are not locked into a single hardware tier. Many models support multiple hardware options, and users can choose between speed and cost. Running Whisper on a T4 GPU costs $0.000225/sec but takes longer. Running it on an A100 costs $0.001400/sec but finishes faster. This mirrors [value-based pricing](https://dodopayments.com/blogs/value-based-pricing-saas) principles: users who need faster results pay more per second but consume fewer total seconds.
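To make the tradeoff concrete, here is a sketch comparing total job cost across tiers. The runtimes below are hypothetical, not measured benchmarks:

```typescript
// Hypothetical runtimes for the same transcription job on two tiers.
const options = [
  { tier: "t4", ratePerSec: 0.000225, seconds: 15 },   // cheaper per second, slower
  { tier: "a100_80", ratePerSec: 0.0014, seconds: 4 }, // pricier per second, faster
];

for (const o of options) {
  const total = o.ratePerSec * o.seconds;
  console.log(`${o.tier}: $${total.toFixed(4)} over ${o.seconds}s`);
}
// Under these assumed runtimes, the T4 wins on total cost
// (≈ $0.0034 vs ≈ $0.0056) while the A100 wins on latency.
```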

### Model-Agnostic Billing Scales the Ecosystem

Because pricing is decoupled from models, Replicate can host a massive model library without managing per-model pricing. Community contributors upload models without negotiating pricing terms. Platforms that price per model (like [OpenAI's tiered token pricing](https://dodopayments.com/blogs/openai-billing-model)) need to set and update rates for every model they offer. Replicate delegates this complexity to the hardware layer, keeping the pricing system simple even as the model catalog grows into the tens of thousands.

## Where Replicate's Model Creates Friction

No billing model is perfect. Replicate's pure usage-based approach has tradeoffs that affect both the platform and its users.

### Unpredictable Monthly Bills

The same property that makes Replicate flexible - no subscription, pure usage - also makes costs hard to forecast. A team running batch processing jobs might spend $50 one month and $500 the next depending on project demands. For companies that need budget predictability, this variability creates friction with finance teams and procurement processes.

This is the central tension in [usage-based versus flat-fee billing](https://dodopayments.com/blogs/usage-based-billing-vs-flat-fees-ai-saas). Pure usage models reward efficiency and punish waste, but they also make financial planning harder. Most mature platforms eventually add spending caps or budget alerts to address this.

### Cold Start Costs Hurt Low-Frequency Users

Developers who use many different models infrequently pay a cold start penalty on almost every request. A workflow that touches 10 different models in a day might spend more on cold starts than on actual inference. The billing model implicitly favors concentrated, high-frequency usage patterns over broad, exploratory ones.

### No Built-in Retention Mechanics

Compare Replicate to [Midjourney's approach](https://dodopayments.com/blogs/midjourney-billing-model): when Midjourney subscribers exhaust their fast GPU hours, they fall back to Relax Mode rather than being cut off. [Cursor offers similar mechanics](https://dodopayments.com/blogs/cursor-billing-model) with its tiered usage pools. Replicate has no equivalent. When users stop sending requests, spending drops to zero. There is no monthly subscription creating switching costs, no [billing credits](https://dodopayments.com/blogs/billing-credits-pricing-cashflow) encouraging prepayment, and no lock-in beyond the platform's model library.

### Private Model Billing Can Be Expensive

The always-on billing for private models means you pay even when the hardware sits idle. A single A100 instance running 24/7 costs $120.96/day or roughly $3,629/month - even if it only processes requests for a few hours. Fast booting fine-tunes help, but they are only available for specific model architectures.
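The always-on math, as a sketch:

```typescript
const A100_RATE_PER_SEC = 0.0014; // $/sec, from the standard-hardware table

// A private model bills for every second the instance is online,
// whether or not it is processing requests.
function alwaysOnCost(ratePerSec: number, hoursOnline: number): number {
  return ratePerSec * 3600 * hoursOnline;
}

console.log(alwaysOnCost(A100_RATE_PER_SEC, 24));      // ≈ $120.96 per day
console.log(alwaysOnCost(A100_RATE_PER_SEC, 24 * 30)); // ≈ $3,628.80 per 30-day month
```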

## Build Replicate's Billing Model with Dodo Payments

You can build the same per-second, hardware-tiered billing model using [Dodo Payments' usage-based billing](https://docs.dodopayments.com/features/usage-based-billing/introduction) features. The key is creating separate [meters](https://docs.dodopayments.com/features/usage-based-billing/meters) for each hardware tier and sending [usage events](https://docs.dodopayments.com/features/usage-based-billing/event-ingestion) when model executions complete.

### Step 1: Create Meters for Each Hardware Tier

Each hardware type has a different cost per second, so you need independent meters to track and price them separately.

| Meter Name            | Event Name            | Aggregation | Property            |
| :-------------------- | :-------------------- | :---------- | :------------------ |
| CPU Compute           | `compute.cpu`         | Sum         | `execution_seconds` |
| GPU T4 Compute        | `compute.gpu_t4`      | Sum         | `execution_seconds` |
| GPU L40S Compute      | `compute.gpu_l40s`    | Sum         | `execution_seconds` |
| GPU A100 80GB Compute | `compute.gpu_a100_80` | Sum         | `execution_seconds` |
| GPU H100 Compute      | `compute.gpu_h100`    | Sum         | `execution_seconds` |

The `Sum` aggregation on `execution_seconds` calculates total compute time per hardware tier over each billing period. Set a free threshold of 0 for all meters - every second is billable.

### Step 2: Attach Meters to a Usage-Based Product

Create a product in the [Dodo Payments](https://dodopayments.com) dashboard with:

- **Pricing type**: Usage-Based Billing
- **Base price**: $0/month (no subscription fee, matching Replicate's model)
- **Billing frequency**: Monthly

Attach each meter with its per-unit price:

| Meter                 | Price Per Second |
| :-------------------- | :--------------- |
| `compute.cpu`         | $0.000100        |
| `compute.gpu_t4`      | $0.000225        |
| `compute.gpu_l40s`    | $0.000975        |
| `compute.gpu_a100_80` | $0.001400        |
| `compute.gpu_h100`    | $0.001525        |
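The dashboard remains the source of truth for these prices, but mirroring them in a typed constant keeps any client-side cost math consistent with what gets billed. A sketch — the structure here is an application-side convention, not part of Dodo's SDK:

```typescript
// Per-second prices mirroring the meter configuration above.
const METER_RATES = {
  "compute.cpu": 0.0001,
  "compute.gpu_t4": 0.000225,
  "compute.gpu_l40s": 0.000975,
  "compute.gpu_a100_80": 0.0014,
  "compute.gpu_h100": 0.001525,
} as const;

type MeterName = keyof typeof METER_RATES;

// Example: price a metered quantity locally for display purposes.
function localCost(meter: MeterName, seconds: number): number {
  return METER_RATES[meter] * seconds;
}
```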

### Step 3: Ingest Usage Events After Each Prediction

Send a usage event to Dodo every time a model execution completes. Include the hardware tier as the event name and execution seconds in the metadata.

```typescript
import DodoPayments from "dodopayments";

type HardwareTier = "cpu" | "gpu_t4" | "gpu_l40s" | "gpu_a100_80" | "gpu_h100";

const client = new DodoPayments({
  bearerToken: process.env.DODO_PAYMENTS_API_KEY,
});

async function trackPrediction(
  customerId: string,
  modelId: string,
  hardware: HardwareTier,
  executionSeconds: number,
  predictionId: string,
) {
  await client.usageEvents.ingest({
    events: [
      {
        event_id: predictionId, // caller supplies a fully prefixed id; avoids "pred_pred_" doubling
        customer_id: customerId,
        event_name: `compute.${hardware}`,
        timestamp: new Date().toISOString(),
        metadata: {
          execution_seconds: executionSeconds,
          model_id: modelId,
          hardware: hardware,
        },
      },
    ],
  });
}

// Example: SDXL image generation - 8.3 seconds on A100
await trackPrediction(
  "cus_abc123",
  "stability-ai/sdxl",
  "gpu_a100_80",
  8.3,
  "pred_xyz789",
);
```

The `event_id` ensures idempotency - if a network retry sends the same event twice, Dodo deduplicates it.
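Because `event_id` deduplicates, retries should reuse the same id rather than mint a new one. A sketch of a retry wrapper; the `IngestFn` indirection just keeps the example self-contained — in practice the `ingest` parameter would wrap the `client.usageEvents.ingest` call from the snippet above:

```typescript
interface UsageEvent {
  event_id: string;
  customer_id: string;
  event_name: string;
  timestamp: string;
  metadata: Record<string, unknown>;
}

type IngestFn = (payload: { events: UsageEvent[] }) => Promise<void>;

// Retry transient ingestion failures while reusing the same event_id,
// so a duplicate delivery is deduplicated instead of double-billed.
async function ingestWithRetry(
  ingest: IngestFn,
  event: UsageEvent,
  maxAttempts = 3,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await ingest({ events: [event] });
      return;
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      // Exponential backoff before the next attempt.
      await new Promise((r) => setTimeout(r, 2 ** attempt * 100));
    }
  }
}
```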

### Step 4: Wrap Execution with Precise Timing

For [accurate metered billing](https://dodopayments.com/blogs/metered-billing-accurate-billing), wrap model execution with high-resolution timing.

```typescript
async function runModelWithMetering(
  customerId: string,
  modelId: string,
  hardware: HardwareTier,
  input: Record<string, unknown>,
) {
  // Assumes Node 19+, where crypto.randomUUID() is global; avoids the id
  // collisions that Date.now() allows under concurrent predictions.
  const predictionId = `pred_${crypto.randomUUID()}`;
  const startTime = performance.now();

  try {
    // executeModel is your platform's own inference entry point (placeholder).
    const result = await executeModel(modelId, input, hardware);
    const executionMs = performance.now() - startTime;
    const billedSeconds = Math.round(executionMs / 100) / 10; // round to nearest 0.1 s

    await trackPrediction(
      customerId,
      modelId,
      hardware,
      billedSeconds,
      predictionId,
    );

    return result;
  } catch (error) {
    // Bill for compute time even on failure; skip sub-second failures,
    // which consumed negligible GPU time.
    const executionMs = performance.now() - startTime;
    if (executionMs > 1000) {
      await trackPrediction(
        customerId,
        modelId,
        hardware,
        Math.round(executionMs / 100) / 10,
        predictionId,
      );
    }
    throw error;
  }
}
```

This pattern ensures every prediction gets metered, including failed ones that still consumed GPU time.

### Cost Estimation for Users

Since pure usage-based billing can feel unpredictable, surface cost estimates before users run a model. This reduces billing surprises and builds trust.

| Model            | Hardware  | Avg Time | Estimated Cost |
| :--------------- | :-------- | :------- | :------------- |
| SDXL (image)     | A100 80GB | ~8 sec   | ~$0.0112       |
| Llama 3 (text)   | A100 80GB | ~3 sec   | ~$0.0042       |
| Whisper (audio)  | T4 GPU    | ~15 sec  | ~$0.0034       |
| Video generation | H100      | ~45 sec  | ~$0.0686       |
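A minimal estimator for that kind of pre-run display. The average-runtime figures would come from your own execution history; the values below are illustrative:

```typescript
const RATES: Record<string, number> = {
  gpu_t4: 0.000225,
  gpu_a100_80: 0.0014,
  gpu_h100: 0.001525,
};

// Estimate a job's cost from the model's average historical runtime
// and its hardware tier's per-second rate.
function estimateCost(hardware: string, avgSeconds: number): string {
  return `~$${(RATES[hardware] * avgSeconds).toFixed(4)}`;
}

console.log(estimateCost("gpu_a100_80", 8)); // "~$0.0112" for an average SDXL run
```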

For a deeper walkthrough including heartbeat metering for long-running jobs, reserved capacity modeling, and the Time Range Ingestion Blueprint, see the full [Replicate Billing Deconstruction](https://docs.dodopayments.com/developer-resources/billing-deconstructions/replicate) on Dodo's docs.

## When This Model Makes Sense for Your Platform

Replicate's per-second hardware billing is not the right fit for every AI product. It works best under specific conditions.

**Build this model if:**

- Your platform runs variable-duration workloads where execution time is unpredictable
- Users choose between different compute resources (CPU vs. GPU tiers)
- You want to pass hardware costs through directly without margin stacking
- Your target users are developers who understand infrastructure pricing
- You are building a model marketplace or inference platform that hosts many models

**Consider a different model if:**

- Your users are non-technical and expect simple, predictable pricing
- Workloads are uniform enough that per-request pricing would be simpler
- You need strong retention mechanics like [subscription tiers](https://dodopayments.com/blogs/subscription-model-best-practices) or credit systems
- Your platform runs a single model where hardware abstraction adds no value

Many platforms combine approaches - starting with pure usage-based pricing for [monetizing AI inference](https://dodopayments.com/blogs/monetize-ai) and later adding subscription tiers for enterprise customers. Dodo Payments supports both [usage-based](https://docs.dodopayments.com/features/usage-based-billing/introduction) and subscription billing on the same product, so you can evolve your pricing without [migrating billing systems](https://dodopayments.com/blogs/billing-system-migration-mistakes).

For more examples of how AI companies structure their billing, see our breakdowns of [OpenAI](https://dodopayments.com/blogs/openai-billing-model), [ElevenLabs](https://dodopayments.com/blogs/elevenlabs-billing-model), and [Midjourney](https://dodopayments.com/blogs/midjourney-billing-model). For a comparison of billing approaches across the AI industry, check out [AI pricing models explained](https://dodopayments.com/blogs/ai-pricing-models).

Ready to build usage-based billing for your own AI platform? [Dodo Payments](https://dodopayments.com) handles metering, invoicing, and [global tax compliance](https://dodopayments.com/blogs/how-to-avoid-global-tax-mistakes-solopreneur) so you can focus on your models. Check out [pricing](https://dodopayments.com/pricing) or explore the [developer docs](https://docs.dodopayments.com/developer-resources/billing-deconstructions/replicate) to get started.

## FAQ

### Does Replicate charge a monthly subscription fee?

No. Replicate uses pure usage-based billing with no monthly fee, no minimum spend, and no annual contract. You only pay for the compute seconds your predictions consume, billed at the per-second rate for whichever hardware tier your model runs on.

### Are cold starts included in Replicate's billing?

Yes. When a model has not been used recently, loading it into GPU memory takes 10-30 seconds, and that time is billed at the same per-second rate as execution time. Subsequent requests while the model stays warm avoid this cost.

### How does Replicate's pricing compare to hourly cloud GPU pricing?

Replicate's per-second billing is significantly cheaper for short inference tasks. An 8-second A100 job costs $0.0112 on Replicate. On a platform that bills by the minute or hour, you would pay for the full minimum billing increment even if you only used a fraction of it.

### Can I build Replicate-style per-second billing for my own AI platform?

Yes. Using Dodo Payments' [usage-based billing](https://docs.dodopayments.com/features/usage-based-billing/introduction), you can create separate meters for each hardware tier, set per-second rates, and ingest usage events after each prediction completes. The full implementation is documented in the [Replicate Billing Deconstruction](https://docs.dodopayments.com/developer-resources/billing-deconstructions/replicate).

### What is the difference between public and private model billing on Replicate?

Public models share infrastructure and you only pay for active processing time plus cold starts. Private models run on dedicated hardware, and you pay for all time the instance is online - including idle time. Fast booting fine-tunes are a hybrid option that uses shared infrastructure but only bills for active processing.

## Final Take

Replicate's billing model is infrastructure-honest pricing at its purest. By tying cost directly to hardware and time, users never overpay for compute they did not use. The tradeoff is unpredictable bills and zero built-in retention - but for a developer-focused infrastructure platform, those tradeoffs are worth making.

If you are building an [AI inference platform](https://dodopayments.com/blogs/monetize-fine-tuned-model-api), a [model marketplace](https://dodopayments.com/blogs/best-payment-api-ai-agents), or any product where compute costs vary by hardware and duration, this is the billing pattern to study. Pair it with [Dodo Payments' metering infrastructure](https://dodopayments.com/blogs/best-billing-platform-usage-based-pricing) to ship the same model without building billing from scratch.

---
- [More AI articles](https://dodopayments.com/blogs/category/ai)
- [All articles](https://dodopayments.com/blogs)