# Metering LLM Token Usage: An Architecture Guide for AI SaaS

> Build a metering pipeline for LLM token usage that survives retries, supports multiple providers, and feeds clean billing events. Architecture and pitfalls.
- **Author**: Ayush Agarwal
- **Published**: 2026-05-17
- **Category**: AI, Architecture, Usage-based Billing
- **URL**: https://dodopayments.com/blogs/metering-llm-token-usage-architecture

---

Metering LLM token usage looks simple on the surface. The provider returns a usage object on every completion. You take the input and output token counts and write them somewhere. At the end of the month you sum them up and charge. In practice the pipeline that does this reliably for a real product is a meaningful piece of architecture, and getting it wrong costs both money and trust.

This article walks through the architecture of an LLM metering pipeline that holds up under real conditions. Multiple providers, multi step agents, retries, partial failures, and cross region traffic all create edge cases that a naive implementation gets wrong. The framing assumes a SaaS or AI product, not ecommerce.

## What an LLM metering pipeline actually has to do

A useful pipeline does five things, all of them at once and all of them reliably.

It captures token counts at the boundary where the LLM call happens, including both input and output tokens, and including any reasoning tokens or background calls that were billed by the provider.

It attributes each token count to a specific customer, often a specific user within a customer, and optionally to a feature or surface within your product. Without attribution the data is useless for billing and only marginally useful for analytics.

It survives retries cleanly. If your code retries a failed completion, the meter must record the calls that actually succeeded and not double count. If your code retries an event ingestion call, the billing system must deduplicate.

It handles partial failures gracefully. If the provider call succeeds but your meter ingestion fails, the customer was already served. The system must reconcile rather than lose the event or block the user.

It supports the billing primitives you actually use. If your billing model is per token with a quota and overage, the meter has to feed into the right shape. If your model includes capacity packs or prepay credits, the meter has to interact with those balances.

A pipeline that does all five well is robust, predictable, and cheap to operate. A pipeline that does any of them poorly leaks money or confuses customers.
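
To make the rest of the article concrete, here is roughly what a single usage event can carry, as a TypeScript sketch. The field names are illustrative rather than a prescribed schema, but each one maps to one of the five responsibilities above.

```ts
// Illustrative shape of a single usage event; field names are examples, not a required schema.
interface UsageEvent {
  eventId: string;       // stable per logical call; doubles as the idempotency key
  eventName: string;     // the meter event name, e.g. "llm.tokens"
  customerId: string;    // who the usage is attributed to and billed against
  userId?: string;       // optional finer grained attribution within the customer
  feature?: string;      // which surface of the product made the call
  model: string;         // provider model name as returned in the response
  inputTokens: number;   // taken from the provider's usage object, not estimated
  outputTokens: number;
  timestamp: string;     // ISO 8601 in UTC, to avoid date boundary drift
}
```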

## The architectural layers

Most real pipelines fall into four layers, even when the implementations vary.

### The instrumentation layer

This is the code that wraps the LLM call and observes what happened. The cleanest pattern is a thin wrapper around your provider client that intercepts requests and responses, records the usage object that the provider returns, and continues without changing the call semantics. It runs synchronously with the LLM call but should not block it. If the wrapper fails internally, the LLM call still succeeds and the failure is logged.

The wrapper captures input tokens, output tokens, the model name, the customer identifier, and optionally a feature tag. The customer identifier comes from your application context. The feature tag describes which part of your product the call belongs to, which is essential for analytics later.

For multi provider products the wrapper normalises across providers. OpenAI, Anthropic, Google, and AI SDK all return usage in slightly different shapes. The wrapper produces a uniform internal representation so downstream layers do not have to special case provider details.
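
A minimal sketch of that normalisation step, assuming the usage shapes the OpenAI and Anthropic APIs return today (`prompt_tokens`/`completion_tokens` and `input_tokens`/`output_tokens` respectively), might look like this; verify the exact field names against the client versions you actually ship, since they differ across providers and SDK releases.

```ts
// Uniform internal representation the rest of the pipeline works with.
interface NormalisedUsage {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Provider usage shapes as returned by the OpenAI and Anthropic APIs today;
// confirm against the client versions you ship before relying on these names.
type OpenAIUsage = { prompt_tokens: number; completion_tokens: number };
type AnthropicUsage = { input_tokens: number; output_tokens: number };

function fromOpenAI(model: string, usage: OpenAIUsage): NormalisedUsage {
  return { model, inputTokens: usage.prompt_tokens, outputTokens: usage.completion_tokens };
}

function fromAnthropic(model: string, usage: AnthropicUsage): NormalisedUsage {
  return { model, inputTokens: usage.input_tokens, outputTokens: usage.output_tokens };
}
```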

### The transport layer

This is the code that gets the usage event from your application to your billing system. The naive version is a direct API call from the wrapper to the billing API. This works at small scale and breaks at larger scale or under failure conditions.

The robust version uses a queue. The wrapper writes to a local queue. A worker consumes the queue and submits batches to the billing API. The queue absorbs traffic spikes and provides retry semantics independently of the LLM call. If the billing API is briefly unavailable, the queue holds the events and the worker drains them when it recovers.

For products with significant traffic the queue is usually a managed service. Cloud Pub/Sub, SQS, Kafka, or similar. For smaller products an in process queue with persistent storage works, as long as the persistence is real. Losing the queue contents on a restart is a billing data loss event.
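
A sketch of that decoupling is below, with a deliberately generic queue interface standing in for whatever transport you choose, reusing the illustrative `UsageEvent` shape from earlier. The only thing that runs inline with the LLM request is the enqueue; everything else happens in the worker.

```ts
// Generic queue contract; back it with SQS, Pub/Sub, Kafka, or a database table.
interface EventQueue {
  enqueue(event: UsageEvent): Promise<void>;        // called from the wrapper; must be fast and non blocking
  dequeueBatch(max: number): Promise<UsageEvent[]>; // called from the worker
  ack(events: UsageEvent[]): Promise<void>;         // remove events only after successful ingestion
}

// Worker loop: drain the queue in batches and hand them to the ingestion step.
async function runWorker(
  queue: EventQueue,
  ingest: (batch: UsageEvent[]) => Promise<void>,
): Promise<never> {
  while (true) {
    const batch = await queue.dequeueBatch(100);
    if (batch.length === 0) {
      await new Promise((resolve) => setTimeout(resolve, 1000)); // idle backoff
      continue;
    }
    try {
      await ingest(batch);    // submit to the billing API with idempotency keys
      await queue.ack(batch); // acknowledge only once ingestion succeeded
    } catch (err) {
      // the batch stays on the queue and is retried on the next iteration;
      // persistent failures should surface through monitoring
      console.error("usage ingestion failed, keeping batch on queue", err);
    }
  }
}
```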

### The aggregation layer

This is where events become billable units. In Dodo Payments and similar platforms, this layer is hosted by the billing system itself. You define meters in the platform that describe how to aggregate raw events into a billable quantity. A meter for token usage is typically a sum aggregation over an input tokens, output tokens, or total tokens property on the event.

The aggregation runs continuously. As events arrive, the meter updates the running totals for each customer. Subscriptions and quotas reference these totals. Overage triggers when totals exceed quotas. Invoices read from the totals at cycle end.

You can keep your own aggregation in your own database for analytics. Many teams do. The billing system is the source of truth for what gets charged. Your local copy is for dashboards and product features.
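
If you keep a local copy, the rollup itself is simple. A sketch of a per customer sum over a billing window, again using the illustrative event shape from earlier, is enough to drive a usage dashboard; the billing platform's meter remains the number the invoice is built from.

```ts
// Local analytics rollup: sum tokens per customer over a billing window.
// This mirrors the platform meter but is not the source of truth for invoices.
function rollupTokens(
  events: UsageEvent[],
  from: Date,
  to: Date,
): Map<string, { inputTokens: number; outputTokens: number }> {
  const totals = new Map<string, { inputTokens: number; outputTokens: number }>();
  for (const event of events) {
    const ts = new Date(event.timestamp);
    if (ts < from || ts >= to) continue; // half open window [from, to)
    const current = totals.get(event.customerId) ?? { inputTokens: 0, outputTokens: 0 };
    current.inputTokens += event.inputTokens;
    current.outputTokens += event.outputTokens;
    totals.set(event.customerId, current);
  }
  return totals;
}
```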

### The reconciliation layer

This is the layer that catches mistakes. Every meaningful pipeline has a reconciliation job that runs daily or hourly. It compares the provider invoices, the events ingested into the billing system, and the local copy in your database. Discrepancies are logged and investigated.

Real causes of discrepancies include events lost in transit due to a network partition, events double counted because of a retry without idempotency, model changes at the provider that change the token counting rules, and timezone mismatches in date boundary aggregations. None of these are exotic. All of them happen.

The reconciliation job is the difference between a pipeline that is correct and a pipeline that thinks it is correct. Without reconciliation, errors accumulate silently.
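
The comparison itself does not need to be sophisticated. A sketch, assuming you can already export daily totals from the provider, the billing system, and your own database (those exports are the real work and are left out here), looks something like this.

```ts
// One row per day, per source; how you fetch these totals depends on your
// provider's usage export and your billing platform's API.
interface DailyTotals {
  date: string;           // YYYY-MM-DD in one agreed timezone
  providerTokens: number; // from the provider's usage or invoice export
  billedTokens: number;   // from the billing system's meter totals
  localTokens: number;    // from your own database copy
}

// Flag any day where the three sources disagree by more than the tolerance.
function reconcile(days: DailyTotals[], tolerance = 0.01): DailyTotals[] {
  return days.filter((day) => {
    const values = [day.providerTokens, day.billedTokens, day.localTokens];
    const max = Math.max(...values);
    const min = Math.min(...values);
    return max > 0 && (max - min) / max > tolerance;
  });
}
```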

## Common architectural mistakes

Several patterns reliably cause pain and are worth avoiding from day one.

Counting tokens on the way in rather than on the way out. Estimating tokens from the prompt and reply lengths is tempting because it is simple, but the actual provider count is the only number that matches the provider invoice. Always read the usage object from the provider response.

Skipping the customer identifier on system level calls. Background summarisation, embedding refreshes, and other internal calls still cost money. If you do not tag them with a customer identifier or an internal tag, they show up as unattributed cost. Some teams attribute background work to the user who triggered it. Others attribute it to a synthetic system customer. Either is fine. Untagged is not.

Using request retries without idempotency. If your code retries failed billing API calls without an idempotency key, you can end up double counting events. The billing system has no way to know the second call was a retry rather than a separate event. Pass a stable idempotency key per logical event.
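
A small sketch of the principle: assign the key once, when the logical event is created, so every retry of the ingestion call reuses it. The helper below builds on the illustrative `UsageEvent` shape from earlier.

```ts
import { randomUUID } from "node:crypto";

// Assign the idempotency key once, when the logical event is created.
// Every retry of the ingestion call for this event reuses the same eventId.
function buildEvent(fields: Omit<UsageEvent, "eventId" | "timestamp">): UsageEvent {
  return { ...fields, eventId: randomUUID(), timestamp: new Date().toISOString() };
}

// Anti pattern: generating the key at send time makes every retry look like a new event.
// const key = randomUUID(); // do not do this inside the retry loop
```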

Coupling the meter directly to the LLM call without a queue. If the billing API is slow or briefly down, your LLM calls slow down or fail. Customers feel the outage even though the LLM is working fine. Decouple the two through a queue.

Forgetting to handle multi step agent calls. A single user request might trigger six provider calls. If you only meter the user facing call, you undercount. If you double count by adding both the parent and children, you overcount. Decide on a clear policy. Most teams meter every actual provider call and roll up to the user request only for analytics, not for billing.
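
One way to implement that policy, continuing the earlier sketches: every underlying provider call emits its own event, and an extra `parentRequestId` tag (an illustrative name, not a required field) lets analytics roll the fan out back up to the user request without changing what is billed.

```ts
// One event per actual provider call; parentRequestId is analytics metadata, not a billing key.
async function meterAgentStep(
  parentRequestId: string,
  customerId: string,
  callProvider: () => Promise<NormalisedUsage>, // from the normalisation sketch above
  emit: (event: UsageEvent & { parentRequestId: string }) => Promise<void>,
): Promise<NormalisedUsage> {
  const usage = await callProvider();
  const event = buildEvent({
    eventName: "llm.tokens",
    customerId,
    model: usage.model,
    inputTokens: usage.inputTokens,
    outputTokens: usage.outputTokens,
  });
  // Billing sums the individual events; analytics can group by parentRequestId to see the fan out.
  await emit({ ...event, parentRequestId });
  return usage;
}
```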

Not testing the pipeline under failure. The pipeline that works in steady state is the easy one. The interesting question is what happens when the queue fills up, when the billing API returns errors, when the LLM provider has a partial outage, when your worker crashes. Run failure tests deliberately and watch what happens.

## How Dodo Payments fits into this architecture

Dodo Payments provides the aggregation layer and an ingestion API for the transport layer. You define a meter in the dashboard that describes the aggregation, you call the events ingestion API from your worker, and the platform handles the rest.

For LLM specific use cases the platform offers ingestion blueprints, which are SDKs that wrap common LLM clients and emit usage events automatically. The blueprints support AI SDK, OpenAI, Anthropic, OpenRouter, Groq, and Google Gemini. Each blueprint normalises the provider's usage object and emits an event with the right customer identifier and event name.

The architectural advantage is that the wrapper, transport, and aggregation are all production tested. You add the SDK to your application, configure your API key and meter event name, and your usage is automatically tracked. You still need the reconciliation layer on top, and you still need to make decisions about attribution and multi step agents, but the plumbing is solved.

The full reference for the LLM blueprint and the events ingestion API lives in the [LLM ingestion blueprint](https://docs.dodopayments.com/developer-resources/ingestion-blueprints/llm) and the [usage based billing guide](https://docs.dodopayments.com/developer-resources/usage-based-billing-guide). This article gives you the architecture. The docs give you the implementation.

## A reference architecture

Putting all the layers together, a clean LLM metering pipeline looks roughly like this.

Your application code calls a wrapped LLM client for every completion. The wrapper captures the usage object from the provider response. The wrapper writes a structured event to a local queue. The event includes the customer identifier, the meter event name, the input and output tokens, the model name, and any feature tags.

A worker consumes the queue. The worker batches events and submits them to the events ingestion API with idempotency keys. The worker retries on transient failures and surfaces persistent failures to your monitoring. Successfully ingested events are removed from the queue.

The billing system aggregates events into meters. Meters feed into subscriptions and overage rules. Invoices are produced at cycle end based on meter totals.

A reconciliation job runs daily. It compares the events you sent to the billing system against your local copy and against the provider invoice. Discrepancies are logged. A human investigates anything that exceeds an alert threshold.

Your application reads meter totals through the billing system API to show usage in your product user interface. This is the same data that drives invoices, so the customer never sees one number in your product and a different number on their bill.

## Operational hygiene

A few habits make the pipeline work over the long run.

Set up alerts on queue depth. Growing queue depth means the worker cannot keep up with incoming events. Investigate before it becomes a billing delay.

Set up alerts on ingestion error rates. Persistent errors usually mean a configuration drift between your meter event names and what your worker is sending.

Run the reconciliation job and look at it. The job is only useful if a human reads the report. Build it into a regular ops rhythm.

Tag every event with the application version that produced it. When you change models or prompt structure, the resulting shifts in token counts show up in the data. Without versioning you cannot tell whether a change in average tokens per request is a model change, a code change, or a customer behaviour change.

Document the policy on multi step agents and background calls. If the policy is in someone's head, it will drift. Write it down and reference it from the wrapper code so future engineers see it.

## Closing thought

LLM metering looks like a small problem and turns out to be a real one. The simple version, count tokens and write them somewhere, works for the first month and starts breaking by the third. The robust version has clear instrumentation, a queue, a hosted aggregation layer, and a reconciliation job. None of the layers are individually complex. Together they are the difference between a billing system you trust and one that drifts away from reality.

If you are building an AI product right now, get this architecture right early. The cost of building it correctly the first time is much smaller than the cost of cleaning up six months of drifted data. The patterns described here are not exotic. They are the same patterns that observability and analytics systems have used for years. Apply them to your billing data and you will find that the rest of the billing surface gets much easier to reason about.

## FAQ

### Should I meter input tokens, output tokens, or both?

Both, separately. Input and output tokens have different cost structures at most providers, and your billing model usually wants to apply different rates or weights to each. Storing them as separate properties on the meter event lets you compute either or both depending on what you charge for.
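
For example, with hypothetical per thousand token rates, keeping the two counts separate is what makes a weighted charge possible at all.

```ts
// Hypothetical per thousand token rates; real rates live in your pricing configuration.
const INPUT_RATE_PER_1K = 0.0005;
const OUTPUT_RATE_PER_1K = 0.0015;

function tokenCharge(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * INPUT_RATE_PER_1K + (outputTokens / 1000) * OUTPUT_RATE_PER_1K;
}
```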

### How do I handle multi step agents that fan out to many provider calls?

Meter every actual provider call so the cost data matches the provider invoice. For billing, decide whether to charge per agent run or per underlying call. Per agent run is simpler for the customer but requires you to absorb variance. Per underlying call exposes the variance to the customer but matches your costs.

### What happens if I miss an event?

Missing events mean lost revenue. The reconciliation layer catches missed events by comparing what you have to what the provider invoice shows. When the gap is small, you absorb it. When it is large, the reconciliation alerts you to investigate the pipeline. Either way you need the visibility.

### How important is the queue between the wrapper and the billing API?

Very. Without a queue, every billing API hiccup becomes an LLM call hiccup that customers feel. With a queue, the billing pipeline can absorb minutes of upstream slowness without affecting product behaviour. For any product at meaningful scale, the queue is essential.

### Can I use the same pipeline for analytics and billing?

The same events can feed both, but the source of truth for billing should be the platform that issues the invoices. Your local copy is fine for analytics dashboards. When the two diverge, the platform side is what the customer was charged. Treat the divergence as a signal that the reconciliation job needs to run.

---
- [More AI articles](https://dodopayments.com/blogs/category/ai)
- [All articles](https://dodopayments.com/blogs)