# How We Serve Markdown to AI Agents at the Edge

> Every page on our website exists twice - HTML for humans, markdown for AI agents. Here's how we built a dual-format serving system on Cloudflare Workers that detects AI bots, serves pre-built markdown, and tracks consumption with Analytics Engine.
- **Author**: Ayush Agarwal
- **Published**: 2026-04-17
- **Category**: Infrastructure
- **URL**: https://dodopayments.com/blogs/serving-markdown-ai-agents-edge

---

We noticed ChatGPT was telling users that Dodo Payments "doesn't support subscriptions." It does. We've had subscription billing for over a year. So we asked the obvious question: where is this coming from?

The answer was depressing. GPTBot was crawling our marketing site, hitting a page with React islands that hadn't hydrated, and seeing empty `<div>` tags where the subscription billing content should have been. It was building its understanding of our product from a half-rendered HTML skeleton.

This wasn't an isolated case. Perplexity was returning mangled pricing tables. Claude was missing entire product categories. Every AI system was scraping the same HTML and parsing it differently - all of them getting it wrong in unique ways.

We needed a fundamentally different approach.

## The Problem With HTML Scraping

AI bots scraping HTML face three problems that no amount of SSR optimization can fully solve:

```
Why HTML Scraping Fails for AI Agents
═══════════════════════════════════════════════════════════════════

Problem 1: Rendering Gaps
──────────────────────────
  Browser                          AI Bot
  ┌──────────────────────┐        ┌──────────────────────┐
  │ Load HTML            │        │ Load HTML            │
  │ Execute JavaScript   │        │ See raw DOM          │
  │ Hydrate React        │        │ No JS execution      │
  │ See full content     │        │ Empty placeholders   │
  └──────────────────────┘        └──────────────────────┘

Problem 2: Noise Ratio
───────────────────────
  Useful content:     ~15% of HTML bytes
  Navigation/footer:  ~25%
  CSS/JS references:  ~30%
  DOM structure:      ~30%

Problem 3: Inconsistent Parsing
───────────────────────────────
  GPTBot:        strips nav, keeps tables (usually)
  ClaudeBot:     extracts text, loses structure
  PerplexityBot: keeps headings, mangles lists
  Result:        each AI has a different (wrong) view
```

We tried making our HTML more bot-friendly. Added more server-side rendering. Moved content out of React islands. Added `aria-label` attributes everywhere. It helped marginally, but the fundamental issue remained: HTML is designed for visual rendering, not machine consumption.

## The Mental Shift

We stopped trying to make HTML work for bots and asked a different question: what format do AI systems actually want?

The answer was obvious once we said it out loud. These are language models. They want text. Structured, clean, well-organized text with headings and links. They want markdown.

```
Old Approach (Fighting Scrapers)
═══════════════════════════════════════════════════════════════════

  AI Bot                         Our Website
  ┌────────────┐                ┌────────────────────────────┐
  │ GPTBot     │ ── GET ──────► │ HTML page (full layout)    │
  │            │ ◄── response── │ Nav + CSS + JS + React     │
  │            │                │ 85% noise, 15% content     │
  │ Parse HTML │                └────────────────────────────┘
  │ Hope for   │
  │ the best   │
  └────────────┘

New Approach (Serving What They Want)
═══════════════════════════════════════════════════════════════════

  AI Bot                         Our Edge Worker
  ┌────────────┐                ┌────────────────────────────┐
  │ GPTBot     │ ── GET ──────► │ Detect: AI bot?            │
  │            │ ◄── response── │ YES -> serve .md file      │
  │            │                │ 100% content, 0% noise     │
  │ Clean      │                └────────────────────────────┘
  │ markdown   │
  │ Perfect    │
  └────────────┘
```

The insight was simple: **build a markdown version of every page at compile time, and serve it to bots instead of HTML.**

## Building Dual-Format Pages

Our website is built with Astro 5 - a static site generator that outputs HTML. We extended the build to also produce markdown for every content page.

Each page gets two source files:

```
Dual-Endpoint Build Pattern
═══════════════════════════════════════════════════════════════════

  Source Files                    Build Output
  ┌─────────────────────┐       ┌─────────────────────────┐
  │ [slug].astro        │──────►│ /tax/gst-india.html     │
  │ (HTML template)     │       │ (full page, CSS, layout)│
  ├─────────────────────┤       ├─────────────────────────┤
  │ [slug].md.ts        │──────►│ /tax/gst-india.md       │
  │ (markdown endpoint) │       │ (clean text, headings)  │
  └─────────────────────┘       └─────────────────────────┘
           │                              │
           └──── shared ──────────────────┘
                getStaticPaths()
                Content collection data
```

Both files share the same `getStaticPaths()` function and pull from the same content collection. The markdown endpoint is tiny - roughly 17 lines of code. It calls a converter function that transforms the content entry into clean, structured markdown.
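
A minimal sketch of what that endpoint can look like, assuming a `tax` collection and an `entryToMarkdown` helper (both names illustrative; the converter is described below):

```ts
// src/pages/tax/[slug].md.ts -- a sketch; entryToMarkdown is our own
// converter module (name illustrative), not an Astro built-in.
import type { APIRoute, GetStaticPaths } from "astro";
import { getCollection } from "astro:content";
import { entryToMarkdown } from "../../lib/markdown";

// Same routes as the HTML template: one path per content entry.
export const getStaticPaths: GetStaticPaths = async () => {
  const entries = await getCollection("tax");
  return entries.map((entry) => ({
    params: { slug: entry.id },
    props: { entry },
  }));
};

// Emitted at build time as a static /tax/<slug>.md file.
export const GET: APIRoute = ({ props }) => {
  return new Response(entryToMarkdown(props.entry), {
    headers: { "Content-Type": "text/markdown; charset=utf-8" },
  });
};
```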

The markdown converter handles the tedious but critical work:

- Strips all HTML artifacts
- Normalizes 30+ Unicode characters to ASCII (typographic quotes to straight quotes, currency symbols like ₹ to "INR", em dashes to hyphens)
- Cleans up Mathematical Bold Unicode characters that sneak in from copy-paste
- Appends cross-links to related content
- Adds a standard platform context section at the end

That Unicode normalization turned out to be far more important than we expected. AI models handle `₹15,000` and `INR 15,000` very differently when generating responses. The ASCII version is consistently parsed correctly.
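
For illustration, a condensed sketch of that normalization pass; the map below is a small excerpt, not our full 30+ character table:

```ts
// A small excerpt of the map; the real table covers 30+ characters.
const ASCII_MAP: Record<string, string> = {
  "\u2018": "'", "\u2019": "'", // typographic single quotes
  "\u201C": '"', "\u201D": '"', // typographic double quotes
  "\u2013": "-", "\u2014": "-", // en and em dashes
  "\u20B9": "INR ",             // rupee sign
  "\u00A0": " ",                // non-breaking space
};

function normalizeToAscii(text: string): string {
  let out = text.replace(
    /[\u2018\u2019\u201C\u201D\u2013\u2014\u20B9\u00A0]/g,
    (ch) => ASCII_MAP[ch]
  );
  // Fold Mathematical Bold letters (U+1D400-U+1D433) back to plain A-Z/a-z.
  out = out.replace(/[\u{1D400}-\u{1D433}]/gu, (ch) => {
    const cp = ch.codePointAt(0)!;
    return cp <= 0x1d419
      ? String.fromCharCode(0x41 + (cp - 0x1d400)) // bold capitals
      : String.fromCharCode(0x61 + (cp - 0x1d41a)); // bold lowercase
  });
  return out;
}
```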

## Detecting Bots at the Edge

With markdown files built and deployed as static assets, we needed a way to serve them to the right clients. We built a Cloudflare Worker that sits in front of everything.

```
Request Flow Through the Edge Worker
═══════════════════════════════════════════════════════════════════

  Incoming Request
       │
       ▼
  ┌─────────────────────────────┐
  │  Trailing slash?            │──── YES ──► 301 redirect
  │  /path/ -> /path            │
  └─────────────┬───────────────┘
                │ NO
                ▼
  ┌─────────────────────────────┐
  │  Skip path?                 │──── YES ──► Pass to Astro
  │  /admin, /api, /_*, assets  │
  └─────────────┬───────────────┘
                │ NO
                ▼
  ┌─────────────────────────────┐
  │  AI bot detected?           │──── NO ───► Pass to Astro
  │  User-Agent OR Accept header│             (add Link header)
  └─────────────┬───────────────┘
                │ YES
                ▼
  ┌─────────────────────────────┐
  │  Fetch /path.md from assets │──── 404 ──► Check redirects
  │                             │             │
  └─────────────┬───────────────┘             ▼
                │ 200                   ┌───────────┐
                ▼                       │ Redirect  │
  ┌─────────────────────────────┐       │ map found?│
  │  Return markdown response   │       └─────┬─────┘
  │  + Track in Analytics Engine│             │ YES
  └─────────────────────────────┘             ▼
                                        Fetch target's .md
```

The `run_worker_first: true` configuration in Wrangler is what makes this possible. Every single request - including requests for static assets - hits our Worker before the default asset serving layer. This gives us the interception point we need.
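
The relevant configuration is a few lines. This sketch assumes a Workers Assets setup with the build output in `./dist`; directory and binding names are whatever your project uses:

```toml
# wrangler.toml (sketch) -- route every request through the Worker
# before the static asset layer gets a chance to respond.
[assets]
directory = "./dist"
binding = "ASSETS"
run_worker_first = true
```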

Bot detection checks the User-Agent string against 17 known AI crawlers: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Anthropic-ai, Claude-Web, PerplexityBot, Google-Extended, Applebot-Extended, cohere-ai, and several others. We also check the `Accept` header for `text/markdown` - this catches developer tools and custom integrations that explicitly request markdown.
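
A sketch of that check; the list below is a subset of the full pattern set, and the function name is illustrative:

```ts
// A subset of the UA patterns we match (illustrative, not the full 17).
const AI_BOTS = [
  "gptbot", "chatgpt-user", "oai-searchbot", "claudebot", "anthropic-ai",
  "claude-web", "perplexitybot", "google-extended", "applebot-extended",
  "cohere-ai",
];

function isAIBot(request: Request): boolean {
  const ua = (request.headers.get("User-Agent") ?? "").toLowerCase();
  const accept = request.headers.get("Accept") ?? "";
  // Plain substring checks: a string comparison, not a regex.
  return AI_BOTS.some((bot) => ua.includes(bot)) || accept.includes("text/markdown");
}
```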

## Response Headers That Matter

When we serve markdown, we set headers that signal intent to both bots and search engines:

```
Markdown Response Headers
═══════════════════════════════════════════════════════════════════

  Content-Type:       text/markdown; charset=utf-8
  X-Robots-Tag:       noindex          <- Prevents Google from indexing .md
  X-Markdown-Tokens:  1,847            <- Token count estimate
  Cache-Control:      public, max-age=3600
```

The `X-Markdown-Tokens` header deserves explanation. It contains a word-count-based token estimate for the response body. AI agents with limited context windows can read this header before consuming the body and decide whether they have room for it. It's a small thing, but it's the kind of signal that makes your content easier for AI systems to work with.
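
The estimate itself is deliberately cheap. A sketch, assuming the common rough heuristic of about 0.75 words per token (not a real tokenizer):

```ts
// Word-count heuristic (~0.75 words per token); an estimate, not a tokenizer.
function estimateTokens(markdown: string): number {
  const words = markdown.split(/\s+/).filter(Boolean).length;
  return Math.ceil(words / 0.75);
}
```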

Every HTML response also gets a `Link` header pointing to the markdown version:

```
Link: </pricing.md>; rel="alternate"; type="text/markdown"
```

This is the discovery mechanism. Even if a bot doesn't know about our markdown endpoints, the `Link` header tells it an alternative representation exists. It follows the same pattern as `rel="alternate"` for language variants or RSS feeds.
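
Attaching it in the Worker takes a couple of lines. A sketch, assuming `ASSETS` is the asset binding from the Wrangler config above (function name illustrative):

```ts
// A sketch: pass the request through to the static assets and advertise
// the markdown alternate. env.ASSETS is the asset binding from wrangler.
async function serveHtmlWithLink(
  request: Request,
  env: { ASSETS: Fetcher }
): Promise<Response> {
  const html = await env.ASSETS.fetch(request);
  const response = new Response(html.body, html); // re-wrap so headers are mutable
  const path = new URL(request.url).pathname;
  response.headers.set("Link", `<${path}.md>; rel="alternate"; type="text/markdown"`);
  return response;
}
```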

## Tracking What AI Systems Consume

We couldn't improve what we couldn't measure. Cloudflare Analytics Engine gave us zero-latency write-and-forget tracking for every AI request.

```
Analytics Engine Schema
═══════════════════════════════════════════════════════════════════

  ┌─────────────────────────────────────────────────────────┐
  │  writeDataPoint({                                       │
  │    indexes:  [botName],          // sampling key        │
  │    blobs:    [                                          │
  │      botName,                    // "GPTBot"            │
  │      pagePath,                   // "/pricing"          │
  │      country,                    // "US" (cf-ipcountry) │
  │      hitOrMiss,                  // "hit" or "miss"     │
  │      userAgent.slice(0, 256)     // truncated UA        │
  │    ],                                                   │
  │    doubles:  [                                          │
  │      tokenCount,                 // word count of .md   │
  │      1                           // counter for SUM     │
  │    ]                                                    │
  │  })                                                     │
  └─────────────────────────────────────────────────────────┘
```

The `hit` vs `miss` distinction tells us whether a `.md` file existed for the requested path. A high miss rate on a particular path means we're missing markdown coverage there.

Now we can answer questions we couldn't before: Which AI bot reads our pricing page most? What countries are AI queries coming from? How many tokens does our average page consume? Which pages have no markdown coverage?

## Handling Redirects for Bots

We maintain 60+ URL redirects for legacy paths - old `/features/*` URLs, renamed blog slugs, restructured product pages. For human browsers, these are simple 301/302 redirects. But for AI bots, an HTML redirect page is useless.

When a bot hits a redirected path, our Worker looks up the redirect target, fetches that target's `.md` file from static assets, and serves it directly. The response includes `X-Redirect-From` and `X-Redirect-To` headers so the bot knows the canonical URL has changed.
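
A sketch of that path, with an illustrative one-entry redirect map standing in for our 60+ entry module:

```ts
// The single map entry here is illustrative; ours lives in a shared
// module with 60+ redirects.
const REDIRECTS: Record<string, string> = {
  "/features/billing": "/products/subscription-billing",
};

async function serveRedirectedMarkdown(
  request: Request,
  env: { ASSETS: Fetcher }
): Promise<Response | null> {
  const from = new URL(request.url).pathname;
  const to = REDIRECTS[from];
  if (!to) return null;
  // Fetch the redirect target's markdown straight from static assets.
  const md = await env.ASSETS.fetch(new URL(`${to}.md`, request.url).toString());
  if (md.status !== 200) return null;
  const response = new Response(md.body, md);
  response.headers.set("X-Redirect-From", from);
  response.headers.set("X-Redirect-To", to);
  return response;
}
```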

For external redirects (pointing outside our domain), we return a short markdown notice explaining the redirect instead of an empty 302 response.

## Making Ourselves Visible

Serving markdown is useless if bots can't crawl us. Our `robots.txt` is generated dynamically - on production, it explicitly allows every major AI crawler:

```
# Production robots.txt (simplified)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# ... and several more

Sitemap: https://dodopayments.com/sitemap-index.xml
```

On non-production environments, everything is blocked with `Disallow: /`. We don't want AI systems training on our staging content.

We also maintain `llms.txt` and `llms-full.txt` files - the emerging convention for telling AI systems about your site structure in a format they can consume directly.

## What We Got Wrong

### Server-Side Rendering for Bots

Our first attempt was to detect bots and serve a fully server-rendered version of the HTML page. The theory was that SSR would eliminate the React hydration problem. In practice, SSR added 200-400ms of latency per request, the HTML was still noisy (navigation, footer, CSS classes), and we were maintaining two rendering paths for the same content. We abandoned this after two weeks.

### Underestimating Redirects

We launched with markdown serving but no redirect handling. Bots hitting old URLs got empty 404 responses. Since AI systems cache aggressively, these 404s persisted in their indices for weeks. We didn't realize how many legacy URLs were still being crawled until we checked the Analytics Engine data and saw hundreds of daily misses on `/features/*` paths that had been renamed months ago.

### Unicode Normalization as an Afterthought

We initially shipped markdown with the original Unicode characters from our content. Then we noticed AI responses quoting our text with garbled currency symbols. A ₹ sign would sometimes become a question mark, sometimes get dropped entirely. The normalization layer - converting all non-ASCII characters to their closest ASCII equivalents - was a weekend fix that should have been a launch requirement.

## The Results

After three months of running dual-format serving:

- **AI accuracy improved visibly.** ChatGPT, Claude, and Perplexity now correctly describe our subscription billing, credit-based billing, and merchant-of-record features. We spot-check weekly.
- **Bot traffic is fully visible.** We know exactly which AI systems visit which pages, from which countries, and how many tokens they consume per visit.
- **Zero impact on human users.** The Worker adds <1ms of latency for non-bot requests. Bot detection is a string comparison, not a regex.
- **Markdown miss rate dropped to under 2%.** We monitor the hit/miss ratio and add coverage for any page that gets consistent bot traffic without a `.md` endpoint.

## Should You Do This?

**Makes sense if:**

- You have a content-heavy site (hundreds of pages with structured content)
- You care about how AI systems represent your product
- You're already using a static site generator that can output multiple formats
- You're on Cloudflare Workers or a similar edge runtime that can intercept requests

**Don't build this if:**

- You have fewer than 20 pages - the overhead isn't worth it
- Your content changes hourly - static markdown endpoints need rebuilds
- You don't have an edge runtime - doing this at the origin server adds latency

The honest answer: if AI systems are already scraping your site (check your access logs for GPTBot and ClaudeBot), they're building a representation of your product whether you like it or not. You can either let them figure it out from your HTML, or you can serve them exactly what you want them to know.

## Key Takeaways

1. **AI bots don't want HTML.** They want clean, structured text. Markdown is the natural format for language model consumption.

2. **Build markdown at compile time, not request time.** Static `.md` files served from the edge are faster and more reliable than any dynamic rendering approach.

3. **The edge worker pattern is powerful.** `run_worker_first` gives you a single interception point for all requests - bot detection, format switching, analytics, and redirects in one place.

4. **Track everything.** Without Analytics Engine data, we had no idea which bots were visiting, which pages they consumed, or where we had coverage gaps.

5. **Unicode normalization is not optional.** AI systems handle ASCII far more consistently than Unicode. Normalize before you serve.

6. **Redirects matter more than you think.** AI systems cache aggressively. A 404 on a legacy URL can persist in their indices for weeks. Handle redirects at the markdown level, not just HTML.

7. **`X-Markdown-Tokens` is a small header with outsized utility.** It lets AI agents make context-budget decisions before consuming your content.

_We're building payment infrastructure at Dodo Payments. If edge computing, AI systems, and content infrastructure sound interesting, we're hiring._

---
- [More Infrastructure articles](https://dodopayments.com/blogs/category/infrastructure)
- [All articles](https://dodopayments.com/blogs)