
AI Search Crawlers and Your Ecommerce Store: The 2026 Robots.txt Field Guide

By Leo Nguyen · Jun 18, 2026 · 11 min read
Short answer

Six AI user-agents matter for ecommerce in 2026, and they split into two jobs: training the model and fetching pages live when a user asks a question. GPTBot, ClaudeBot, and PerplexityBot are training and indexing crawlers. OAI-SearchBot, ChatGPT-User, and Claude-User are live-retrieval bots — they fetch your page in real time when someone asks ChatGPT or Claude a question that references your store. Blocking the training bots does not block the live bots. For most stores, the cleaner default in 2026 is to allow live-retrieval bots so your pages can still be cited in real-time answers, and to decide separately whether to allow training bots based on competitive and IP concerns. Pair robots.txt with a short llms.txt that points the allowed bots at your pillar content.

Quick diagnosis

  • Open https://yourstore.com/robots.txt in a browser. If you see only User-agent: * blocks, you have no AI-specific rules and you are giving every crawler the same default treatment.
  • Check whether your robots.txt has explicit User-agent: GPTBot, User-agent: ClaudeBot, or User-agent: PerplexityBot entries. If not, those bots fall back to whatever your User-agent: * group says.
  • Test one URL with curl and a spoofed user-agent to confirm the page is reachable to an AI fetch: curl -I -A "OAI-SearchBot" https://yourstore.com/your-pillar-page. A 200 response means no server, CDN, or firewall rule is rejecting that user-agent string. It does not test robots.txt, which curl never reads.

Three checks. Under five minutes.
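
The second check can also be scripted. The following is a minimal sketch, assuming you have already saved the contents of your robots.txt into a string; the token list mirrors the user-agents discussed in this guide, and the function name named_ai_groups is made up for illustration.

```python
# Tokens discussed in this guide; extend the list as new crawlers appear.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Google-Extended",
]

def named_ai_groups(robots_txt):
    """Return the AI tokens that appear on a User-agent line in robots_txt."""
    declared = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            declared.add(value.strip().lower())
    return [token for token in AI_TOKENS if token.lower() in declared]

# Hypothetical file with a wildcard group and one AI-specific group.
sample = "User-agent: *\nDisallow: /admin\n\nUser-agent: GPTBot\nDisallow: /\n"
print(named_ai_groups(sample))
```

An empty list means the file has no AI-specific groups, so every AI crawler is being handled by the wildcard rules.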

Why this got complicated in 2024 and 2025

Before late 2023, AI crawlers were mostly a single user-agent per company. OpenAI shipped GPTBot in August 2023. Google added Google-Extended in September 2023 as a way to opt your site out of Bard and Gemini training without blocking Googlebot. Anthropic published their crawler documentation in 2024 covering ClaudeBot and the Claude-User on-demand fetch agent. Perplexity rolled out PerplexityBot for indexing and a separate Perplexity-User user-agent for live retrieval.

The split mattered because the use cases are different. A training crawler downloads pages in bulk to feed a model's pretraining corpus — the model learns from your content but does not need to fetch it again at query time. A live-retrieval bot fetches a specific page on demand when a user asks a question that mentions your store or a topic your store covers. The training bot influences whether your brand appears in the model's parametric memory. The live bot influences whether your specific URL gets cited in a real-time answer.

In 2026, most of the AI citation traffic that lands in your analytics comes through live-retrieval bots, not training. When ChatGPT cites a Shopify store as a source, the fetch is happening at query time through OAI-SearchBot or ChatGPT-User. When Perplexity surfaces a comparison page from a DTC brand, the fetch is happening through Perplexity-User. The training crawlers shape the long-term model knowledge; the live bots shape what gets cited today.

This is the practical implication: a robots.txt that blocks all AI user-agents — a pattern that spread through blog posts in late 2023 — leaves money on the table in 2026 because it blocks the live bots that drive real-time citation traffic.

The six user-agents that matter

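Here is the working list for an ecommerce store in mid-2026. It covers the six core user-agents from OpenAI, Anthropic, and Perplexity, plus Perplexity-User, Google's tokens, and CCBot, which also appear in the template later in this guide. Each entry has a name, what the bot is for, and whether it is a training crawler or a live-retrieval bot.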

OpenAI

  • GPTBot — training crawler for OpenAI models. Documented at platform.openai.com/docs/bots. Honors robots.txt.
  • OAI-SearchBot — search index crawler that powers ChatGPT search results. Separate from GPTBot. Honors robots.txt.
  • ChatGPT-User — on-demand fetch when a ChatGPT user clicks a link or pastes a URL. Behaves like a browser hit triggered by a user.

Anthropic

  • ClaudeBot — training crawler. Documented at support.anthropic.com under web crawler topics. Honors robots.txt.
  • Claude-User — on-demand fetch when a Claude user references a specific URL or asks Claude to browse to a page.

Perplexity

  • PerplexityBot — index crawler. Documented at docs.perplexity.ai under PerplexityBot. Honors robots.txt.
  • Perplexity-User — live retrieval bot that fetches pages when answering a user's question. Perplexity has stated that this bot does not respect robots.txt because the fetch is initiated by a user, not the engine, though this remains an area of public discussion.

Google

  • Google-Extended — not a user-agent string for crawling; it is a token you place in robots.txt to opt your site out of Google's Gemini training pipeline while still allowing Googlebot to index for search.
  • Googlebot — standard search crawler, unchanged.

Common Crawl

  • CCBot — third-party crawler whose dataset feeds many AI training pipelines including OpenAI's earlier models. Many sites that want to limit AI training exposure also block CCBot.

That is the working set. There are smaller crawlers (Meta's FacebookBot for AI products, Bytespider for ByteDance, Amazonbot for Amazon's product Q&A, Applebot-Extended for Apple Intelligence) and the list grows quarterly. The principle is the same for all of them: identify whether the bot is a training crawler or a live-retrieval bot, and treat each accordingly.
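
One way to apply that split in practice is to tag access-log hits by role. The sketch below assumes each log line contains the raw user-agent string somewhere in it; the role labels follow this guide's grouping rather than any official taxonomy, and the sample log lines are invented for illustration.

```python
from collections import Counter

# Role labels follow this guide's grouping; they are not an official taxonomy.
BOT_ROLES = {
    "GPTBot": "training/index",
    "ClaudeBot": "training/index",
    "PerplexityBot": "training/index",
    "CCBot": "training/index",
    "OAI-SearchBot": "live",
    "ChatGPT-User": "live",
    "Claude-User": "live",
    "Perplexity-User": "live",
}

def count_ai_hits(log_lines):
    """Count hits per (token, role) for log lines that mention a known token."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token, role in BOT_ROLES.items():
            if token.lower() in lowered:
                counts[(token, role)] += 1
                break  # attribute each request line to one bot only
    return counts

# Invented log lines in a combined-log style.
sample_log = [
    '203.0.113.5 - - "GET /guides/sizing HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - "GET /guides/sizing HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '198.51.100.2 - - "GET /cart HTTP/1.1" 200 "-" "Mozilla/5.0 (Macintosh) Safari/605.1.15"',
]
for (token, role), hits in count_ai_hits(sample_log).items():
    print(token, role, hits)
```

User-agent strings can be spoofed, so counts like these are a rough signal rather than proof of which engine made the request.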

What a working ecommerce robots.txt looks like in 2026

Here is a template that an ecommerce store can adapt. It allows live-retrieval bots, leaves the training bot decision to the operator, locks down admin and account paths, and points to a sitemap. The training-bot opt-out is shown commented out: to opt out, a store uncomments the Disallow: / line and deletes the Allow: / line in the same block.

# Standard rules for all crawlers
User-agent: *
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Allow: /

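# Live-retrieval bots: allowed by default to preserve real-time citations.
# A bot that matches a named group ignores the * group, so the path rules are repeated.
User-agent: OAI-SearchBot
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Allow: /

User-agent: ChatGPT-User
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Allow: /

User-agent: Claude-User
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Allow: /

User-agent: Perplexity-User
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Allow: /

# Training crawlers: to opt out, uncomment Disallow: / and delete the Allow: / line
User-agent: GPTBot
# Disallow: /
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Allow: /

User-agent: ClaudeBot
# Disallow: /
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Allow: /

User-agent: PerplexityBot
# Disallow: /
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Allow: /

User-agent: CCBot
# Disallow: /
Disallow: /admin
Disallow: /account
Disallow: /cart
Disallow: /checkout
Disallow: /search?
Allow: /

# Google AI training opt-out (does not affect Googlebot for search).
# Google-Extended is a control token, not a fetcher, so it needs no path rules.
User-agent: Google-Extended
# Disallow: /
Allow: /

Sitemap: https://yourstore.com/sitemap.xml

Three things to notice. First, every AI user-agent gets its own block. Second, the training bots and live bots are separated. Third, the Disallow lines for admin, account, cart, and checkout must appear inside each named block as well as under User-agent: *. Under RFC 9309 a crawler that finds a group matching its own name follows only that group and ignores the wildcard group, so a named block containing nothing but Allow: / would reopen those paths to that bot. These paths should stay off-limits to any bot, AI or otherwise.

If you want to block AI training entirely while keeping live citations alive, uncomment the Disallow: / lines under the four training-bot blocks (GPTBot, ClaudeBot, PerplexityBot, CCBot) and the Google-Extended block, and delete the Allow: / line in each. When Allow: / and Disallow: / sit in the same group they match equally, and RFC 9309 resolves that tie in favor of Allow, so leaving both in place blocks nothing. The live retrieval bots (OAI-SearchBot, ChatGPT-User, Claude-User, Perplexity-User) stay allowed, which preserves your real-time citation path.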
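
How a named group interacts with User-agent: * can be checked offline with Python's standard-library urllib.robotparser. The sketch below uses a cut-down, hypothetical robots.txt rather than the full template, and the helper name allowed is made up for illustration. One caveat: Python's parser applies the first matching rule in a group, whereas RFC 9309 specifies longest-match, so the specific Disallow lines are placed before Allow: / to make both approaches agree.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, cut-down robots.txt: one named group without path rules,
# one named group that repeats them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Disallow: /admin
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """True if the parsed rules let `agent` fetch `path` on the example host."""
    return parser.can_fetch(agent, "https://yourstore.com" + path)

# No named group matches, so the * group applies and /admin is blocked.
print(allowed("SomeOtherBot", "/admin"))   # False
# Matches its own group, which has no /admin rule, so /admin is open.
print(allowed("OAI-SearchBot", "/admin"))  # True
# Matches its own group, which repeats the rule, so /admin is blocked.
print(allowed("ChatGPT-User", "/admin"))   # False
```

The middle result is the one to watch for: a named block that only says Allow: / does not inherit the wildcard group's path rules.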

What changed in 2026 — and what to verify yourself

Two patterns are worth flagging because they shift the picture.

First, AI engines are increasingly distinguishing training crawl from live retrieval at the user-agent level. The clean split between OpenAI's GPTBot and OAI-SearchBot is the clearest example, and Anthropic's split between ClaudeBot and Claude-User follows the same shape. The implication for ecommerce is that the simple "block all AI bots" robots.txt pattern that spread in 2023 has aged badly. A store that blocked everything under one Disallow rule in 2023 is now blocking the live bots that would have cited their product comparison and pillar pages in 2026.

Second, the public guidance from each engine is moving toward "honor robots.txt for training crawlers, treat user-initiated fetches more leniently." Perplexity has been the most explicit about this, stating that Perplexity-User fetches are initiated by a user and treated differently from PerplexityBot. The implication is that even a Disallow on Perplexity-User may not stop a fetch when a user explicitly asks Perplexity to read your page. This is an area where the standards are still settling and the right response is to test.

The cleanest way to verify server-side reachability is to fetch your own page with a spoofed user-agent string and watch your logs. curl -A "OAI-SearchBot" -I https://yourstore.com/your-pillar-page should return 200 unless a server, CDN, or firewall rule rejects that user-agent. Note that curl never reads robots.txt, so this check exercises your infrastructure rather than your robots.txt rules; validate the rules themselves with a robots.txt tester or parser. Do the same for ClaudeBot, PerplexityBot, and the live bots. If anything returns an unexpected 403 or a redirect chain, that is a sign a CDN-layer or application-layer rule is filtering the bot regardless of what robots.txt says.
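
The same spoofed-user-agent check can be looped over several bots. This is a rough sketch using only the Python standard library; it sends the bare token as the User-Agent header, as the curl command does, whereas real crawlers send longer strings. The function names head_status and blocked_agents are made up for illustration.

```python
import urllib.error
import urllib.request

AGENTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-User",
          "GPTBot", "ClaudeBot", "PerplexityBot"]

def head_status(url, user_agent, timeout=10):
    """Send a HEAD request with the given User-Agent and return the HTTP status."""
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": user_agent})
    try:
        # urlopen follows redirects, so this is the status of the final hop.
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 403, 404, 429 and similar arrive as exceptions

def blocked_agents(url, agents=AGENTS, fetch=head_status):
    """Return {agent: status} for every agent that did not get a 200."""
    results = {agent: fetch(url, agent) for agent in agents}
    return {agent: status for agent, status in results.items() if status != 200}

# Example call (performs real network requests):
# print(blocked_agents("https://yourstore.com/your-pillar-page"))
```

An empty dictionary means no user-agent in the list was rejected at the server or CDN layer. The fetch parameter exists so the reporting logic can be exercised without touching the network.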

How to deploy on Shopify, Magento, and headless

Shopify. Customize the robots.txt.liquid template introduced in 2021. In your theme code editor, create or edit templates/robots.txt.liquid and place the user-agent blocks there using Liquid. The change is live at yourstore.com/robots.txt within minutes of saving. Note that Shopify wraps the file with its own default rules — make sure your additions do not conflict with the auto-generated User-agent: * block at the top.

Magento 2. The file is editable via Admin → Content → Design → Configuration → Search Engine Robots, or you can place a static robots.txt directly at pub/robots.txt and ensure your webserver does not override it. If you run multiple stores on one Magento install, the per-store robots.txt configuration is the cleaner option.

Headless (Next.js, Remix, custom). Create a static robots.txt in your public directory or a route handler that returns the file content with Content-Type: text/plain. For Next.js App Router, a robots.ts file in the app directory default-exports a function that returns the rules, and Next.js builds the file from it. For Remix, a routes/robots[.]txt.ts exports a loader. Test the deployed URL with curl after each change.

In all three stacks, the verification step is the same: hit https://yourstore.com/robots.txt from a clean browser and check the live file matches your intent. Caching layers, CDN overrides, and platform default templates are the three most common reasons a robots.txt change does not show up live, and curl with -I will surface any caching headers that are masking your edit.

The pairing with llms.txt

Robots.txt controls access. llms.txt curates content. The two files stack and serve different jobs.

A 2026 ecommerce setup that takes AI search seriously runs both. Robots.txt allows the live-retrieval bots and locks down admin and account paths. llms.txt points the allowed bots at your highest-leverage pages: pillar guides, top product collections, FAQ hubs, comparison content. The intent is that a bot reads robots.txt to learn what it is allowed to fetch, then reads llms.txt to learn what is worth fetching first, although llms.txt is still a proposal and whether a given engine reads it is worth checking in your logs.
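
To make the shape of the file concrete, here is a small sketch that renders an llms.txt body from a list of pages. It assumes the layout described in the llms.txt proposal (an H1 title, a blockquote summary, then H2 sections of Markdown links); the store name, URLs, and the function name render_llms_txt are placeholders invented for illustration.

```python
def render_llms_txt(site_name, summary, sections):
    """Render an llms.txt body: H1 title, blockquote summary, H2 link sections."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"

# Placeholder content; swap in your own pillar pages.
body = render_llms_txt(
    "Your Store",
    "Independent footwear retailer with sizing and care guides.",
    {
        "Guides": [
            ("Sizing guide", "https://yourstore.com/guides/sizing",
             "How our sizes map to US and EU sizes"),
        ],
        "Collections": [
            ("Trail runners", "https://yourstore.com/collections/trail",
             "Top-selling trail running shoes"),
        ],
    },
)
print(body)
```

The output would be saved at the site root as llms.txt. Keeping the list short is deliberate: the file is meant to highlight a handful of high-value pages, not mirror the sitemap.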

If you only have time to ship one this quarter, ship robots.txt with explicit user-agent blocks for the six AI crawlers. The change protects against accidentally blocking live citations and signals to each engine that you are treating them as first-class crawlers. Ship llms.txt next quarter once you have at least five pillar pages worth pointing at.

What this looks like in our work

In the audits we run for premium DTC stores, the most common finding is not that a store is blocking AI bots intentionally — it is that the store has a 2023-vintage robots.txt with no AI-specific rules at all, which means every AI crawler gets the same default treatment as Googlebot. The fix is a 30-minute robots.txt edit and a curl verification round. The result, two weeks later, is usually a small but measurable lift in real-time citations from ChatGPT and Perplexity because the live-retrieval bots can now reach pages that were already indexable but lost in the noise of an unstructured crawl budget.

The pattern observation matters more than any single number. Robots.txt is the cheapest leverage point in the AI Visibility stack — under an hour of work, no content changes, no schema rewrite — and most ecommerce stores have not touched theirs since they first launched. Ship the AI user-agent blocks, ship the curl verification, then move to llms.txt and structured data.

Where to go next

If you have not yet published an llms.txt, the llms.txt for Ecommerce spec walkthrough is the companion piece — what goes in the file, what stays out, and where it deploys on Shopify, Magento, and headless.

If your structured data is the bottleneck, Structured Data and Entity Authority: The 200-Word Rule for AI Citations covers the schema patterns that move the needle on ChatGPT and Perplexity citations.

If you want a broader playbook for AI Search visibility across the full stack, How to Optimize Ecommerce for AI Search (2026 Playbook) covers the seven-layer model that ties robots.txt, llms.txt, schema, entity, content, citations, and measurement together.

Frequently asked
Which AI crawlers should an ecommerce store care about in 2026?
At minimum, six user-agents are worth knowing: GPTBot (OpenAI training crawler), OAI-SearchBot (ChatGPT live search retrieval), ChatGPT-User (on-demand fetch when a user pastes a URL), PerplexityBot (Perplexity index crawler), ClaudeBot (Anthropic training crawler), and Claude-User (on-demand fetch for Claude users). Google-Extended is a flag rather than a crawler, but it tells Google whether your pages can train Gemini. Each engine has split its crawling into a training bot and a live-retrieval bot since late 2024, and the distinction matters because blocking the training bot does not block the live one — and the live one is what drives the citation when a user asks a question in real time.
Will blocking GPTBot or ClaudeBot hurt my AI citations?
Probably not for live citations, but it limits long-term training exposure. OpenAI and Anthropic both publish that their live-retrieval bots (OAI-SearchBot, Claude-User, ChatGPT-User) are governed by separate user-agents from the training crawlers (GPTBot, ClaudeBot). If you block only GPTBot and ClaudeBot, your store can still be cited in real-time answers because the live bots can still fetch the page. If you block all of them, including the live retrieval bots, you cut off the on-demand fetch path that drives most of today's AI citation traffic. For most ecommerce stores, the cleaner default is to allow live-retrieval bots and decide separately whether to allow training bots based on your IP and competitive risk.
Should I use llms.txt or robots.txt to control AI crawlers?
They do different things and stack. Robots.txt controls access: which paths each user-agent is allowed to fetch. llms.txt is a content recommendation: a curated list of URLs you want AI models to prioritize when they do crawl. A practical 2026 setup uses both: robots.txt to allow live-retrieval bots while optionally blocking training bots, and llms.txt to point the allowed bots at your highest-leverage pages (pillar guides, top collections, FAQ hubs). Robots.txt states the access rules; llms.txt is a recommendation. Neither file is technically enforced, so both depend on the crawler choosing to comply.
Do AI crawlers actually respect robots.txt?
The major ones publicly commit to honoring robots.txt for their automated crawlers: OpenAI, Anthropic, Perplexity, and Google all document their user-agent strings and state that their crawlers honor Disallow rules. User-initiated fetchers are the exception, with Perplexity-User the clearest case, as covered above. Compliance has been studied in the wild and the picture is uneven for smaller AI startups, but the engines that produce most of today's citation traffic do respect the standard. Reality check: robots.txt is voluntary and advisory, and there is no enforcement layer. If you have content you actively need to keep out of AI training data, robots.txt is necessary but not sufficient: gate it behind login or remove it.
Where do most ecommerce stores get this wrong in 2026?
Three common patterns. (1) Copy-pasting a single Disallow: / for all AI user-agents because a blog post said to, which blocks the live-retrieval bots that drive real-time citations. (2) Leaving robots.txt fully open including admin and account paths, which exposes login URLs and customer data routes to crawlers. (3) Treating robots.txt as the only AI-facing file and skipping llms.txt entirely, which means AI bots crawl your low-signal pages along with your high-signal ones and your citation rate stays flat. The fix in each case is granular — separate the training bots from the live bots, lock down admin paths, and publish a short llms.txt that points the allowed bots at your pillar content.
How do I implement this on Shopify, Magento, or headless?
Shopify lets you customize robots.txt via the robots.txt.liquid template introduced in 2021. You add user-agent blocks for each AI crawler with the rules you want and the live preview at yourstore.com/robots.txt reflects them within minutes. Magento 2 robots.txt is editable via Admin → Content → Design → Configuration → Search Engine Robots, or directly as a static file at pub/robots.txt. Headless stacks (Next.js, Remix, custom) handle robots.txt as a static file or a server-rendered route, depending on framework. In every case, test the live URL with curl after deploying and verify each user-agent block parses correctly before assuming it works.