Reference. Last verified: Mon May 18 2026 02:00:00 GMT+0200 (Central European Summer Time)

AI Crawler Reference 2026

Name: AI Crawler Reference 2026
Creator: SitemapHost
Published: Mon May 18 2026 02:00:00 GMT+0200 (Central European Summer Time)
License: https://creativecommons.org/licenses/by/4.0/

Every known AI crawler in one table: user-agent string, operator, purpose, whether it respects robots.txt, JavaScript rendering support, and links to operator-published IP ranges. Refreshed quarterly.

Crawler	User-agent	Operator	Purpose	Respects robots.txt	Renders JS	Primary source
GPTBot	`GPTBot`	OpenAI	Training data for ChatGPT / GPT-4+ family	Yes	No	OpenAI docs
OAI-SearchBot	`OAI-SearchBot`	OpenAI	Indexes pages for ChatGPT Search (RAG live citations)	Yes	Yes	OpenAI docs
ChatGPT-User	`ChatGPT-User`	OpenAI	On-demand fetch when a ChatGPT user includes a URL in their prompt	Yes	Yes	OpenAI docs
ClaudeBot	`ClaudeBot`	Anthropic	Training data for Claude	Yes	No	Anthropic docs
Claude-User	`Claude-User`	Anthropic	On-demand fetch when a Claude user shares a URL	Yes	Yes	Anthropic docs
Claude-SearchBot	`Claude-SearchBot`	Anthropic	RAG indexing for Claude's web-search feature	Yes	Yes	Anthropic docs
PerplexityBot	`PerplexityBot`	Perplexity	Training and RAG indexing	Yes	Yes	Perplexity docs
Perplexity-User	`Perplexity-User`	Perplexity	On-demand fetch when a Perplexity user asks a question that triggers a URL fetch	Disputed*	Yes	Perplexity docs
Google-Extended	`Google-Extended` (token)	Google	Opt-out token for Gemini / Bard training (not a fetcher; signals Googlebot)	Yes	n/a	Google docs
Applebot-Extended	`Applebot-Extended` (token)	Apple	Opt-out token for Apple Intelligence training (signals Applebot)	Yes	n/a	Apple docs
Meta-ExternalAgent	`meta-externalagent`	Meta	Training data for Llama and Meta AI	Yes	No	Meta docs
Bytespider	`Bytespider`	ByteDance	Training data for Doubao and TikTok-side models	Disputed*	No	ByteDance page
CCBot	`CCBot`	Common Crawl	Public web archive used by many LLM training pipelines (indirect)	Yes	No	Common Crawl
cohere-ai	`cohere-ai`	Cohere	Training data for Command-family models	Yes	No	Cohere docs
Diffbot	`Diffbot`	Diffbot	Builds the Diffbot Knowledge Graph (used by enterprise RAG systems)	Yes	Yes	Diffbot docs
DuckAssistBot	`DuckAssistBot`	DuckDuckGo	RAG for DuckDuckGo's DuckAssist AI answers	Yes	Yes	DDG help

* Disputed: independent reports (e.g. Wired's 2024 Perplexity investigation and multiple webmaster reports about Bytespider) have shown these crawlers fetching pages even after a `User-agent: {Name}` / `Disallow: /` rule was in place. The operators dispute the findings or attribute the behaviour to a bug. Treat their robots.txt compliance with skepticism; if the user-agent rule is ignored, a firewall-level block may be needed.

Opt out of AI training and indexing

Add this block to the robots.txt at the root of your domain (e.g., https://example.com/robots.txt). It asks every crawler listed in the table above not to fetch any URL on your site. Compliance is voluntary, so it only binds crawlers that honor robots.txt:

# Block AI crawlers — generated by sitemaphost.app/ai-crawlers/
# Last reviewed: Mon May 18 2026 02:00:00 GMT+0200 (Central European Summer Time)

User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: meta-externalagent
User-agent: Bytespider
User-agent: CCBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: DuckAssistBot
Disallow: /

Caveats: Google-Extended and Applebot-Extended are opt-out tokens, not separate fetchers. They tell the main Googlebot / Applebot crawlers that your content is off-limits for AI training, while still letting Search index you. If you want to opt out of search indexing too, also block Googlebot and Applebot — but be aware this removes you from Google and Apple Search results.
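
To illustrate how the stacked User-agent lines in the template share the single Disallow rule, here is a minimal sketch of a group parser. It is a simplification, not a full robots.txt implementation: it reads only User-agent and Disallow lines, compares tokens case-insensitively, and ignores Allow rules, wildcards, and the `*` fallback group. The function names `parseGroups` and `isDisallowed` are illustrative, not part of any library.

```javascript
// Minimal robots.txt group parser (sketch: User-agent and Disallow only).
function parseGroups(robotsTxt) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // strip comments
    if (!line) continue;                          // blank lines do not end a group
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines extend the same group.
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if (field === 'disallow' && current && value) {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// True if `path` starts with a Disallow prefix in a group that names `token`.
function isDisallowed(groups, token, path) {
  const t = token.toLowerCase();
  return groups
    .filter((g) => g.agents.includes(t))
    .some((g) => g.disallow.some((prefix) => path.startsWith(prefix)));
}

const sampleRobots = [
  '# Block AI crawlers',
  'User-agent: GPTBot',
  'User-agent: ClaudeBot',
  'Disallow: /',
].join('\n');

const parsedGroups = parseGroups(sampleRobots);
console.log(isDisallowed(parsedGroups, 'GPTBot', '/pricing'));    // true
console.log(isDisallowed(parsedGroups, 'claudebot', '/pricing')); // true
console.log(isDisallowed(parsedGroups, 'Googlebot', '/pricing')); // false
```

Both named crawlers end up in one group, so the single `Disallow: /` applies to each of them, while a crawler that is not named is unaffected by this block.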

Block only training, keep RAG citations

If you want ChatGPT, Claude, and Perplexity to cite your pages in live answers (and drive traffic) but don't want them training models on you, block only the training-corpus crawlers:

# Block training crawlers; allow live-citation crawlers

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: meta-externalagent
User-agent: Bytespider
User-agent: CCBot
User-agent: cohere-ai
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

# These continue to fetch (and cite) your pages:
# OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot,
# PerplexityBot, Perplexity-User, DuckAssistBot, Diffbot

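This is the recommended setup for most B2B SaaS, marketing sites, and content businesses: AI citations are a fast-growing traffic source (see Plausible-tracked sources for sitemaphost.app itself), and blocking RAG crawlers cuts you off from that. For the "I don't want to be training data" goal, only the training-corpus crawlers need blocking. One trade-off to be aware of: the table lists PerplexityBot as serving both training and RAG indexing, so leaving it allowed keeps Perplexity citations at the cost of that training exposure.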
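
If you maintain the crawler list as data, a block like the one above can be generated rather than hand-edited. The following is a rough sketch under stated assumptions: the `purpose` labels (`training`, `search`, `user`) are this example's own shorthand for the Purpose column, PerplexityBot is filed under `search` to match the template above, and `buildRobotsBlock` is a hypothetical helper name.

```javascript
// Tokens from the table above; `purpose` is this sketch's own shorthand.
const crawlers = [
  { token: 'GPTBot', purpose: 'training' },
  { token: 'OAI-SearchBot', purpose: 'search' },
  { token: 'ChatGPT-User', purpose: 'user' },
  { token: 'ClaudeBot', purpose: 'training' },
  { token: 'Claude-User', purpose: 'user' },
  { token: 'Claude-SearchBot', purpose: 'search' },
  { token: 'PerplexityBot', purpose: 'search' }, // assumption: follows the template above
  { token: 'Perplexity-User', purpose: 'user' },
  { token: 'Google-Extended', purpose: 'training' },
  { token: 'Applebot-Extended', purpose: 'training' },
  { token: 'meta-externalagent', purpose: 'training' },
  { token: 'Bytespider', purpose: 'training' },
  { token: 'CCBot', purpose: 'training' },
  { token: 'cohere-ai', purpose: 'training' },
  { token: 'Diffbot', purpose: 'search' },
  { token: 'DuckAssistBot', purpose: 'search' },
];

// Build one robots.txt group that disallows every crawler whose purpose
// is in `blockedPurposes`. Returns an empty string if nothing matches.
function buildRobotsBlock(list, blockedPurposes) {
  const agentLines = list
    .filter((c) => blockedPurposes.includes(c.purpose))
    .map((c) => `User-agent: ${c.token}`);
  if (agentLines.length === 0) return '';
  return [...agentLines, 'Disallow: /'].join('\n') + '\n';
}

const trainingOnly = buildRobotsBlock(crawlers, ['training']);
console.log(trainingOnly);
```

Passing `['training', 'search', 'user']` instead produces the full opt-out group, so one list can drive both templates.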

How to detect AI crawler traffic

If you want to measure rather than block, log requests by User-Agent header and bucket them by crawler family. Each crawler sends its distinguishing token with the capitalization shown in the table above, but a case-insensitive substring match is the safest approach because it still catches any variation in casing.

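A simple rule, written as JavaScript in the style of a Cloudflare Worker; the same pattern can be ported to nginx or an access-log pipeline. Google-Extended and Applebot-Extended are left out because they are robots.txt tokens and never appear in a User-Agent header: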

const ai = /GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|meta-externalagent|Bytespider|CCBot|cohere-ai|Diffbot|DuckAssistBot/i;
if (ai.test(request.headers.get('user-agent') || '')) {
  // log to analytics, add 'ai-crawler' header, etc.
}
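
The regex above answers "is this an AI crawler?" but not "which family?". One possible way to do the bucketing is sketched below. The operator grouping comes from the table; the function names `classify` and `countByOperator` are illustrative, and the sample User-Agent strings are made up for the example rather than copied from any operator's documentation.

```javascript
// Distinguishing tokens grouped by operator, per the table above.
const families = {
  OpenAI: ['GPTBot', 'OAI-SearchBot', 'ChatGPT-User'],
  Anthropic: ['ClaudeBot', 'Claude-User', 'Claude-SearchBot'],
  Perplexity: ['PerplexityBot', 'Perplexity-User'],
  Meta: ['meta-externalagent'],
  ByteDance: ['Bytespider'],
  'Common Crawl': ['CCBot'],
  Cohere: ['cohere-ai'],
  Diffbot: ['Diffbot'],
  DuckDuckGo: ['DuckAssistBot'],
};

// Return { operator, token } for the first matching token, or null.
// Matching is a case-insensitive substring test on the User-Agent value.
function classify(userAgent) {
  const ua = (userAgent || '').toLowerCase();
  for (const [operator, tokens] of Object.entries(families)) {
    const token = tokens.find((t) => ua.includes(t.toLowerCase()));
    if (token) return { operator, token };
  }
  return null;
}

// Tally hits per operator; requests from non-AI clients are skipped.
function countByOperator(userAgents) {
  const counts = {};
  for (const ua of userAgents) {
    const hit = classify(ua);
    if (hit) counts[hit.operator] = (counts[hit.operator] || 0) + 1;
  }
  return counts;
}

// Made-up User-Agent values for illustration only.
const sampleUserAgents = [
  'Mozilla/5.0 (compatible; GPTBot/1.0)',
  'Mozilla/5.0 (compatible; claudebot/1.0)',
  'Mozilla/5.0 (compatible; OAI-SearchBot/1.0)',
  'Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0',
];

console.log(countByOperator(sampleUserAgents)); // { OpenAI: 2, Anthropic: 1 }
```

Keeping the token as well as the operator in the `classify` result lets you separate training crawlers from live-citation crawlers within the same family.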

For a higher-confidence signal you can also verify the source IP against the operator's published ranges (linked in the table). OpenAI and Anthropic publish JSON files; Google-Extended uses Google's main ranges; some operators don't publish IP lists at all, in which case the user-agent string is the only signal.
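
As a rough sketch of the IP check, the code below tests an IPv4 address against a list of CIDR ranges. It assumes you have already extracted CIDR strings from whatever the operator publishes (the file format differs per operator and is not modelled here), it handles IPv4 only, and the ranges shown are hypothetical placeholders taken from the reserved documentation blocks, not real crawler ranges. The helper names are illustrative.

```javascript
// Convert a dotted-quad IPv4 string to an unsigned integer; null if malformed.
function ipv4ToInt(ip) {
  const parts = String(ip).split('.');
  if (parts.length !== 4) return null;
  let n = 0;
  for (const p of parts) {
    if (!/^\d{1,3}$/.test(p)) return null;
    const octet = Number(p);
    if (octet > 255) return null;
    n = n * 256 + octet; // plain arithmetic avoids signed 32-bit bitwise pitfalls
  }
  return n;
}

// True if `ip` falls inside the IPv4 block written as "a.b.c.d/bits".
function inCidr(ip, cidr) {
  const [base, bitsStr] = String(cidr).split('/');
  const bits = Number(bitsStr);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  const size = 2 ** (32 - bits); // number of addresses in the block
  return Math.floor(ipInt / size) === Math.floor(baseInt / size);
}

// True if `ip` is inside any of the given ranges.
function isInPublishedRanges(ip, ranges) {
  return ranges.some((r) => inCidr(ip, r));
}

// HYPOTHETICAL ranges (reserved documentation blocks), not real crawler IPs.
const exampleRanges = ['192.0.2.0/24', '198.51.100.0/28'];

console.log(isInPublishedRanges('192.0.2.55', exampleRanges));    // true
console.log(isInPublishedRanges('198.51.100.16', exampleRanges)); // false
console.log(isInPublishedRanges('203.0.113.9', exampleRanges));   // false
```

A request whose User-Agent names a crawler but whose source IP falls outside that operator's ranges is a candidate for spoofing and can be logged separately.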

Maintenance and citation

This page is reviewed quarterly (Jan, Apr, Jul, Oct). The last review date is at the top. Open a pull request at github.com/jlhernando/sitemaphost or email us if you spot an error or a new crawler we should include.

License: CC BY 4.0 — free to cite and reproduce with attribution.
