Reference Last verified

AI Crawler Reference 2026

Every known AI crawler in one table — user-agent string, operator, purpose, whether it respects robots.txt, JavaScript rendering support, and links to operator-published IP ranges. Refreshed quarterly. Skip to opt-out templates or how to detect AI traffic.

| Crawler | User-agent | Operator | Purpose | robots.txt | Renders JS | Primary source |
| --- | --- | --- | --- | --- | --- | --- |
| GPTBot | GPTBot | OpenAI | Training data for ChatGPT / GPT-4+ family | Yes | No | OpenAI docs |
| OAI-SearchBot | OAI-SearchBot | OpenAI | Indexes pages for ChatGPT Search (RAG live citations) | Yes | Yes | OpenAI docs |
| ChatGPT-User | ChatGPT-User | OpenAI | On-demand fetch when a ChatGPT user includes a URL in their prompt | Yes | Yes | OpenAI docs |
| ClaudeBot | ClaudeBot | Anthropic | Training data for Claude | Yes | No | Anthropic docs |
| Claude-User | Claude-User | Anthropic | On-demand fetch when a Claude user shares a URL | Yes | Yes | Anthropic docs |
| Claude-SearchBot | Claude-SearchBot | Anthropic | RAG indexing for Claude's web-search feature | Yes | Yes | Anthropic docs |
| PerplexityBot | PerplexityBot | Perplexity | Training and RAG indexing | Yes | Yes | Perplexity docs |
| Perplexity-User | Perplexity-User | Perplexity | On-demand fetch when a Perplexity user asks a question that triggers a URL fetch | Disputed* | Yes | Perplexity docs |
| Google-Extended | Google-Extended (token) | Google | Opt-out token for Gemini / Bard training (not a fetcher; signals Googlebot) | Yes | n/a | Google docs |
| Applebot-Extended | Applebot-Extended (token) | Apple | Opt-out token for Apple Intelligence training (signals Applebot) | Yes | n/a | Apple docs |
| Meta-ExternalAgent | meta-externalagent | Meta | Training data for Llama and Meta AI | Yes | No | Meta docs |
| Bytespider | Bytespider | ByteDance | Training data for Doubao and TikTok-side models | Disputed* | No | ByteDance page |
| CCBot | CCBot | Common Crawl | Public web archive used by many LLM training pipelines (indirect) | Yes | No | Common Crawl |
| cohere-ai | cohere-ai | Cohere | Training data for Command-family models | Yes | No | Cohere docs |
| Diffbot | Diffbot | Diffbot | Builds the Diffbot Knowledge Graph (used by enterprise RAG systems) | Yes | Yes | Diffbot docs |
| DuckAssistBot | DuckAssistBot | DuckDuckGo | RAG for DuckDuckGo's DuckAssist AI answers | Yes | Yes | DDG help |

* Disputed: independent reports (e.g. Wired's 2024 Perplexity investigation, and reports from multiple webmasters on Bytespider) have shown these crawlers fetching pages even after a User-agent: {Name} / Disallow: / rule was in place. The operators dispute the findings or attribute the behaviour to a bug. Treat their stated robots.txt compliance with skepticism; if the user-agent rule is ignored, a firewall-level block may be needed.
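
Where a crawler ignores robots.txt, the request can be refused at the server or edge instead. The following is a minimal sketch, not an operator-endorsed method: the helper name blockDisputedCrawlers and the 403 response are illustrative assumptions, and the pattern covers only the two crawlers marked Disputed in the table.

```javascript
// Hypothetical helper: decide whether to refuse a request outright.
// Only the two crawlers marked "Disputed" in the table are listed here.
const disputed = /Bytespider|Perplexity-User/i;

function blockDisputedCrawlers(userAgent) {
  // A missing header is treated as an empty string, so it passes through.
  if (disputed.test(userAgent || '')) {
    return { status: 403, body: 'Forbidden' };
  }
  return null; // null means "serve the request normally"
}
```

A user-agent check only stops crawlers that identify themselves honestly, since the header is trivial to change.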

Opt out of AI training and indexing

Add this block to the robots.txt at the root of your domain (e.g., https://example.com/robots.txt). It asks every crawler listed above not to fetch any URL on your site; crawlers that honour robots.txt will comply:

# Block AI crawlers — generated by sitemaphost.app/ai-crawlers/
# Last reviewed: 2026-05-18

User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: meta-externalagent
User-agent: Bytespider
User-agent: CCBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: DuckAssistBot
Disallow: /

Caveats: Google-Extended and Applebot-Extended are opt-out tokens, not separate fetchers. They tell the main Googlebot / Applebot crawlers that your content is off-limits for AI training, while still letting Search index you. If you want to opt out of search indexing too, also block Googlebot and Applebot — but be aware this removes you from Google and Apple Search results.
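
To confirm that a deployed robots.txt actually contains the intended groups, the file can be parsed and checked. This is a rough sketch under simplifying assumptions: the function name fullyBlockedAgents is made up for illustration, it only looks for a literal Disallow: / rule, and it ignores Allow overrides and wildcard paths.

```javascript
// Return the set of user-agent tokens (lowercased) whose group
// contains a "Disallow: /" rule.
function fullyBlockedAgents(robotsTxt) {
  const blocked = new Set();
  let agents = [];      // user-agent tokens of the group being read
  let inRules = false;  // true once the current group has seen a rule line

  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;                    // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // A user-agent line after rule lines starts a new group.
      if (inRules) { agents = []; inRules = false; }
      agents.push(value.toLowerCase());
    } else if (field === 'allow' || field === 'disallow') {
      inRules = true;
      if (field === 'disallow' && value === '/') {
        agents.forEach((a) => blocked.add(a));
      }
    }
  }
  return blocked;
}
```

Passing the template above to this function should return all sixteen tokens in lowercase; any token missing from the result points to a typo or a broken group.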

Block only training, keep RAG citations

If you want ChatGPT, Claude, and Perplexity to cite your pages in live answers (and drive traffic) but don't want them training models on you, block only the training-corpus crawlers:

# Block training crawlers; allow live-citation crawlers

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: meta-externalagent
User-agent: Bytespider
User-agent: CCBot
User-agent: cohere-ai
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

# These continue to fetch (and cite) your pages:
# OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot,
# PerplexityBot, Perplexity-User, DuckAssistBot, Diffbot

This is what we recommend for most B2B SaaS, marketing sites, and content businesses: AI citations are a fast-growing traffic source (see Plausible-tracked sources for sitemaphost.app itself), and blocking RAG crawlers cuts you off from it. For the "I don't want to be training data" goal, only the training-corpus crawlers need to be blocked.
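
If the crawler list lives in code or config, both templates can be generated from one source. The sketch below assumes a hypothetical kind label per crawler ('training' or 'rag'); the split follows the training-only template above, and the names crawlers and robotsBlock are illustrative.

```javascript
// Tokens from the table; the `kind` labels are an illustrative assumption
// that mirrors the training-only template above.
const crawlers = [
  { token: 'GPTBot', kind: 'training' },
  { token: 'ClaudeBot', kind: 'training' },
  { token: 'meta-externalagent', kind: 'training' },
  { token: 'Bytespider', kind: 'training' },
  { token: 'CCBot', kind: 'training' },
  { token: 'cohere-ai', kind: 'training' },
  { token: 'Google-Extended', kind: 'training' },
  { token: 'Applebot-Extended', kind: 'training' },
  { token: 'OAI-SearchBot', kind: 'rag' },
  { token: 'ChatGPT-User', kind: 'rag' },
  { token: 'Claude-User', kind: 'rag' },
  { token: 'Claude-SearchBot', kind: 'rag' },
  { token: 'PerplexityBot', kind: 'rag' },
  { token: 'Perplexity-User', kind: 'rag' },
  { token: 'DuckAssistBot', kind: 'rag' },
  { token: 'Diffbot', kind: 'rag' },
];

// Build one robots.txt group that disallows everything for the chosen kinds.
function robotsBlock(list, kinds) {
  const lines = list
    .filter((c) => kinds.includes(c.kind))
    .map((c) => `User-agent: ${c.token}`);
  return lines.length ? [...lines, 'Disallow: /'].join('\n') : '';
}

console.log(robotsBlock(crawlers, ['training']));
```

Calling robotsBlock(crawlers, ['training', 'rag']) would produce the full opt-out group instead.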

How to detect AI crawler traffic

If you want to measure rather than block, log requests by User-Agent header and bucket them by crawler family. Each crawler sends its distinguishing token with fixed capitalisation, but conventions differ between crawlers (meta-externalagent is all lowercase, for example), so a case-insensitive substring match is the safest approach.

A simple rule, written here as JavaScript for a Cloudflare Worker (the same pattern carries over to an nginx map or an access-log pipeline):

const ai = /GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|Claude-User|Claude-SearchBot|PerplexityBot|Perplexity-User|meta-externalagent|Bytespider|CCBot|cohere-ai|Diffbot|DuckAssistBot/i;
if (ai.test(request.headers.get('user-agent') || '')) {
  // log to analytics, add 'ai-crawler' header, etc.
}
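
To bucket matches by crawler family rather than flag them as a single group, the pattern can be split per operator. This is a sketch only: the operator names come from the table above, while the function names crawlerFamily and countByFamily are illustrative.

```javascript
// Map each distinguishing token to its operator, per the table above.
const families = [
  ['OpenAI', /GPTBot|OAI-SearchBot|ChatGPT-User/i],
  ['Anthropic', /ClaudeBot|Claude-User|Claude-SearchBot/i],
  ['Perplexity', /PerplexityBot|Perplexity-User/i],
  ['Meta', /meta-externalagent/i],
  ['ByteDance', /Bytespider/i],
  ['Common Crawl', /CCBot/i],
  ['Cohere', /cohere-ai/i],
  ['Diffbot', /Diffbot/i],
  ['DuckDuckGo', /DuckAssistBot/i],
];

// Return the operator name for a user-agent string, or null if none matches.
function crawlerFamily(userAgent) {
  const hit = families.find(([, pattern]) => pattern.test(userAgent || ''));
  return hit ? hit[0] : null;
}

// Tally requests per operator from a list of user-agent strings.
function countByFamily(userAgents) {
  const counts = {};
  for (const ua of userAgents) {
    const family = crawlerFamily(ua);
    if (family) counts[family] = (counts[family] || 0) + 1;
  }
  return counts;
}
```

Feeding countByFamily the User-Agent column of an access log gives a per-operator request count; unmatched user-agents are simply skipped.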

For a higher-confidence signal you can also verify the source IP against the operator's published ranges (linked in the table). OpenAI and Anthropic publish JSON files; Google-Extended is only a token, so the fetches it governs come from Googlebot within Google's main ranges; some operators don't publish IP lists at all, in which case the user-agent string is the only signal.
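
Once an operator's ranges have been loaded as CIDR strings, the check itself is a prefix comparison. The sketch below handles IPv4 only and leaves out fetching and parsing the operator files, since their formats differ; the function names are illustrative, and any ranges you test with should come from the operator's own list.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer,
// or return null if the input is not a valid IPv4 address.
function ipv4ToInt(ip) {
  const parts = String(ip).split('.');
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255)) {
    return null;
  }
  const [a, b, c, d] = parts.map(Number);
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// True if `ip` falls inside `cidr`, written as "address/prefix-length".
function inCidr(ip, cidr) {
  const [base, bitsText] = String(cidr).split('/');
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null || !Number.isInteger(bits) || bits < 0 || bits > 32) {
    return false;
  }
  // A /0 prefix matches everything; JS shifts wrap at 32, so special-case it.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipInt & mask) >>> 0) === ((baseInt & mask) >>> 0);
}

// True if `ip` falls inside any range in the list.
function inAnyRange(ip, cidrs) {
  return cidrs.some((cidr) => inCidr(ip, cidr));
}
```

A request whose user-agent claims a crawler but whose IP falls outside that operator's ranges is likely spoofed. IPv6 addresses return false here and would need separate handling.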

Maintenance and citation

This page is reviewed quarterly (Jan, Apr, Jul, Oct). The last review date is at the top. Open a pull request at github.com/jlhernando/sitemaphost or email us if you spot an error or a new crawler we should include.

License: CC BY 4.0 — free to cite and reproduce with attribution.

Hosting your sitemap with AI crawlers in mind?

SitemapHost co-hosts llms.txt alongside your XML sitemap, serves both from your own domain (sitemap.yourdomain.com), and welcomes every reputable AI crawler by default.
