Crawler Rules Specification
The robots.txt directives that control which AI search engines can crawl your site — ten known AI crawlers, their user-agent strings, and the access posture the Analyzer expects.
AI crawlers
The Analyzer tracks ten AI crawlers across the major AI search platforms. Each crawler is identified by its User-agent string in robots.txt. The table below lists all ten with their provider and the AI engine they serve.
| User-agent | Provider | AI Engine |
|---|---|---|
| GPTBot | OpenAI | ChatGPT / GPT-4o |
| ChatGPT-User | OpenAI | ChatGPT browsing |
| ClaudeBot | Anthropic | Claude |
| Claude-Web | Anthropic | Claude web search |
| PerplexityBot | Perplexity AI | Perplexity |
| Google-Extended | Google | Gemini / AI Overviews |
| CCBot | Common Crawl | Training data (multiple) |
| Bytespider | ByteDance | Doubao / Grok training |
| Amazonbot | Amazon | Alexa / Amazon Q |
| meta-externalagent | Meta | Meta AI |
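The ten user-agent strings above can be captured in a simple lookup table. A minimal sketch in Python (the name `AI_CRAWLERS` and the structure are illustrative, not the Analyzer's actual internals):

```python
# The ten AI crawlers tracked by the Analyzer, keyed by user-agent
# string (casing as published by each provider) -> provider name.
AI_CRAWLERS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Claude-Web": "Anthropic",
    "PerplexityBot": "Perplexity AI",
    "Google-Extended": "Google",
    "CCBot": "Common Crawl",
    "Bytespider": "ByteDance",
    "Amazonbot": "Amazon",
    "meta-externalagent": "Meta",
}
```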
User-agent matching is case-insensitive: a robots.txt entry for GPTBot, gptbot, or Gptbot matches the same crawler.
robots.txt format
To explicitly allow all ten AI crawlers, add a User-agent block for each one with an Allow: / directive. Place these blocks before any wildcard User-agent: * block to ensure they take precedence.
```
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: meta-externalagent
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /private/
```
Allow: / is explicit, not required
If robots.txt has no Disallow: / rule for a crawler, access is implicitly allowed. Explicit Allow: / rules are still recommended because they make your intent unambiguous and are easier to audit.
Crawler classification
The Analyzer classifies each of the ten crawlers into one of three states based on the parsed robots.txt rules:
| Classification | Definition |
|---|---|
| allowed | An explicit Allow: / rule exists for this crawler, or a Disallow: / rule exists but is overridden by Allow: / in the same block. |
| blocked | A Disallow: / rule exists for this crawler with no overriding Allow: / rule, or the wildcard block has Disallow: / and no exact block exists for this crawler. |
| unspecified | No block exists for this crawler and the wildcard block (if present) does not have a Disallow: / rule. Access is implicitly permitted but not explicitly stated. |
Precedence rules: an exact User-agent block takes priority over the wildcard User-agent: * block. Within a block, Allow rules can override Disallow rules.
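The precedence and classification rules above can be sketched in Python. This is an illustrative sketch, assuming a `blocks` mapping from lowercased user-agent to parsed rules; the Analyzer's internal representation may differ:

```python
def classify(crawler: str, blocks: dict) -> str:
    """Classify one crawler as 'allowed', 'blocked', or 'unspecified'.

    `blocks` maps a lowercased user-agent token to a list of
    (directive, path) tuples, e.g. {"gptbot": [("allow", "/")]}.
    User-agent matching is case-insensitive, so keys are lowercased.
    """
    # An exact User-agent block takes priority over the wildcard block.
    rules = blocks.get(crawler.lower())
    if rules is None:
        rules = blocks.get("*")
    if rules is None:
        return "unspecified"            # no exact block, no wildcard block
    if ("allow", "/") in rules:
        return "allowed"                # Allow: / overrides Disallow: /
    if ("disallow", "/") in rules:
        return "blocked"
    return "unspecified"                # block exists but has no root rule
```

For example, with a wildcard `Disallow: /` and an exact `Allow: /` block for GPTBot, `classify("GPTBot", blocks)` returns `"allowed"` while crawlers without an exact block fall through to the wildcard and return `"blocked"`.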
Analyzer checks
The checkCrawlerRules function fetches {origin}/robots.txt, parses it into blocks, and classifies each of the ten AI crawlers:
| Condition | Status | Recommendation code |
|---|---|---|
| robots.txt present, no global block, majority of crawlers not blocked | pass | — |
| robots.txt returns HTTP 404 | warn | ADD_ROBOTS_TXT |
| Majority (≥6 of 10) of AI crawlers blocked | warn | REVIEW_AI_CRAWLER_ACCESS |
| Global Disallow: / in wildcard block, or all 10 crawlers explicitly blocked | fail | ALLOW_AI_CRAWLERS |
| Network error fetching robots.txt | unknown | — |
A global block is defined as a User-agent: * block with Disallow: / and no overriding Allow: /. This blocks all crawlers — including AI search engines — regardless of any other rules in the file.
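The decision table above can be expressed as a short status-mapping function. A sketch under assumed inputs (the function name and parameters are hypothetical; `checkCrawlerRules` derives these facts from the fetched file):

```python
def crawler_rules_status(fetched: bool, http_404: bool,
                         global_block: bool, blocked_count: int):
    """Map parsed robots.txt facts to (status, recommendation code).

    blocked_count is how many of the ten AI crawlers are classified
    as blocked. Returns None for the code when there is none.
    """
    if not fetched:
        return ("unknown", None)                     # network error
    if http_404:
        return ("warn", "ADD_ROBOTS_TXT")            # no robots.txt at all
    if global_block or blocked_count == 10:
        return ("fail", "ALLOW_AI_CRAWLERS")         # everything blocked
    if blocked_count >= 6:
        return ("warn", "REVIEW_AI_CRAWLER_ACCESS")  # majority blocked
    return ("pass", None)
```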
Common mistakes
- Global Disallow: / with no AI-specific Allow rules. A `User-agent: *` block with `Disallow: /` blocks every crawler that does not have its own explicit block. If you need to restrict general crawlers but allow AI engines, add individual `User-agent` blocks for each AI crawler above the wildcard block.
- Blocking crawlers by provider instead of user-agent. Some sites block entire IP ranges or use firewall rules to block crawlers. The Analyzer only reads `robots.txt` — network-level blocks are not visible to it, but they still prevent AI engines from accessing your content.
- Missing robots.txt entirely. Without a `robots.txt` file, AI crawlers have no explicit guidance. The Analyzer returns `warn` and recommends creating the file. While implicit access is permitted, an explicit file signals that you have considered AI crawler access.
- Blocking training crawlers but not search crawlers. `CCBot` and `Bytespider` are primarily used for training data collection. Blocking them is a reasonable choice, but blocking `GPTBot`, `ClaudeBot`, or `PerplexityBot` also prevents those engines from indexing your site for search results. Distinguish between training and search crawlers when writing your rules.
- Incorrect user-agent casing in the file. While the Analyzer normalizes casing, some robots.txt parsers are case-sensitive. Use the exact casing shown in the table above (`GPTBot`, not `gptbot`) to ensure compatibility across all parsers.
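The training-versus-search distinction can be expressed directly in robots.txt. A sketch that blocks the two primarily training-focused crawlers while explicitly allowing the search crawlers (adapt the list to your own policy):

```
User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```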