Crawler Rules Specification
The robots.txt directives that control which AI search engines can crawl your site — ten known AI crawlers, their user-agent strings, and the access posture the Analyzer expects.
AI crawlers
The Analyzer tracks ten AI crawlers across the major AI search platforms. Each crawler is identified by its User-agent string in robots.txt. The table below lists all ten with their provider and the AI engine they serve.
| User-agent | Provider | AI Engine |
|---|---|---|
| GPTBot | OpenAI | ChatGPT / GPT-4o |
| ChatGPT-User | OpenAI | ChatGPT browsing |
| ClaudeBot | Anthropic | Claude |
| Claude-Web | Anthropic | Claude web search |
| PerplexityBot | Perplexity AI | Perplexity |
| Google-Extended | Google | Gemini / AI Overviews |
| CCBot | Common Crawl | Training data (multiple) |
| Bytespider | ByteDance | Doubao / Grok training |
| Amazonbot | Amazon | Alexa / Amazon Q |
| meta-externalagent | Meta | Meta AI |
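The ten user-agent strings above can be captured in a simple lookup table. A minimal sketch in Python (the name `AI_CRAWLERS` and the structure are illustrative, not the Analyzer's actual internals):

```python
# The ten AI crawlers tracked by the Analyzer, keyed by user-agent
# string (casing as published by each provider) -> provider name.
AI_CRAWLERS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Claude-Web": "Anthropic",
    "PerplexityBot": "Perplexity AI",
    "Google-Extended": "Google",
    "CCBot": "Common Crawl",
    "Bytespider": "ByteDance",
    "Amazonbot": "Amazon",
    "meta-externalagent": "Meta",
}
```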
User-agent matching is case-insensitive: a robots.txt entry for GPTBot, gptbot, or Gptbot matches the same crawler.
robots.txt format
To explicitly allow all ten AI crawlers, add a User-agent block for each one with an Allow: / directive. Place these blocks before any wildcard User-agent: * block to ensure they take precedence.
```
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: meta-externalagent
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /private/
```
Allow: / is explicit, not required
If robots.txt has no Disallow: / rule for a crawler, access is implicitly allowed. Explicit Allow: / rules are still recommended because they make your intent unambiguous and are easier to audit.
Crawler classification
The Analyzer classifies each of the ten crawlers into one of three states based on the parsed robots.txt rules:
| Classification | Definition |
|---|---|
| allowed | An explicit Allow: / rule exists for this crawler, or a Disallow: / rule exists but is overridden by Allow: / in the same block. |
| blocked | A Disallow: / rule exists for this crawler with no overriding Allow: / rule, or the wildcard block has Disallow: / and no exact block exists for this crawler. |
| unspecified | No block exists for this crawler and the wildcard block (if present) does not have a Disallow: / rule. Access is implicitly permitted but not explicitly stated. |
Precedence rules: an exact User-agent block takes priority over the wildcard User-agent: * block. Within a block, Allow rules can override Disallow rules.
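The precedence and classification rules above can be sketched in Python. This is an illustrative sketch, assuming a `blocks` mapping from lowercased user-agent to parsed rules; the Analyzer's internal representation may differ:

```python
def classify(crawler: str, blocks: dict) -> str:
    """Classify one crawler as 'allowed', 'blocked', or 'unspecified'.

    `blocks` maps a lowercased user-agent token to a list of
    (directive, path) tuples, e.g. {"gptbot": [("allow", "/")]}.
    User-agent matching is case-insensitive, so keys are lowercased.
    """
    # An exact User-agent block takes priority over the wildcard block.
    rules = blocks.get(crawler.lower())
    if rules is None:
        rules = blocks.get("*")
    if rules is None:
        return "unspecified"            # no exact block, no wildcard block
    if ("allow", "/") in rules:
        return "allowed"                # Allow: / overrides Disallow: /
    if ("disallow", "/") in rules:
        return "blocked"
    return "unspecified"                # block exists but has no root rule
```

For example, with a wildcard `Disallow: /` and an exact `Allow: /` block for GPTBot, `classify("GPTBot", blocks)` returns `"allowed"` while crawlers without an exact block fall through to the wildcard and return `"blocked"`.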
Analyzer checks
The checkCrawlerRules function fetches {origin}/robots.txt, parses it into blocks, and classifies each of the ten AI crawlers:
| Condition | Status | Recommendation code |
|---|---|---|
| robots.txt present, no global block, majority of crawlers not blocked | pass | — |
| robots.txt returns HTTP 404 | warn | ADD_ROBOTS_TXT |
| Majority (≥6 of 10) of AI crawlers blocked | warn | REVIEW_AI_CRAWLER_ACCESS |
| Global Disallow: / in wildcard block, or all 10 crawlers explicitly blocked | fail | ALLOW_AI_CRAWLERS |
| Network error fetching robots.txt | unknown | — |
A global block is defined as a User-agent: * block with Disallow: / and no overriding Allow: /. This blocks all crawlers — including AI search engines — regardless of any other rules in the file.
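The decision table above can be expressed as a short status-mapping function. A sketch under assumed inputs (the function name and parameters are hypothetical; `checkCrawlerRules` derives these facts from the fetched file):

```python
def crawler_rules_status(fetched: bool, http_404: bool,
                         global_block: bool, blocked_count: int):
    """Map parsed robots.txt facts to (status, recommendation code).

    blocked_count is how many of the ten AI crawlers are classified
    as blocked. Returns None for the code when there is none.
    """
    if not fetched:
        return ("unknown", None)                     # network error
    if http_404:
        return ("warn", "ADD_ROBOTS_TXT")            # no robots.txt at all
    if global_block or blocked_count == 10:
        return ("fail", "ALLOW_AI_CRAWLERS")         # everything blocked
    if blocked_count >= 6:
        return ("warn", "REVIEW_AI_CRAWLER_ACCESS")  # majority blocked
    return ("pass", None)
```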
Common mistakes
- Global Disallow: / with no AI-specific Allow rules. A `User-agent: *` block with `Disallow: /` blocks every crawler that does not have its own explicit block. If you need to restrict general crawlers but allow AI engines, add individual `User-agent` blocks for each AI crawler above the wildcard block.
- Blocking crawlers by provider instead of user-agent. Some sites block entire IP ranges or use firewall rules to block crawlers. The Analyzer only reads `robots.txt` — network-level blocks are not visible to it, but they still prevent AI engines from accessing your content.
- Missing robots.txt entirely. Without a `robots.txt` file, AI crawlers have no explicit guidance. The Analyzer returns `warn` and recommends creating the file. While implicit access is permitted, an explicit file signals that you have considered AI crawler access.
- Blocking training crawlers but not search crawlers. `CCBot` and `Bytespider` are primarily used for training data collection. Blocking them is a reasonable choice, but blocking `GPTBot`, `ClaudeBot`, or `PerplexityBot` also prevents those engines from indexing your site for search results. Distinguish between training and search crawlers when writing your rules.
- Incorrect user-agent casing in the file. While the Analyzer normalizes casing, some robots.txt parsers are case-sensitive. Use the exact casing shown in the table above (`GPTBot`, not `gptbot`) to ensure compatibility across all parsers.
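The training-versus-search distinction can be expressed directly in robots.txt. A sketch that blocks the two primarily training-focused crawlers while explicitly allowing the search crawlers (adapt the list to your own policy):

```
User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```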