Your robots.txt file is quietly deciding your future in AI search.
That simple text file sitting at the root of your website tells every crawler on the internet what it can and can't access. And right now, the crawlers that matter most aren't from Google — they're from OpenAI, Anthropic, Perplexity, and a growing list of AI companies whose bots determine whether your content gets cited in ChatGPT, Claude, or any of the other AI answer engines reshaping how people find information online.
The problem? Most website owners have no idea what their robots.txt actually says to these bots. Some are accidentally blocking all AI crawlers with a single wildcard rule they added years ago. Others are wide open when they'd prefer to maintain control. And many are missing directives entirely for crawlers that didn't exist six months ago.
This guide walks you through exactly how to check your AI crawler permissions — manually, for free, in the next five minutes.
Step 1: Find Your robots.txt File
Your robots.txt file lives at the root of your domain. To view it, simply add /robots.txt to the end of your homepage URL:
https://yourdomain.com/robots.txt
Open this in your browser. You'll see a plain text file with directives that look something like this:
User-agent: *
Disallow: /admin/
Disallow: /private/
User-agent: Googlebot
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
If you get a 404 error, your site doesn't have a robots.txt file at all — which means every crawler (including AI bots) has unrestricted access to your entire site by default.
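If you'd rather check from a script than a browser, a short Python sketch can fetch the file and distinguish "exists" from "missing." This helper function and its name are my own illustration, not an official tool; it only uses the standard library:

```python
import urllib.error
import urllib.request

def fetch_robots_txt(domain: str):
    """Return the robots.txt body for a domain, or None if the server answers 404."""
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no robots.txt: crawlers treat the site as fully open
        raise
```

Call it with your own domain and print the result; a `None` return is the scripted equivalent of seeing that 404 in your browser.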
Step 2: Understand What You're Looking At
The robots.txt syntax is simple but easy to misread. Here's what matters:
User-agent: specifies which crawler the following rules apply to. The asterisk (*) is a wildcard that means "all crawlers."
Disallow: tells the specified crawler it cannot access the given path. Disallow: / means "block everything."
Allow: explicitly permits access to a path, useful for creating exceptions within a broader block.
The critical thing to understand: a crawler obeys only the most specific User-agent group that matches it. If you have rules for User-agent: * and a separate group for User-agent: GPTBot, GPTBot follows its own group and ignores the wildcard rules entirely.
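You can sanity-check this precedence behavior with Python's standard-library urllib.robotparser, which implements the same group-matching logic. The rules and URLs below are made-up examples:

```python
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot matches its own group, so the wildcard rules don't apply to it:
print(parser.can_fetch("GPTBot", "https://example.com/page"))        # False
# A crawler with no dedicated group falls back to the * rules:
print(parser.can_fetch("SomeOtherBot", "https://example.com/page"))  # True
```

Even though the wildcard group allows most of the site, GPTBot is still fully blocked because its dedicated group takes precedence.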
The Complete AI Crawler Reference
Here's every major AI crawler you need to know about, organized by company. Each entry includes the user-agent token (what you put in robots.txt), what it does, and whether you probably want to allow or block it.
OpenAI (ChatGPT)
OpenAI operates three distinct crawlers, each serving a different purpose:
GPTBot
User-agent: GPTBot
Purpose: Collects content for training future AI models (GPT-4, GPT-5, etc.)
Respects robots.txt: Yes
Recommendation: Block if you don't want your content used for model training. Allow if you want to be part of ChatGPT's knowledge base.
OAI-SearchBot
User-agent: OAI-SearchBot
Purpose: Powers ChatGPT's real-time search and citation features
Respects robots.txt: Yes
Recommendation: Allow if you want to appear in ChatGPT search results and receive referral traffic.
ChatGPT-User
User-agent: ChatGPT-User
Purpose: Fetches content when users explicitly ask ChatGPT to browse a URL
Respects robots.txt: Yes
Recommendation: Allow unless you have specific reasons to block user-initiated browsing.
This separation matters. You can block GPTBot (no training) while allowing OAI-SearchBot (yes to search citations) — giving you control over how your content is used without disappearing from ChatGPT entirely.
Anthropic (Claude)
Anthropic uses multiple user-agents that have evolved over time:
ClaudeBot
User-agent: ClaudeBot
Purpose: Primary crawler for training Claude models and fetching citations during chat
Respects robots.txt: Yes
Recommendation: Allow if you want visibility in Claude responses.
anthropic-ai
User-agent: anthropic-ai
Purpose: Bulk data collection for model training
Respects robots.txt: Yes
Recommendation: Block if you want to prevent training data collection while potentially allowing other Anthropic access.
Claude-Web
User-agent: Claude-Web
Purpose: Web-focused crawling (purpose not fully documented)
Respects robots.txt: Yes
Recommendation: Include in your directives for complete Anthropic coverage.
Important note: Anthropic consolidated multiple crawlers into ClaudeBot in 2024. Sites that only blocked the older anthropic-ai user-agent may have inadvertently given ClaudeBot unrestricted access.
Perplexity
PerplexityBot
User-agent: PerplexityBot
Purpose: Indexes websites to build Perplexity's search database
Respects robots.txt: Yes (officially)
Recommendation: Allow if you want to appear in Perplexity search results.
Perplexity-User
User-agent: Perplexity-User
Purpose: Fetches content when users provide specific URLs as context
Respects robots.txt: Sometimes ignores directives for user-provided URLs
Recommendation: Be aware this bot may bypass robots.txt in user-initiated scenarios.
Perplexity has faced controversy over crawler behavior. Their documentation now states that Perplexity-User can ignore robots.txt when a user provides a specific URL — essentially treating user-initiated requests differently from automated crawling.
Google (Gemini and AI Overviews)
Google-Extended
User-agent: Google-Extended
Purpose: Controls whether your content is used to train Gemini and other Google AI models
Respects robots.txt: Yes
Recommendation: Block to prevent training usage while maintaining normal search visibility.
GoogleOther
User-agent: GoogleOther
Purpose: Miscellaneous Google product crawling (R&D, one-off projects)
Respects robots.txt: Yes
Recommendation: Block unless you want to participate in experimental Google products.
Critical distinction: Google-Extended is a control token, not a traditional crawler. Blocking it doesn't prevent crawling — it prevents usage of already-crawled content for AI training. Your normal search rankings through Googlebot are unaffected.
Apple
Applebot-Extended
User-agent: Applebot-Extended
Purpose: Collects content for Apple Intelligence, Siri, and other Apple AI features
Respects robots.txt: Yes
Recommendation: Allow if you want visibility in Apple's AI ecosystem.
Note: If your robots.txt mentions Googlebot but not Applebot, Apple's crawler will follow Googlebot's rules by default.
Meta
Meta-ExternalAgent
User-agent: Meta-ExternalAgent
Purpose: Gathers data for Meta's AI models and products
Respects robots.txt: Yes
Recommendation: Block unless you want content used in Meta's AI training.
Amazon
Amazonbot
User-agent: Amazonbot
Purpose: Powers Alexa responses and Amazon's AI features
Respects robots.txt: Yes (also respects noarchive meta tag)
Recommendation: Allow if you want to appear in Alexa answers.
Common Crawl
CCBot
User-agent: CCBot
Purpose: Creates open web archives used by many AI companies for training data
Respects robots.txt: Yes
Recommendation: Block to prevent your content from entering Common Crawl datasets, which are widely used for AI training across the industry.
Common Crawl is nonprofit, but its datasets have been foundational for training models from OpenAI, Anthropic, and many others. Blocking CCBot is a broad-spectrum approach to limiting training data collection.
Other Notable Crawlers
- Bytespider (ByteDance/TikTok) — AI training and content analysis
- cohere-ai (Cohere) — Enterprise AI model training
- DuckAssistBot (DuckDuckGo) — AI-assisted search features
- YouBot (You.com) — AI search engine indexing
Step 3: Check Your Current AI Bot Permissions
Now that you know what to look for, scan your robots.txt for these user-agents. You're checking for three scenarios:
Scenario A: Explicit rules exist
If you see directives like:
User-agent: GPTBot
Disallow: /
The bot is explicitly blocked from your entire site.
Scenario B: No specific rules, but wildcard exists
If your file only has:
User-agent: *
Allow: /
All AI crawlers have full access (along with every other bot).
Scenario C: Wildcard blocks everything
If you have:
User-agent: *
Disallow: /
Every crawler — including all AI bots — is blocked. This might be intentional, but it also blocks search engines.
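If you'd rather script this scan than eyeball the file, the same standard-library parser can run the check for every user-agent token in this guide at once. The audit_robots_txt helper and the example.com placeholder are my own sketch, not an official tool:

```python
import urllib.robotparser

# User-agent tokens covered in this guide
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "anthropic-ai", "Claude-Web",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "GoogleOther",
    "Applebot-Extended", "Meta-ExternalAgent",
    "Amazonbot", "CCBot", "Bytespider",
    "cohere-ai", "DuckAssistBot", "YouBot",
]

def audit_robots_txt(robots_txt: str, site: str = "https://example.com/") -> dict:
    """Map each AI user-agent token to True (allowed) or False (blocked)
    for the given page, based on the robots.txt text supplied."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, site) for agent in AI_CRAWLERS}
```

Paste in the text from yourdomain.com/robots.txt and the returned dictionary tells you at a glance which of the three scenarios above applies to each bot.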
Common Mistakes That Break AI Crawler Access
After auditing hundreds of robots.txt files, these are the errors I see most often:
Mistake 1: Forgetting the Disallow Line
This does nothing:
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
You listed the bots but gave them no instructions. Without a Disallow: or Allow: directive, the rule is meaningless. Always include at least one directive after each user-agent block:
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /
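You can see the difference in practice with urllib.robotparser: a group of User-agent lines with no directive is simply discarded, leaving those bots unrestricted. The snippet below is a demonstration with made-up URLs:

```python
import urllib.robotparser

# A group of User-agent lines with no directive after it is ignored
broken = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
"""
p1 = urllib.robotparser.RobotFileParser()
p1.parse(broken.splitlines())
print(p1.can_fetch("GPTBot", "https://example.com/"))  # True: nothing blocked

# Adding a single Disallow makes the whole group take effect
fixed = broken + "Disallow: /\n"
p2 = urllib.robotparser.RobotFileParser()
p2.parse(fixed.splitlines())
print(p2.can_fetch("GPTBot", "https://example.com/"))  # False
```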
Mistake 2: Case Sensitivity Issues
User-agent matching is case-insensitive in most implementations, but not all. Some crawlers are picky. Use the exact casing from the official documentation:
- GPTBot (not gptbot or GPTBOT)
- ClaudeBot (not claudebot or Claudebot)
- PerplexityBot (not perplexitybot)
Mistake 3: Blocking Training But Not Realizing Search Is Separate
If you block GPTBot to prevent training, you might assume you're invisible to ChatGPT entirely. But OAI-SearchBot is a separate crawler. Users can still find your site through ChatGPT's search feature unless you block that too.
Decide what you actually want:
- Block training only: block GPTBot, allow OAI-SearchBot and ChatGPT-User
- Block everything OpenAI: block all three
Mistake 4: Outdated Crawler Lists
AI companies launch new crawlers constantly. If your robots.txt was last updated in 2023, you're missing directives for:
- OAI-SearchBot (launched late 2024)
- Claude-Web
- Perplexity-User
- Google-Extended
- Meta-ExternalAgent
Review and update your directives quarterly at minimum.
Mistake 5: Wildcard Rules That Override Specific Rules
Order matters in some parsers. If you have:
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
Most crawlers will correctly apply the GPTBot-specific rule. But some may not. For maximum compatibility, list specific bot rules before wildcard rules.
Sample robots.txt Configurations
Here are ready-to-use configurations for common scenarios:
Allow Everything (Maximum AI Visibility)
# Allow all AI crawlers for maximum visibility
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: CCBot
Allow: /
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
Block Training, Allow Search Citations
# Block training crawlers
User-agent: GPTBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Google-Extended
User-agent: Meta-ExternalAgent
User-agent: Bytespider
Disallow: /
# Allow search and citation crawlers
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: Applebot-Extended
Allow: /
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
Block All AI Crawlers
# Block all known AI crawlers
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: Amazonbot
User-agent: CCBot
User-agent: Bytespider
User-agent: cohere-ai
User-agent: DuckAssistBot
User-agent: YouBot
Disallow: /
# Allow traditional search engines
User-agent: Googlebot
User-agent: Bingbot
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
The Catch: Robots.txt Is a Request, Not a Barrier
Here's the uncomfortable truth: robots.txt is a voluntary protocol. It's a polite request, not a wall. Major AI companies like OpenAI, Anthropic, and Google officially state they respect robots.txt directives — and they generally do.
But smaller or less scrupulous crawlers may ignore your rules entirely. And even compliant crawlers only check robots.txt periodically — changes aren't instant.
For truly sensitive content, robots.txt alone isn't enough. You'll need server-level blocks, authentication, or IP-based restrictions. But for controlling how the major AI platforms interact with your public content, robots.txt remains the standard mechanism.
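To make the server-level idea concrete, here's a minimal sketch of user-agent blocking as a WSGI middleware. The blocked token list and the 403 response are illustrative choices of mine; a production setup would more likely enforce this at the CDN or web-server layer, and user-agent strings can be spoofed, so treat this as a deterrent rather than a guarantee:

```python
# Sketch of application-layer blocking: deny requests whose User-Agent
# header matches known AI crawler tokens, independent of robots.txt.
BLOCKED_UA_TOKENS = ("gptbot", "ccbot", "bytespider")  # illustrative subset

def block_ai_crawlers(app):
    """Wrap a WSGI app so matching user agents receive 403 Forbidden."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in BLOCKED_UA_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Unlike robots.txt, this actually refuses the request at response time rather than asking the crawler to stay away.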
Skip the Manual Checking
Manually auditing your robots.txt against a dozen AI crawlers is tedious. And the landscape changes constantly — new bots appear, companies consolidate user-agents, and what worked last quarter may have gaps today.
Skip the manual checking — AEO Tester scans all AI bot permissions in one click. Enter any URL and instantly see which AI crawlers can access your site, which are blocked, and whether your configuration matches your actual intent.
Because in a world where AI visibility increasingly depends on crawler access, knowing what your robots.txt actually says isn't optional — it's essential.
Scan Your AI Crawler Permissions
Free Chrome extension. One-click audit of every AI bot.
Add to Chrome — It's Free