What Is robots.txt and How Do AI Crawlers Use It?
A robots.txt file is a plain text file at the root of your website (yourdomain.com/robots.txt) that tells web crawlers which pages they can and can't visit. It was originally built for search engine bots like Googlebot, but it now controls access for a growing number of AI crawlers — including GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), and many others.
When a well-behaved AI crawler visits your site, the first thing it does is read your robots.txt file. If it finds a Disallow: / rule for its user-agent, it skips your content. That means your pages won't show up in AI-generated answers, won't be included in model training data, and won't appear in AI-powered search results. Keep in mind that robots.txt is a convention rather than an enforcement mechanism: the major AI operators say their crawlers honor it, but compliance is voluntary.
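This decision can be simulated with Python's standard-library robots.txt parser. The sketch below parses a minimal example file (the domain and URL are placeholders) and shows how a Disallow: / rule turns one bot away while leaving unlisted bots free to crawl:

```python
from urllib import robotparser

# Example robots.txt that blocks OpenAI's GPTBot and nothing else.
rules = """\
User-agent: GPTBot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# GPTBot matches the Disallow group; a bot with no matching
# group (and no "*" group) falls through to "allowed".
print(rp.can_fetch("GPTBot", "https://yourdomain.com/article"))        # False
print(rp.can_fetch("PerplexityBot", "https://yourdomain.com/article")) # True
```

The same parser can be pointed at a live site with `rp.set_url("https://yourdomain.com/robots.txt")` followed by `rp.read()`, which is a quick way to spot-check your own rules from the command line.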
The problem is that most website owners have no idea what their robots.txt says about AI bots — or whether it says anything at all. Many sites are either accidentally blocking AI crawlers they want to allow, or wide open to bots they'd rather keep out. This AI robots.txt checker lets you find out in seconds which AI bots can currently access your site.
AI Training Bots vs. AI Search Bots — The Key Difference
Not all AI crawlers do the same thing. Before you decide what to block or allow, it helps to understand the two main types.
Training Bots
Training bots scrape your content to build datasets for AI model training. Once your content enters the training data, it gets blended into the model's knowledge — future answers are generated from it without direct attribution or a link back to your page.
Common training bots include GPTBot (OpenAI), CCBot (Common Crawl), Google-Extended (Google), Bytespider (ByteDance/TikTok), and Meta-ExternalAgent (Meta).
AI Search Bots
AI search bots fetch your content in real time when a user asks a question. They pull relevant information from your page and typically cite you as the source with a direct link. This is essentially referral traffic from AI search engines.
Common AI search bots include PerplexityBot (Perplexity), OAI-SearchBot (ChatGPT Search), Applebot-Extended (Apple/Siri), and ChatGPT-User (activated when a user explicitly browses a link in ChatGPT).
What This Means for Your Strategy
For most websites, the smart move is to allow AI search bots (so you get cited and linked when AI platforms answer questions your content covers) while making an informed decision about training bots based on your content licensing and business goals. Some publishers block all training bots but keep search bots open. Others allow everything to maximize exposure.
There's no universally right answer — but you need to know where you stand. Use this checker to see exactly which bots your current robots.txt blocks or allows.
Complete List of AI Crawlers Checked
This tool checks your robots.txt against the following AI bots. The list is updated as new crawlers emerge.
| Bot Name | Operator | Type | What It Does |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training | Collects data for training GPT models |
| ChatGPT-User | OpenAI | Search/Browse | Fetches pages when users click links in ChatGPT |
| OAI-SearchBot | OpenAI | Search | Indexes content for ChatGPT Search results |
| ClaudeBot | Anthropic | Training | Collects data for training Claude models |
| anthropic-ai | Anthropic | Training | Older Anthropic training crawler |
| PerplexityBot | Perplexity | Search | Fetches content for real-time AI search answers |
| Google-Extended | Google | Training | Controls use of content for Gemini/AI training |
| Applebot-Extended | Apple | Search | Powers AI features in Siri and Apple Intelligence |
| CCBot | Common Crawl | Training | Open dataset used by many AI companies for training |
| Bytespider | ByteDance | Training | Collects data for TikTok/ByteDance AI models |
| Meta-ExternalAgent | Meta | Training | Collects data for Meta's AI products |
| Amazonbot | Amazon | Search/Training | Powers Alexa and Amazon AI features |
| cohere-ai | Cohere | Training | Collects data for Cohere language models |
| DuckAssistBot | DuckDuckGo | Search | Powers DuckDuckGo's AI-assisted answers |
| YouBot | You.com | Search | Indexes content for You.com AI search |
How to Block or Allow Specific AI Crawlers
Once you've checked your robots.txt with this tool, you might want to make changes. Here's how.
Block a Specific AI Bot
Add these lines to your robots.txt file to prevent a specific crawler from accessing your site:
User-agent: GPTBot
Disallow: /
Replace GPTBot with any bot name from the list above. You can't put multiple bot names on a single User-agent line, but consecutive User-agent lines can share one group of rules, so either give each bot its own block or stack several User-agent lines above a single Disallow: /.
Allow AI Search Bots but Block Training Bots
This is the most popular configuration for publishers who want AI visibility without contributing training data:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Allow AI search crawlers (for citations and referral traffic)
User-agent: PerplexityBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Applebot-Extended
Allow: /
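Before deploying a mixed policy like the one above, it's worth sanity-checking what each bot actually gets. This sketch feeds an abbreviated version of the config into Python's stdlib parser and reports the verdict per bot (the URL is a placeholder):

```python
from urllib import robotparser

# Abbreviated version of the "block training, allow search" policy above.
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(policy)

url = "https://yourdomain.com/post"
for bot in ["GPTBot", "CCBot", "PerplexityBot", "OAI-SearchBot"]:
    verdict = "allowed" if rp.can_fetch(bot, url) else "blocked"
    print(f"{bot}: {verdict}")
```

Note that OAI-SearchBot comes back as allowed even though it isn't listed: a bot with no matching group defaults to allowed, which is exactly why forgetting a training bot leaves it free to crawl.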
Block All AI Crawlers
If you want to prevent any AI bot from crawling your content, you'll need to list each one individually. There is no single wildcard that targets only AI bots. See the full list above and add a Disallow: / block for each.
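Writing fifteen near-identical blocks by hand is tedious and error-prone, so one option is to generate the file from a list. This is a sketch, not an official tool: `block_all` is a made-up helper, and the bot list mirrors the table above.

```python
# Bot names taken from the table above; update this list as new crawlers emerge.
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "Applebot-Extended", "CCBot",
    "Bytespider", "Meta-ExternalAgent", "Amazonbot", "cohere-ai",
    "DuckAssistBot", "YouBot",
]

def block_all(bots):
    """Return robots.txt text with a 'Disallow: /' group for each bot."""
    return "\n".join(f"User-agent: {bot}\nDisallow: /\n" for bot in bots)

print(block_all(AI_BOTS))
```

Append the generated text to your existing robots.txt rather than replacing it, so your rules for search engine bots like Googlebot stay intact.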
After Making Changes
Save your robots.txt file and come back to this tool to verify the changes took effect. Well-behaved bots apply the updated rules the next time they fetch your robots.txt, but crawlers typically cache the file (often for up to 24 hours), and there's no way to force an immediate recrawl.
Checking Your AI Visibility Beyond robots.txt
Your robots.txt file is the first thing AI crawlers check, but it's not the only factor that determines whether your content shows up in AI-generated answers. A complete AI visibility audit also looks at:
- Structured data (Schema markup): AI systems rely on structured data to understand what your page is about. Proper schema markup helps bots extract accurate information from your content.
- Semantic HTML and heading structure: Clear, hierarchical headings (H1, H2, H3) help AI crawlers parse the structure of your content and pull relevant sections for answers.
- Content clarity and answer-readiness: AI search engines prefer content that directly answers questions in a clear, concise format. Pages structured around specific questions tend to get cited more often.
- Crawl accessibility: Beyond robots.txt, issues like slow load times, JavaScript rendering requirements, or authentication walls can prevent AI bots from accessing your content.
These factors together determine your AEO (Answer Engine Optimization) score — a measure of how ready your site is to appear in AI-generated answers across platforms like ChatGPT, Perplexity, Google AI Overviews, and Claude.
Want a full AI visibility check that covers all of this? Install the AI Visibility Tool Chrome extension for a complete analysis.