How to Check If AI Crawlers Can Access Your Website (GPTBot, ClaudeBot, PerplexityBot)

Your robots.txt file is quietly deciding your future in AI search. Here's how to audit it in five minutes.

That simple text file sitting at the root of your website tells every crawler on the internet what it can and can't access. And right now, the crawlers that matter most aren't from Google — they're from OpenAI, Anthropic, Perplexity, and a growing list of AI companies whose bots determine whether your content gets cited in ChatGPT, Claude, or any of the other AI answer engines reshaping how people find information online.

The problem? Most website owners have no idea what their robots.txt actually says to these bots. Some are accidentally blocking all AI crawlers with a single wildcard rule they added years ago. Others are wide open when they'd prefer to maintain control. And many are missing directives entirely for crawlers that didn't exist six months ago.

This guide walks you through exactly how to check your AI crawler permissions — manually, for free, in the next five minutes.

Step 1: Find Your robots.txt File

Your robots.txt file lives at the root of your domain. To view it, simply add /robots.txt to the end of your homepage URL:

https://yourdomain.com/robots.txt

Open this in your browser. You'll see a plain text file with directives that look something like this:

User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

If you get a 404 error, your site doesn't have a robots.txt file at all — which means every crawler (including AI bots) has unrestricted access to your entire site by default.

Step 2: Understand What You're Looking At

The robots.txt syntax is simple but easy to misread. Here's what matters:

User-agent: specifies which crawler the following rules apply to. The asterisk (*) is a wildcard that means "all crawlers."

Disallow: tells the specified crawler it cannot access the given path. Disallow: / means "block everything."

Allow: explicitly permits access to a path, useful for creating exceptions within a broader block.

The critical thing to understand: a crawler follows the most specific user-agent group that matches it, not simply the first rule in the file. If you have a rule for User-agent: * and a separate rule for User-agent: GPTBot, the GPTBot-specific rule takes precedence for that crawler.
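You can verify this precedence with Python's standard-library robots.txt parser, which checks crawler-specific groups before the wildcard group regardless of their order in the file. A minimal sketch, using an illustrative robots.txt:

```python
from urllib import robotparser

# Illustrative file: the wildcard comes FIRST, the specific rule second.
rules = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The GPTBot-specific group overrides the wildcard for that crawler.
print(rp.can_fetch("GPTBot", "https://example.com/post"))        # False
# Any other bot falls through to the wildcard group.
print(rp.can_fetch("SomeOtherBot", "https://example.com/post"))  # True
```

The same parser is what many Python crawlers use internally, so it is a reasonable reference for how a compliant bot interprets your file.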

The Complete AI Crawler Reference

Here's every major AI crawler you need to know about, organized by company. Each entry includes the user-agent token (what you put in robots.txt), what it does, and whether you probably want to allow or block it.

OpenAI (ChatGPT)

OpenAI operates three distinct crawlers, each serving a different purpose:

GPTBot

User-agent: GPTBot

Purpose: Collects content for training future AI models (GPT-4, GPT-5, etc.)

Respects robots.txt: Yes

Recommendation: Block if you don't want your content used for model training. Allow if you want to be part of ChatGPT's knowledge base.

OAI-SearchBot

User-agent: OAI-SearchBot

Purpose: Powers ChatGPT's real-time search and citation features

Respects robots.txt: Yes

Recommendation: Allow if you want to appear in ChatGPT search results and receive referral traffic.

ChatGPT-User

User-agent: ChatGPT-User

Purpose: Fetches content when users explicitly ask ChatGPT to browse a URL

Respects robots.txt: Yes

Recommendation: Allow unless you have specific reasons to block user-initiated browsing.

This separation matters. You can block GPTBot (no training) while allowing OAI-SearchBot (yes to search citations) — giving you control over how your content is used without disappearing from ChatGPT entirely.

Anthropic (Claude)

Anthropic uses multiple user-agents that have evolved over time:

ClaudeBot

User-agent: ClaudeBot

Purpose: Primary crawler for training Claude models and fetching citations during chat

Respects robots.txt: Yes

Recommendation: Allow if you want visibility in Claude responses.

anthropic-ai

User-agent: anthropic-ai

Purpose: Bulk data collection for model training

Respects robots.txt: Yes

Recommendation: Block if you want to prevent training data collection while potentially allowing other Anthropic access.

Claude-Web

User-agent: Claude-Web

Purpose: Web-focused crawling (purpose not fully documented)

Respects robots.txt: Yes

Recommendation: Include in your directives for complete Anthropic coverage.

Important note: Anthropic consolidated multiple crawlers into ClaudeBot in 2024. Sites that only blocked the older anthropic-ai user-agent may have inadvertently given ClaudeBot unrestricted access.

Perplexity

PerplexityBot

User-agent: PerplexityBot

Purpose: Indexes websites to build Perplexity's search database

Respects robots.txt: Yes (officially)

Recommendation: Allow if you want to appear in Perplexity search results.

Perplexity-User

User-agent: Perplexity-User

Purpose: Fetches content when users provide specific URLs as context

Respects robots.txt: Sometimes ignores directives for user-provided URLs

Recommendation: Be aware this bot may bypass robots.txt in user-initiated scenarios.

Perplexity has faced controversy over crawler behavior. Their documentation now states that Perplexity-User can ignore robots.txt when a user provides a specific URL — essentially treating user-initiated requests differently from automated crawling.

Google (Gemini and AI Overviews)

Google-Extended

User-agent: Google-Extended

Purpose: Controls whether your content is used to train Gemini and other Google AI models

Respects robots.txt: Yes

Recommendation: Block to prevent training usage while maintaining normal search visibility.

GoogleOther

User-agent: GoogleOther

Purpose: Miscellaneous Google product crawling (R&D, one-off projects)

Respects robots.txt: Yes

Recommendation: Block unless you want to participate in experimental Google products.

Critical distinction: Google-Extended is a control token, not a traditional crawler. Blocking it doesn't prevent crawling — it prevents usage of already-crawled content for AI training. Your normal search rankings through Googlebot are unaffected.

Apple

Applebot-Extended

User-agent: Applebot-Extended

Purpose: Collects content for Apple Intelligence, Siri, and other Apple AI features

Respects robots.txt: Yes

Recommendation: Allow if you want visibility in Apple's AI ecosystem.

Note: If your robots.txt mentions Googlebot but not Applebot, Apple's crawler will follow Googlebot's rules by default.

Meta

Meta-ExternalAgent

User-agent: Meta-ExternalAgent

Purpose: Gathers data for Meta's AI models and products

Respects robots.txt: Yes

Recommendation: Block unless you want content used in Meta's AI training.

Amazon

Amazonbot

User-agent: Amazonbot

Purpose: Powers Alexa responses and Amazon's AI features

Respects robots.txt: Yes (also respects noarchive meta tag)

Recommendation: Allow if you want to appear in Alexa answers.

Common Crawl

CCBot

User-agent: CCBot

Purpose: Creates open web archives used by many AI companies for training data

Respects robots.txt: Yes

Recommendation: Block to prevent your content from entering Common Crawl datasets, which are widely used for AI training across the industry.

Common Crawl is nonprofit, but its datasets have been foundational for training models from OpenAI, Anthropic, and many others. Blocking CCBot is a broad-spectrum approach to limiting training data collection.

Other Notable Crawlers

A few more user-agents worth including in your directives: Bytespider (ByteDance's crawler, which has a reputation for ignoring robots.txt), cohere-ai (Cohere's training crawler), DuckAssistBot (powers DuckDuckGo's AI-assisted answers), and YouBot (You.com's search crawler). All four appear in the sample configurations later in this guide.

Step 3: Check Your Current AI Bot Permissions

Now that you know what to look for, scan your robots.txt for these user-agents. You're checking for three scenarios:

Scenario A: Explicit rules exist

If you see directives like:

User-agent: GPTBot
Disallow: /

The bot is explicitly blocked from your entire site.

Scenario B: No specific rules, but wildcard exists

If your file only has:

User-agent: *
Allow: /

All AI crawlers have full access (along with every other bot).

Scenario C: Wildcard blocks everything

If you have:

User-agent: *
Disallow: /

Every crawler — including all AI bots — is blocked. This might be intentional, but it also blocks search engines.
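The three scenarios above can be checked programmatically. A minimal sketch, again using Python's stdlib parser; the bot list and the sample robots.txt text here are illustrative:

```python
from urllib import robotparser

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "PerplexityBot", "Google-Extended", "CCBot"]

# Sample file: blocks GPTBot and CCBot explicitly, allows everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

def check_ai_access(robots_txt: str, path: str = "/") -> dict:
    """Return {bot: True/False} for whether each AI bot may fetch `path`."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, path) for bot in AI_BOTS}

# GPTBot and CCBot report blocked; the rest fall through to the wildcard.
for bot, is_allowed in check_ai_access(ROBOTS_TXT).items():
    print(f"{bot}: {'allowed' if is_allowed else 'blocked'}")
```

To audit a live site, fetch https://yourdomain.com/robots.txt and pass its text to the same function.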

Common Mistakes That Break AI Crawler Access

After auditing hundreds of robots.txt files, these are the errors I see most often:

Mistake 1: Forgetting the Disallow Line

This does nothing:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot

You listed the bots but gave them no instructions. Without a Disallow: or Allow: directive, the rule is meaningless. Always include at least one directive after each user-agent block:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /
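You can confirm both behaviors with the stdlib parser: a group with no directive imposes no restriction, while a single Disallow after grouped User-agent lines covers every bot in the group.

```python
from urllib import robotparser

def allowed(robots_txt: str, agent: str) -> bool:
    """Parse a robots.txt string and check one crawler against it."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, "https://example.com/page")

# A group with no directive is effectively ignored: the bots stay allowed.
no_directive = "User-agent: GPTBot\nUser-agent: ClaudeBot\n"

# One Disallow after the grouped User-agent lines blocks every listed bot.
with_block = "User-agent: GPTBot\nUser-agent: ClaudeBot\nDisallow: /\n"

print(allowed(no_directive, "GPTBot"))   # True
print(allowed(with_block, "GPTBot"))     # False
print(allowed(with_block, "ClaudeBot"))  # False
```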

Mistake 2: Case Sensitivity Issues

User-agent matching is case-insensitive in most implementations, but not all. Some crawlers are picky. Use the exact casing from the official documentation: GPTBot, not gptbot; ClaudeBot, not claudebot; PerplexityBot, not perplexitybot.

Mistake 3: Blocking Training But Not Realizing Search Is Separate

If you block GPTBot to prevent training, you might assume you're invisible to ChatGPT entirely. But OAI-SearchBot is a separate crawler. Users can still find your site through ChatGPT's search feature unless you block that too.

Decide what you actually want: blocked from training only, blocked from search citations only, or blocked from both, and then write directives for each crawler accordingly.

Mistake 4: Outdated Crawler Lists

AI companies launch new crawlers constantly. If your robots.txt was last updated in 2023, you're likely missing directives for newer user-agents such as OAI-SearchBot, Perplexity-User, Applebot-Extended, and Meta-ExternalAgent.

Review and update your directives quarterly at minimum.

Mistake 5: Wildcard Rules That Override Specific Rules

Order matters in some parsers. If you have:

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

Most crawlers will correctly apply the GPTBot-specific rule. But some may not. For maximum compatibility, list specific bot rules before wildcard rules.

Sample robots.txt Configurations

Here are ready-to-use configurations for common scenarios:

Allow Everything (Maximum AI Visibility)

# Allow all AI crawlers for maximum visibility
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: CCBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Block Training, Allow Search Citations

# Block training crawlers
User-agent: GPTBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Google-Extended
User-agent: Meta-ExternalAgent
User-agent: Bytespider
Disallow: /

# Allow search and citation crawlers
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: Applebot-Extended
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Block All AI Crawlers

# Block all known AI crawlers
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Claude-Web
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: Amazonbot
User-agent: CCBot
User-agent: Bytespider
User-agent: cohere-ai
User-agent: DuckAssistBot
User-agent: YouBot
Disallow: /

# Allow traditional search engines
User-agent: Googlebot
User-agent: Bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

The Catch: Robots.txt Is a Request, Not a Barrier

Here's the uncomfortable truth: robots.txt is a voluntary protocol. It's a polite request, not a wall. Major AI companies like OpenAI, Anthropic, and Google officially state they respect robots.txt directives — and they generally do.

But smaller or less scrupulous crawlers may ignore your rules entirely. And even compliant crawlers only check robots.txt periodically — changes aren't instant.

For truly sensitive content, robots.txt alone isn't enough. You'll need server-level blocks, authentication, or IP-based restrictions. But for controlling how the major AI platforms interact with your public content, robots.txt remains the standard mechanism.
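If you need enforcement rather than a request, the usual first step is matching the User-Agent header at the server or application layer. Here is a hypothetical sketch in Python; the token list mirrors this guide, and real deployments typically do this in nginx, a CDN rule, or a WAF instead. Note that user-agent strings can be spoofed, so verifying the published IP ranges of each crawler is stronger.

```python
# Hypothetical application-layer filter: deny requests whose User-Agent
# contains a known AI-crawler token. Matching is case-insensitive.
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider")

def is_blocked(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches a blocked crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_TOKENS)

# Illustrative user-agent strings:
print(is_blocked("Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"))  # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; rv:120.0) Firefox/120.0"))            # False
```

In a web framework you would call a check like this in middleware and return a 403 on a match.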

Skip the Manual Checking

Manually auditing your robots.txt against a dozen AI crawlers is tedious. And the landscape changes constantly — new bots appear, companies consolidate user-agents, and what worked last quarter may have gaps today.

AEO Tester scans all AI bot permissions in one click. Enter any URL and instantly see which AI crawlers can access your site, which are blocked, and whether your configuration matches your actual intent.

Because in a world where AI visibility increasingly depends on crawler access, knowing what your robots.txt actually says isn't optional — it's essential.

Scan Your AI Crawler Permissions

Free Chrome extension. One-click audit of every AI bot.

Add to Chrome — It's Free