Should You Block AI Crawlers? A Practical robots.txt Guide

For most small and mid-sized businesses, the answer is no — you should not block AI crawlers. Blocking keeps your content out of ChatGPT, Perplexity, and Copilot answers at exactly the moment customers are starting to ask those tools who to hire. Blocking genuinely benefits publishers whose content is the product; for everyone else, AI visibility is typically worth far more than content protection.

Know which crawlers you are actually dealing with

“AI crawlers” is not one thing. Different bots do different jobs, and a robots.txt rule that makes sense for one can quietly hurt you on another. These are the user agents and robots.txt tokens that matter most right now:

GPTBot — OpenAI’s training crawler. It gathers content that may be used to train future models. Blocking it affects training only — not whether you appear in ChatGPT search results.
OAI-SearchBot — OpenAI’s search crawler. It builds the index ChatGPT uses to find and link to websites. Block it and your pages stop showing up as cited sources in ChatGPT answers.
ChatGPT-User — an on-demand fetcher, not a bulk crawler. When a user asks ChatGPT to open or summarize a specific page, this agent retrieves it in real time.
ClaudeBot — Anthropic’s primary crawler for Claude. Anthropic also runs separate user-triggered agents, so check its docs for fine-grained control.
PerplexityBot — builds Perplexity’s search index. Perplexity leans heavily on live web retrieval, so this one directly affects whether your pages get cited in its answers.
Google-Extended — not a separate crawler but a robots.txt token. It tells Google not to use your content for Gemini model training. Blocking it does not affect your Google Search rankings or your eligibility for AI Overviews, which follow normal Googlebot rules.
Bingbot — the standard Bing crawler, which also feeds Microsoft Copilot. Blocking it removes you from Bing search and Copilot answers in one stroke.

The real trade-off: content protection vs. AI visibility

The blocking decision comes down to one question: is your content the product, or is it the marketing?

If you sell access to original content — investigative journalism, proprietary research, paid courses — there is a real argument for blocking training crawlers. Your words have standalone commercial value, and you may not want them absorbed into models for free.

If you are a plumber, a law firm, a dental practice, or a software company, your content exists to bring in customers. An AI assistant quoting your service page is distribution, not theft. The “protection” you gain by blocking is mostly protection from being discovered.

Why most small businesses should allow AI crawlers

Buyers increasingly ask assistants what they used to type into Google: “best HVAC company near me,” “which CRM fits a five-person team.” An assistant’s answer pulls from pages its crawlers can read, plus third-party sources like reviews and directories.

If your pages are blocked, the assistant doesn’t forget you exist — it builds its answer from whatever else it can find. You lose the chance to be the primary source on your own services and service area, which is often the difference between being recommended accurately and being skipped. If you suspect that’s already happening, start with our guide on why your business isn’t showing up in AI search results.

There’s a compounding effect, too. Pages that AI search crawlers can fetch get cited; citations create brand mentions; mentions feed future answers. Our walkthrough on how to get cited by Perplexity covers that loop in detail.

What blocking does — and what it doesn’t

Be clear-eyed about what a Disallow line actually buys you.

It does tell compliant crawlers to stop fetching your pages. The major bots listed above generally honor robots.txt.
It does remove you from AI search indexes over time. OAI-SearchBot and PerplexityBot can’t cite what they can’t crawl.
It does not remove content already collected. Robots.txt is forward-looking; it is not a recall mechanism.
It does not stop bad actors. Scrapers that ignore robots.txt will keep ignoring it. Blocking is a policy signal, not a security control.
It does not make AI forget your business. Models still learn about you from directories, review sites, news coverage, and social profiles — sources you don’t control.

Sample robots.txt policies

Robots.txt lives at the root of each domain (yoursite.com/robots.txt), and every subdomain needs its own file. A missing rule means “allowed,” so an allow-all policy mostly means staying out of the way.
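
To make the per-host rule concrete, here is a minimal Python sketch that works out which robots.txt file governs a given page. The function name robots_url_for and the blog.yoursite.com subdomain are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # robots.txt is scoped to scheme + host (+ port), so the page path is dropped.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Hypothetical pages on two hosts of the same site.
pages = [
    "https://www.yoursite.com/services/drain-cleaning",
    "https://blog.yoursite.com/2024/winter-checklist",
]
for page in pages:
    print(page, "->", robots_url_for(page))
```

The two pages map to two different files, which is why a policy set on www does nothing for a blog or shop subdomain.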

Option 1: allow everything (recommended for most SMBs)

# Allow all crawlers, including AI bots
User-agent: *
Allow: /

Sitemap: https://www.yoursite.com/sitemap.xml

This is the default posture; you don’t need to name AI bots individually to allow them.

Option 2: block training, allow search and live fetches

A middle path: keep your content out of model training while staying visible in AI search and live browsing.

# Block model-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Keep search indexing and user-triggered fetches open
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
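
Before you deploy a policy like this, you can sanity-check it offline. The sketch below feeds the Option 2 rules to Python's standard-library urllib.robotparser and reports which agents may fetch a page. The helper name check_policy and the test URL are assumptions for illustration, and the parser's matching is simpler than what each vendor's crawler implements, so treat the result as a typo check rather than a guarantee of real crawler behavior.

```python
from urllib.robotparser import RobotFileParser

# The Option 2 policy, minus comments.
OPTION_2 = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def check_policy(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each agent name to True if the policy lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = [
    "GPTBot", "Google-Extended", "ClaudeBot",
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Bingbot",
]
# Hypothetical page URL; any path on the site works for a site-wide rule.
results = check_policy(OPTION_2, AGENTS, "https://www.yoursite.com/services/")
for agent, allowed in results.items():
    print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
```

The three training agents should come back blocked and everything else, including Bingbot via the wildcard group, should come back allowed.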

Option 3: block all AI crawlers

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Note what’s missing: Bingbot. Blocking it would also pull you out of Bing’s regular search results, which is rarely worth it just to stay out of Copilot. If you choose this option, you are opting out of an entire discovery channel, and re-earning citations takes time after you unblock.
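
If you maintain this list across several sites, generating the file from a list of agents keeps it consistent. This is a minimal sketch, not a required workflow: build_block_policy is a hypothetical helper, and the final check relies on Python's urllib.robotparser, which approximates but does not replicate each vendor's matching.

```python
from urllib.robotparser import RobotFileParser


def build_block_policy(blocked_agents: list[str]) -> str:
    """Emit one 'Disallow: /' group per agent, separated by blank lines."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    return "\n".join(groups)


AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "PerplexityBot", "Google-Extended",
]
policy = build_block_policy(AI_AGENTS)
print(policy)

# Confirm the listed agents are blocked and that Bingbot, which has no
# matching group and no wildcard rule, is still allowed by default.
parser = RobotFileParser()
parser.parse(policy.splitlines())
page = "https://www.yoursite.com/"  # hypothetical URL
assert not any(parser.can_fetch(agent, page) for agent in AI_AGENTS)
assert parser.can_fetch("Bingbot", page)
```

Because a missing rule means “allowed,” leaving Bingbot out of the list is all it takes to keep Bing search untouched.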

How to verify crawler behavior in your logs

Robots.txt is a request. Server logs are how you check who is listening. Access logs record every fetch with its user agent string, so you can see which bots hit which pages and when.

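On a Linux server running nginx, one command pulls the AI crawler traffic (Apache usually logs to /var/log/apache2/access.log or /var/log/httpd/access_log, so adjust the path for your setup). Google-Extended is left out of the pattern because it is a robots.txt token, not a user agent, so it never appears in logs: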

grep -iE "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot" /var/log/nginx/access.log
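
When you want counts per bot and page rather than raw lines, the same filter can be sketched in Python. This assumes the common "combined" access log layout, where the request appears as "GET /path HTTP/1.1". The function name count_bot_hits is hypothetical, and the sample lines use reserved documentation IP addresses and simplified user agent strings, not real traffic.

```python
import re
from collections import Counter

# Same bot names as the grep pattern, matched case-insensitively.
BOT_RE = re.compile(
    r"gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot", re.IGNORECASE
)
# Pulls the requested path out of the quoted request line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+)')


def count_bot_hits(log_lines):
    """Count (bot, path) pairs for lines that mention a known AI bot."""
    hits = Counter()
    for line in log_lines:
        bot = BOT_RE.search(line)
        request = REQUEST_RE.search(line)
        if bot and request:
            hits[(bot.group(0).lower(), request.group(1))] += 1
    return hits


# Illustrative lines only: made-up IPs and simplified user agent strings.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2025:10:00:00 +0000] "GET /services/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.10 - - [01/Jan/2025:10:05:00 +0000] "GET /services/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.11 - - [01/Jan/2025:10:06:00 +0000] "GET /about/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:07:00 +0000] "GET /services/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
]

for (bot, path), n in count_bot_hits(SAMPLE_LOG).most_common():
    print(f"{n:4}  {bot:15} {path}")
```

To run it against a real file, pass an open file handle instead of SAMPLE_LOG, since the function only needs an iterable of lines.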

Three things to check. First, whether blocked bots stop after you change robots.txt — give it a day or two, since crawlers cache the file. Second, which pages the search bots fetch most often — a signal of what’s likely getting cited. Third, whether the traffic is genuine. User agent strings can be spoofed, so verify suspicious hits against the IP ranges OpenAI and other vendors publish, or use reverse DNS for Bingbot.
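
The IP check in that third step can be sketched with Python's standard ipaddress module. The ranges below are placeholders taken from reserved documentation blocks, not any vendor's real ranges; you would replace them with the list the vendor currently publishes. The function name ip_in_published_ranges is likewise hypothetical.

```python
import ipaddress


def ip_in_published_ranges(ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


# PLACEHOLDER ranges (reserved documentation blocks). Swap in the
# vendor's published list before relying on the result.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

print(ip_in_published_ranges("192.0.2.10", PUBLISHED_RANGES))    # True
print(ip_in_published_ranges("198.51.100.7", PUBLISHED_RANGES))  # False
```

A hit that claims to be GPTBot but comes from an address outside the published ranges is most likely a scraper borrowing the name.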

If you’re on managed WordPress hosting without raw log access, your host’s portal usually exposes access logs or a bot-traffic report. Pair that with our guide on how to measure AI search visibility to confirm that crawl activity is turning into citations.

How Frostbite helps

Frostbite’s AI visibility service handles this end to end: crawler policy, structured data, content AI tools can cite, and ongoing monitoring of where your business appears in AI answers. We set the robots.txt posture that fits your business instead of pasting a generic template. Contact us for a straight answer on where you stand today.

Frequently asked questions

Does blocking GPTBot remove my site from ChatGPT answers?

Not directly. GPTBot governs training data collection. ChatGPT’s web search results are built by OAI-SearchBot, and user-requested page fetches come from ChatGPT-User. If your goal is to stay visible in ChatGPT while opting out of training, block GPTBot and leave the other two alone.

Will blocking Google-Extended hurt my Google rankings?

No. Google-Extended only controls whether your content is used for Gemini model training. It does not affect crawling for Google Search, your rankings, or whether your pages can appear in AI Overviews — all of that is governed by standard Googlebot directives.

Can I undo a block later without lasting damage?

Mostly, yes. Remove the Disallow lines and compliant crawlers resume fetching; AI search indexes pick your pages back up over time. The catch is the gap: while blocked, assistants answered questions in your category from other sources, and re-earning citations typically takes weeks, not days.