Robots.txt and AI crawlers: master your visibility in 2026

Olivier de Decker
27/5/2026

The robots.txt file has become, in 2026, the first lever for steering your visibility in AI engines. Properly configured, it lets GPTBot, ClaudeBot, PerplexityBot and Google-Extended index your content so it can be cited in ChatGPT, Claude, Perplexity or Gemini. Misconfigured, it simply makes you invisible in the synthesized answers now consulted by 800 million weekly users (Sam Altman, OpenAI DevDay, October 2025). Here is the complete guide to configuring your robots.txt and the new llms.txt protocol in 2026.

The bottom line

  • Blocking GPTBot, ClaudeBot or PerplexityBot in robots.txt means becoming invisible in ChatGPT, Claude and Perplexity, and potentially losing the 80% of users who rely on AI summaries (Bain & Company, 2025).
  • You need to distinguish three families of bots: training bots (model training), retrieval bots (real-time reading to answer) and agent bots (action initiated by a user).
  • The llms.txt protocol, proposed in 2024, is being adopted quickly: it complements robots.txt by exposing to LLMs a structured map of priority content.
  • According to our PingPrime audits across 27 Belgian sites, 41% still block GPTBot by default without realizing it, often due to a CMS theme or a poorly configured SEO plugin.

What is robots.txt for when it comes to AI crawlers?

Robots.txt is a text file placed at the root of a site (https://yoursite.com/robots.txt) that tells web robots what they can and cannot visit. With the arrival of LLMs, its role has changed: it is no longer just about managing Googlebot and Bingbot, but also controlling a dozen AI crawlers whose decisions now determine your visibility in 69% of searches that end without a click (SimilarWeb, 2025).

Historically, robots.txt served a single purpose: preventing Googlebot from crawling pages with no SEO value (cart, admin area, thank-you pages). In 2026, the file must handle three distinct use cases that need to be clearly understood before touching a single line.

  • Classic indexing (SEO): allowing or blocking Googlebot, Bingbot, DuckDuckBot for the traditional SERP.
  • AI model training: allowing or blocking GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, CCBot. These bots vacuum up content to train future versions of the models.
  • Real-time retrieval: allowing or blocking ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User, CopilotBot. These bots read your pages when a user asks a question, so the AI can cite you in its answer.

The distinction is crucial. Blocking a training bot does not have the same effect as blocking a retrieval bot. The first protects your content from being learned by future models. The second simply prevents you from appearing in ChatGPT's next answer to your customer. To dig deeper into this mechanism, see our dossier How AI chooses its sources: mechanisms and strategies.

Our field observation. Based on our 2025-2026 audits at PingPrime, across 27 Belgian sites audited, 41% still block GPTBot by default and 32% block ClaudeBot, most often without knowing it. The cause is almost always the same: an SEO plugin or a CMS theme that added a rule "for security" during a migration. The result: the brand is invisible in ChatGPT and Claude while paying a substantial GEO content budget.

Which AI crawlers should you know in 2026?

A dozen AI crawlers are enough to cover most generative search in 2026. Not all carry the same strategic weight: ChatGPT concentrates 800M weekly users (OpenAI, October 2025) and Perplexity 780M monthly queries (Perplexity, May 2025), while other bots like Bytespider or Amazonbot are peripheral for a B2B European brand. Here is the up-to-date list.

User-agent | Company | Main use | GEO impact if blocked
GPTBot | OpenAI | Training of GPT models | No direct impact on citations, but loss of long-term memory in future models
ChatGPT-User | OpenAI | Retrieval initiated by a user in ChatGPT | Critical: your site is no longer read in real time by ChatGPT
OAI-SearchBot | OpenAI | Indexing for SearchGPT and the integrated engine | Critical: disappearance from the SearchGPT index
ClaudeBot | Anthropic | Claude retrieval and training | Critical: invisibility in Claude and claude.ai
anthropic-ai | Anthropic | Legacy Anthropic crawler | Still active, allow for consistency
PerplexityBot | Perplexity | Perplexity indexing | Critical: disappearance from the Perplexity index
Perplexity-User | Perplexity | Real-time retrieval when a user asks a question | Critical: no citation in Perplexity answers
Google-Extended | Google | Gemini training and AI features improvement | Indirect: affects Gemini and certain AI Overviews uses
Bytespider | ByteDance (TikTok) | Doubao training and ByteDance AI | Marginal for EU/B2B audience, relevant for Asia/Gen Z audience
CCBot | Common Crawl | Public dataset used by many open source LLMs | Indirect but extensive: feeds Mistral, LLaMA, etc.
Bingbot and CopilotBot | Microsoft | Bing indexing + Microsoft 365 Copilot | Impact on Copilot, Bing Chat and the Microsoft ecosystem
Applebot-Extended | Apple | Apple Intelligence training | Relevant for iOS audience and Siri
Amazonbot | Amazon | Powers Alexa and Rufus (shopping assistant) | Relevant for e-commerce and large retail

Three families to remember. Retrieval bots (ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User) are the most critical: blocking them amounts to forbidding your brand from appearing in live AI answers. Training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) have a more diffuse but long-term impact: they determine what the models "know" about you. Emerging agent bots (which act on behalf of a user, as in OpenAI Operator) are worth watching in 2026 but still carry little volume.
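To make the three families concrete, here is a minimal Python sketch that maps a User-Agent header to the family described above. The mapping simply restates this article's classification and is not an official registry; the sample User-Agent strings are illustrative assumptions, and `bot_family` is a hypothetical helper name.

```python
from typing import Optional

# Families as described in this article. The mapping is illustrative,
# not an official registry published by the vendors.
BOT_FAMILIES = {
    "ChatGPT-User": "retrieval",
    "OAI-SearchBot": "retrieval",
    "PerplexityBot": "retrieval",
    "Perplexity-User": "retrieval",
    "GPTBot": "training",
    "ClaudeBot": "training",  # the article notes ClaudeBot also serves retrieval
    "Google-Extended": "training",
    "CCBot": "training",
}


def bot_family(user_agent: str) -> Optional[str]:
    """Return the family of the first known bot token found in a User-Agent header."""
    lowered = user_agent.lower()
    for token, family in BOT_FAMILIES.items():
        if token.lower() in lowered:
            return family
    return None  # not one of the AI crawlers listed above


# Illustrative header values (assumed, not copied from vendor documentation).
print(bot_family("Mozilla/5.0 (compatible; GPTBot/1.2)"))
print(bot_family("Mozilla/5.0 (compatible; Perplexity-User/1.0)"))
```

A lookup like this is handy when tagging log lines or analytics rows, so that training and retrieval traffic can be reported separately.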

To understand which sources each platform favors, read our comparison ChatGPT Search vs Google AI Overviews vs Perplexity.

How to configure robots.txt to allow AI crawlers?

The recommended configuration in 2026 is to explicitly allow all major AI crawlers on your public content, and only block sensitive areas (customer portals, carts, accounts). It is the default posture of most sites ranking well in AI answers. According to Bain & Company's findings, 80% of users rely on AI summaries for at least 40% of their queries (Bain & Company, 2025): cutting yourself off from them is no longer a neutral option.

Recommended configuration: allow all major AI crawlers

Here is the block to add to your robots.txt to give access to all relevant AI crawlers in 2026, while blocking the usual private areas.

# --- robots.txt PingPrime: GEO 2026 configuration ---
# A crawler follows only the most specific group that matches its name
# (RFC 9309). The named bots therefore share one group with the private-area
# rules; in separate "Allow: /" groups they would ignore the Disallow lines.

# Classic search: Googlebot, Bingbot
# OpenAI: GPTBot (training), ChatGPT-User and OAI-SearchBot (retrieval)
# Anthropic: ClaudeBot and anthropic-ai
# Perplexity: PerplexityBot (index) and Perplexity-User (retrieval)
# Google Gemini training: Google-Extended
# Microsoft Copilot: CopilotBot
# Common Crawl (feeds several open source LLMs): CCBot
# Apple Intelligence: Applebot-Extended
# Amazon (Alexa, Rufus): Amazonbot
# ByteDance (allow or block depending on your audience): Bytespider
User-agent: Googlebot
User-agent: Bingbot
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: CopilotBot
User-agent: CCBot
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: Bytespider
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /account/
Disallow: /checkout/
Disallow: /*?*sessionid=
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
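Group precedence is worth verifying before deploying a file like this. Under RFC 9309, a crawler that matches a named group follows that group only and ignores the `User-agent: *` group, so private-area rules placed only under `*` do not apply to a bot that has its own `Allow: /` group. The sketch below illustrates this with Python's standard `urllib.robotparser`; the two miniature files and the yoursite.com URLs are assumptions made for the demonstration, and `parse_robots` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Variant A: the bot has its own group, private areas live under "*".
SEPARATE_GROUPS = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""

# Variant B: the bot shares one group with "*", so it inherits the Disallow.
# Disallow is listed first so that first-match and longest-match parsers agree.
SHARED_GROUP = """\
User-agent: GPTBot
User-agent: *
Disallow: /admin/
Allow: /
"""

separate = parse_robots(SEPARATE_GROUPS)
shared = parse_robots(SHARED_GROUP)
admin_url = "https://yoursite.com/admin/"

gptbot_admin_separate = separate.can_fetch("GPTBot", admin_url)
gptbot_admin_shared = shared.can_fetch("GPTBot", admin_url)

print("Separate groups, GPTBot may fetch /admin/:", gptbot_admin_separate)
print("Shared group, GPTBot may fetch /admin/:", gptbot_admin_shared)
```

If the first line reports that the bot may fetch the admin path, the private-area rules need to be repeated in (or merged into) the bot's own group.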

Restrictive configuration: block training but keep retrieval

Some brands (media, publishers, premium sites) want to block model training while allowing real-time reading to remain citable. This posture is coherent: you refuse to feed future models with your content, but you let retrieval bots read your pages when a user asks a question.

# --- robots.txt — block training, allow retrieval ---

# OpenAI: block training, allow retrieval
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic: block training (ClaudeBot does both, careful)
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Perplexity: allow (Perplexity does not do training)
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Google Gemini training: block
User-agent: Google-Extended
Disallow: /

# Common Crawl: block (often used for training)
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Caution: Anthropic's ClaudeBot is used for both training and retrieval. Blocking it amounts to cutting yourself off from Claude entirely. This is a business trade-off to validate upfront. To better understand Claude's uses, read our guide How to appear in Claude AI.

Aggressive configuration: block everything (to be avoided in 99% of cases)

The "block everything" block is technically simple, but it amounts to choosing total invisibility in AI engines. We advise against it except in very specific cases (internal sites, intranets, content under strict NDA).

# --- To avoid: total invisibility in AIs ---

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Blocking or allowing AI crawlers: what are the strategic implications?

The choice between blocking and allowing AI crawlers is not neutral. According to Adobe Analytics, retail traffic from AI sources surged 693% year-over-year during the 2025 holiday season (Digital Commerce 360, January 2026), and these visitors convert 31% better than visitors from other sources. Blocking amounts to closing the fastest-growing acquisition channel of the moment.

Why most brands should allow

For the vast majority of companies (B2B, e-commerce, SaaS, services, SMBs), the logic is clear: allow. Three main reasons justify it.

  • AI traffic converts massively better. ChatGPT referrers convert at 15.9% versus 1.76% for organic Google according to Search Engine Land, 2025, a 9-to-1 ratio.
  • The opportunity cost is massive. With 800M ChatGPT users and 780M Perplexity queries per month, refusing to appear in AI answers means withdrawing from a growth channel.
  • Content stays public anyway. If your pages are accessible to Googlebot, they are accessible to humans, and therefore to LLMs through other channels (indirect datasets, captures, citations on third-party sites).

Legitimate cases for blocking (or partial blocking)

Some cases justify targeted blocking. They are a minority but real.

  • Media and paid publishers negotiating licensing contracts with OpenAI or Anthropic and wanting to maximize their commercial leverage.
  • Brands with high-value proprietary content (patented methodologies, exclusive research, qualified databases) where absorption by a model poses a competitive risk.
  • Sites subject to strict GDPR or sensitive content (health, finance, personal data) where distribution via an LLM generates compliance risk.
  • Sites undergoing GEO redesign where you temporarily block to prevent a bad version from being indexed by LLMs.

Our field reading. For the brands we support, the "block or allow" decision is made in less than an hour when the question is properly framed. The practical rule: if your GEO strategy aims to be cited by AIs, you allow. If it aims to monetize rare content, you negotiate and block in the meantime. But the "I block by default without thinking" option is almost always a technical accident, not a strategic choice.

To frame this choice within a global roadmap, see our complete GEO audit method: complete guide to evaluating your AI visibility. And if you want to discuss it with a human, our team offers a 12-week GEO support program.

What is llms.txt and should you adopt it in 2026?

llms.txt is a protocol proposed in September 2024 by Jeremy Howard (Answer.AI) to give LLMs a structured and hierarchical map of a site's content. Where robots.txt says what bots can visit, llms.txt says what they should read in priority. The format is being adopted quickly: according to the public llmstxt.directory tracker, more than 2,000 sites had published an llms.txt by the end of 2025, and the pace accelerated in the first quarter of 2026.

llms.txt format and example

The llms.txt file, placed at the root of the site (https://yoursite.com/llms.txt), follows a lightweight Markdown format designed to be readable by both a human and an LLM. It lists key pages as commented links.

# PingPrime.ai

> PingPrime is a Belgian consulting firm specializing in GEO (Generative Engine
> Optimization). We help brands become citable
> by ChatGPT, Perplexity, Claude and Google AI Overviews.

## Main documentation

- [What is GEO](https://www.pingprime.ai/blog-presse/qu-est-ce-que-le-geo): complete guide to Generative Engine Optimization
- [GEO glossary](https://www.pingprime.ai/glossaire-geo): reference vocabulary
- [Optimization for AI engines](https://www.pingprime.ai/blog-presse/optimisation-moteurs-ia): 2026 pillar guide

## Practical guides

- [Answer-First page](https://www.pingprime.ai/blog-presse/page-answer-first-ia): structuring a page to be cited
- [Schema Markup for GEO](https://www.pingprime.ai/blog-presse/schema-markup-geo-guide): structured markup
- [Robots.txt and AI crawlers](https://www.pingprime.ai/blog-presse/robots-txt-crawlers-ia): 2026 configuration

## Studies and data

- [7 in 10 Belgians use AI](https://www.pingprime.ai/blog-presse/7-belges-10-ia-generative-etude): Semactic + PingPrime study
- [State of GEO in 2026](https://www.pingprime.ai/blog-presse/etat-geo-2026-tendances): trends and figures

## Optional

- [About](https://www.pingprime.ai/a-propos): team and founders
- [Case studies](https://www.pingprime.ai/etude-de-cas): client results

The syntax is simple. An H1 title (the site name), a description paragraph, thematic H2 sections, and bulleted links with a short description. The ## Optional section lists secondary pages an LLM can ignore if context-limited.
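Because the format is this regular, the file can be generated from a small data structure rather than maintained by hand. The following is a minimal sketch in Python, assuming the structure described above (H1, quoted description, H2 sections, annotated links); `render_llms_txt` and the example URLs are hypothetical.

```python
from typing import Dict, List, Tuple

# (link title, URL, short description)
Link = Tuple[str, str, str]


def render_llms_txt(site_name: str, summary: str, sections: Dict[str, List[Link]]) -> str:
    """Render an llms.txt document: H1, quoted summary, then H2 sections of links."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical content for illustration only.
document = render_llms_txt(
    "Example Site",
    "Example Site publishes guides on generative engine optimization.",
    {
        "Practical guides": [
            ("Robots.txt guide", "https://yoursite.com/robots-guide", "crawler configuration"),
        ],
        "Optional": [
            ("About", "https://yoursite.com/about", "team and founders"),
        ],
    },
)
print(document)
```

Generating the file from the same source as your navigation or sitemap keeps the monthly update cheap and avoids dead links.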

Variant: llms-full.txt

The llms-full.txt (or llms.full.txt) format goes further: it directly includes the complete content of key pages, in Markdown, in a single file. This is useful for LLMs that cannot follow links, or for agents wanting to ingest a site in one request. Anthropic, Cloudflare, Mintlify, Stripe and many technical SaaS already use it.
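As a rough illustration, an llms-full.txt can be assembled by concatenating the Markdown of each key page under its own heading. The layout below (one H2 per page followed by a source line) is an assumption, since the exact structure varies between publishers; `render_llms_full` and the sample pages are hypothetical.

```python
from typing import List, Tuple

# (page title, canonical URL, Markdown body of the page)
Page = Tuple[str, str, str]


def render_llms_full(site_name: str, pages: List[Page]) -> str:
    """Concatenate full page contents into a single Markdown document."""
    parts = [f"# {site_name}"]
    for title, url, body in pages:
        # Assumed layout: page title as H2, then its source URL, then the content.
        parts.append(f"## {title}\n\nSource: {url}\n\n{body.strip()}")
    return "\n\n".join(parts) + "\n"


# Hypothetical pages for illustration only.
full_document = render_llms_full(
    "Example Site",
    [
        ("What is GEO", "https://yoursite.com/geo", "GEO means optimizing for AI answers.\n"),
        ("Glossary", "https://yoursite.com/glossary", "- Retrieval bot: reads pages live.\n"),
    ],
)
print(full_document)
```

Keep an eye on the resulting file size: a single-request format is only useful if the whole file fits in the context window of the model or agent reading it.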

Should you adopt it in 2026?

Our recommendation at PingPrime: yes, but without expecting miracles in the short term. The protocol is not (yet) officially supported by OpenAI, Google or Anthropic. There is therefore no guarantee that it influences AI citations today. But three reasons make it a reasonable investment.

  • Near-zero cost: a 2 to 5 KB text file to put at the root, updated monthly.
  • Bottom-up adoption: Mintlify, Anthropic, Cloudflare, Vercel, Stripe, Hugging Face and many tech players have adopted the format. The standard is being built through practice.
  • Strong symbolic benefit: clear signal of intent for the models, highlighting of priority content, useful base for AI agents seeking to understand your site.

Do not replace your robots.txt with an llms.txt. The two play complementary roles: robots.txt manages rights, llms.txt suggests priorities. See our guide to optimizing for AI engines to integrate these files into a complete GEO strategy.

How to verify that your site is properly crawled by AIs?

Verification happens in three steps: test the robots.txt, read server logs, and monitor actual citations. According to BrightEdge, 48% of Google queries trigger an AI Overview on average at the end of 2025, with peaks at 88% in health and 82% in B2B Tech (BrightEdge, 2025). Without monitoring AI bot visits on your pages, you are flying blind.

Step 1: test robots.txt with a parser

Before anything else, verify what your robots.txt actually allows. Three free tools are enough.

  • Google Search Console: use the "Crawl stats" report to see if Googlebot actually accesses your key pages (it covers Google's crawlers only, not third-party AI bots).
  • Robots.txt Tester (technicalseo.com): lets you simulate any user-agent (GPTBot, ClaudeBot, PerplexityBot) and verify whether a URL is allowed.
  • Command-line curl: curl https://yoursite.com/robots.txt to retrieve the actual file served by your server (sometimes different from what the CMS says).
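The same check can be scripted. Below is a minimal sketch using Python's standard `urllib.robotparser` to report which AI crawlers may fetch a given URL. The sample robots.txt, the bot list and the yoursite.com URL are assumptions for illustration, and `audit_robots` is a hypothetical helper. Note that the standard-library parser is simpler than the crawlers' own parsers and may not interpret wildcard patterns such as `/*?*sessionid=` the way they do, so treat the result as a first pass.

```python
from typing import Dict, Iterable
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "PerplexityBot", "Perplexity-User", "Google-Extended", "CCBot",
]


def audit_robots(robots_txt: str, url: str, bots: Iterable[str] = AI_BOTS) -> Dict[str, bool]:
    """Return {bot name: True if robots_txt lets that bot fetch url}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# Illustrative file: training blocked for GPTBot, everything else open.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /admin/
"""

report = audit_robots(SAMPLE, "https://yoursite.com/blog/")
blocked = [bot for bot, allowed in report.items() if not allowed]
print("Blocked on /blog/:", blocked)
```

To audit a live site instead of a string, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before `can_fetch()` is called.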

Step 2: analyze server logs

Server logs are the ground truth. They record every visit from AI bots with their user-agent. On Apache or Nginx, filter logs on the following strings:

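# Search for visits from the main AI bots in the Nginx access log
# (Google-Extended is left out: it is a robots.txt control token only,
# and Google crawls with its regular Googlebot user-agent strings)
grep -E "GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|PerplexityBot|Perplexity-User|CCBot" /var/log/nginx/access.log

# Count hits per bot across the whole file (point it at logs covering the last 30 days);
# "+0" prints 0 instead of an empty value when a bot never appears
awk '/GPTBot/ {gpt++} /ClaudeBot/ {claude++} /PerplexityBot/ {perp++} END {print "GPTBot:", gpt+0, "ClaudeBot:", claude+0, "PerplexityBot:", perp+0}' /var/log/nginx/access.log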

On a healthy site, you should see regular visits from GPTBot and CCBot (training, weekly), more frequent visits from ChatGPT-User and Perplexity-User (retrieval, triggered by user queries), and visits from PerplexityBot (continuous indexing). The total absence of a bot for 30 days is a red flag.
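To apply the 30-day window explicitly, the count can be restricted by timestamp. Here is a minimal Python sketch, assuming the default "combined" access-log format in which each line carries a timestamp such as `[20/May/2026:10:15:32 +0000]`; the sample lines, IP addresses and the `count_ai_hits` helper are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable

AI_BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "anthropic-ai", "PerplexityBot", "Perplexity-User", "CCBot",
]

# Timestamp as written by the "combined" log format (assumed here).
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def count_ai_hits(lines: Iterable[str], now: datetime, days: int = 30) -> Counter:
    """Count log lines per AI bot token, keeping only the last `days` days."""
    cutoff = now - timedelta(days=days)
    hits: Counter = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # skip lines that do not follow the assumed format
        seen_at = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if seen_at < cutoff:
            continue
        lowered = line.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break
    return hits


# Hypothetical log lines for illustration.
sample_log = [
    '203.0.113.5 - - [20/May/2026:10:15:32 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [01/Jan/2026:08:00:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.7 - - [25/May/2026:09:00:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.2 - - [26/May/2026:12:00:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]

hits = count_ai_hits(sample_log, now=datetime(2026, 5, 27, tzinfo=timezone.utc))
for token in AI_BOT_TOKENS:
    print(f"{token}: {hits[token]}")
```

On a real server, pass an open file object (for example the Nginx access log) and `datetime.now(timezone.utc)`; a bot whose count stays at 0 over the window is the red flag described above.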

Step 3: monitor actual citations

Bot presence is necessary but not sufficient. You also need to verify whether your pages are actually cited in AI answers. Several approaches exist for this.

  • Monthly manual tests on 50 to 100 strategic queries in your category, in ChatGPT, Perplexity, Google AI Overviews and Claude.
  • Specialized monitoring tools: Profound, Otterly, AthenaHQ, Peec.ai, or the in-house solutions we develop at PingPrime.
  • Referrer tracking in GA4: filter sessions coming from chatgpt.com, perplexity.ai, claude.ai, copilot.microsoft.com, gemini.google.com.
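For the referrer approach, the grouping logic can be prototyped outside GA4 on any export of referrer URLs. Here is a minimal Python sketch that labels a referrer with its AI source, using the five hostnames listed above; `ai_source` is a hypothetical helper and the sample referrers are assumptions.

```python
from typing import Optional
from urllib.parse import urlparse

# Hostnames taken from the list above, mapped to a reporting label.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
    "copilot.microsoft.com": "Copilot",
    "gemini.google.com": "Gemini",
}


def ai_source(referrer: str) -> Optional[str]:
    """Return the AI engine a referrer URL belongs to, or None for other traffic."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, label in AI_REFERRER_HOSTS.items():
        # Match the domain itself or any subdomain (e.g. www.perplexity.ai).
        if host == domain or host.endswith("." + domain):
            return label
    return None


# Hypothetical referrers for illustration.
for ref in [
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=geo",
    "https://www.google.com/",
]:
    print(ref, "->", ai_source(ref))
```

Matching on the parsed hostname rather than on a raw substring avoids counting look-alike domains as AI traffic.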

Our field observation. Based on our audits, the lag between "my robots.txt allows the bots" and "my pages are actually cited by AIs" is several weeks, sometimes more. Count 4 to 8 weeks after a robots.txt overhaul to observe the first citations in Perplexity, and 8 to 16 weeks for ChatGPT. Steering must therefore be patient and monthly, not weekly.

To set up a complete tracking approach, see our complete guide to monitoring AI citations and our free tools page that brings together several steering resources.

Frequently asked questions about robots.txt and AI crawlers

Does blocking GPTBot really prevent ChatGPT from citing me?

Not immediately, but yes in the medium term. GPTBot is used to train future versions of GPT. Blocking it prevents your content from being absorbed into the next generation of the model. For real-time retrieval, watch out for ChatGPT-User and OAI-SearchBot. With 800M weekly users (OpenAI, October 2025), blocking those last two cuts you off from the live citation flow.

Is robots.txt legally binding?

No. Robots.txt is a voluntary standard, not a law. Major players (OpenAI, Anthropic, Google, Perplexity) publicly commit to respecting it, but other scrapers, including some that collect data for open source models, can ignore it. For real legal protection, you need to combine robots.txt, legal notices (Terms of Service prohibiting AI scraping) and possibly technical measures (rate limiting, WAF, Cloudflare AI Labyrinth). According to Bain & Company, 60% of searches end without a click in 2025, which makes the defensive posture increasingly costly.

What is the difference between Google-Extended and Googlebot?

Googlebot indexes the web for classic Google search (and feeds AI Overviews via this index). Google-Extended is a distinct robots.txt token that controls the use of content for training Gemini and improving AI features; it does not crawl under its own user-agent string, so it never shows up in server logs. Blocking Google-Extended does not affect your Google SEO or your appearance in AI Overviews in the short term, but limits Gemini's learning about your brand. To understand the Gemini and AI Mode stakes, see our dossier Google AI Mode: what it changes for your visibility.

Should you publish an llms.txt even if you already have a sitemap.xml?

Yes, they are complementary. Sitemap.xml lists all URLs for search engines, with no value hierarchy. Llms.txt selects priority pages for LLMs, with a natural language description. For a brand seeking to be cited, llms.txt acts as an "FAQ for LLMs." On the Answer-First format that maximizes your chances of citation, see our guide Structuring an Answer-First page to be cited by AI.

Conclusion: robots.txt has become a top-tier GEO tool

In 2026, configuring your robots.txt correctly is no longer a sysadmin task; it is a strategic marketing decision. Blocking GPTBot, ClaudeBot or PerplexityBot means choosing invisibility in channels that now draw 800M weekly ChatGPT users, 780M monthly Perplexity queries and nearly 50% of Google queries triggering an AI Overview. Conversely, allowing these bots while publishing a clear llms.txt positions you as a candidate source for citation.

First concrete action to take this week: retrieve your current robots.txt (curl https://yoursite.com/robots.txt), verify that no major AI crawler is accidentally blocked, and publish a minimal llms.txt listing your 10 to 15 strategic pages.

To go further, two resources: our complete GEO audit guide covering robots.txt, llms.txt, schema and editorial structure, and our pillar guide to optimization for AI engines. If you want to audit your configuration with our team, contact PingPrime.
