
The robots.txt file has become, in 2026, the first lever for steering your visibility in AI engines. Properly configured, it lets GPTBot, ClaudeBot, PerplexityBot and Google-Extended index your content so it can be cited in ChatGPT, Claude, Perplexity or Gemini. Misconfigured, it simply makes you invisible in the synthesized answers now consulted by 800 million weekly users (Sam Altman, OpenAI DevDay, October 2025). Here is the complete guide to configuring your robots.txt and the new llms.txt protocol in 2026.
The bottom line
Robots.txt is a text file placed at the root of a site (https://yoursite.com/robots.txt) that tells web robots what they can and cannot visit. With the arrival of LLMs, its role has changed: it is no longer just about managing Googlebot and Bingbot, but also controlling a dozen AI crawlers whose decisions now determine your visibility in 69% of searches that end without a click (SimilarWeb, 2025).
Historically, robots.txt served a single purpose: keeping Googlebot from crawling pages with no SEO value (cart, admin area, thank-you pages). In 2026, the file must handle three distinct use cases that need to be clearly understood before touching a single line.
The distinction is crucial. Blocking a training bot does not have the same effect as blocking a retrieval bot. The first protects your content from being learned by future models. The second simply prevents you from appearing in ChatGPT's next answer to your customer. To dig deeper into this mechanism, see our dossier How AI chooses its sources: mechanisms and strategies.
Our field observation. Based on our 2025-2026 audits at PingPrime, across 27 Belgian sites audited, 41% still block GPTBot by default and 32% block ClaudeBot, most often without knowing it. The cause is almost always the same: an SEO plugin or a CMS theme that added a rule "for security" during a migration. The result: the brand is invisible in ChatGPT and Claude while paying a substantial GEO content budget.
A dozen AI crawlers are enough to cover most generative search in 2026. Not all carry the same strategic weight: ChatGPT concentrates 800M weekly users (OpenAI, October 2025) and Perplexity 780M monthly queries (Perplexity, May 2025), while other bots like Bytespider or Amazonbot are peripheral for a B2B European brand. Here is the up-to-date list.
Three families to remember. Retrieval bots (ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User) are the most critical: blocking them amounts to forbidding your brand from appearing in live AI answers. Training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) have a more diffuse but long-term impact: they determine what the models "know" about you. Emerging agent bots (which act on behalf of a user, as in OpenAI Operator) are worth watching in 2026 but still carry little volume.
To understand which sources each platform favors, read our comparison ChatGPT Search vs Google AI Overviews vs Perplexity.
The recommended configuration in 2026 is to explicitly allow all major AI crawlers on your public content and to block only sensitive areas (customer portals, carts, accounts). This is the default posture of most sites that rank well in AI answers. According to Bain & Company (2025), 80% of users rely on AI summaries for at least 40% of their queries: cutting yourself off from them is no longer a neutral option.
Here is the block to add to your robots.txt to give access to all relevant AI crawlers in 2026, while blocking the usual private areas.
# --- robots.txt PingPrime: GEO 2026 configuration ---
# A crawler obeys only the group that names it and ignores "User-agent: *"
# once it is named, so all bots share one group that carries the private-area rules.
# Classic Google and Bing bots
User-agent: Googlebot
User-agent: Bingbot
# OpenAI: GPTBot (training), ChatGPT-User and OAI-SearchBot (retrieval)
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
# Anthropic: ClaudeBot and anthropic-ai
User-agent: ClaudeBot
User-agent: anthropic-ai
# Perplexity: PerplexityBot (index) and Perplexity-User (retrieval)
User-agent: PerplexityBot
User-agent: Perplexity-User
# Google Gemini training
User-agent: Google-Extended
# Microsoft Copilot
User-agent: CopilotBot
# Common Crawl (feeds several open source LLMs)
User-agent: CCBot
# Apple Intelligence
User-agent: Applebot-Extended
# Amazon (Alexa, Rufus)
User-agent: Amazonbot
# ByteDance (move to its own "Disallow: /" group to block it, depending on your audience)
User-agent: Bytespider
# Every other bot
User-agent: *
# Private areas first, then everything else stays open
Disallow: /admin/
Disallow: /cart/
Disallow: /account/
Disallow: /checkout/
Disallow: /*?*sessionid=
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
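One matching detail is worth checking before deploying a file of this kind: a crawler obeys only the group that names it, and falls back to `User-agent: *` only when no named group matches. The sketch below uses Python's standard `urllib.robotparser` on two deliberately tiny, hypothetical files (not the full configuration) to show the effect; `SomeOtherBot` and the helper `is_allowed` are made-up names. This parser applies the first matching rule inside a group, whereas RFC 9309 crawlers apply the longest match, so treat it as a rough check rather than a faithful model of any particular crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: GPTBot has its own group, private areas live under "*".
SEPARATE_GROUPS = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""

# Hypothetical repair: GPTBot shares the group that carries the Disallow rule.
SHARED_GROUP = """\
User-agent: GPTBot
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for label, text in (("separate", SEPARATE_GROUPS), ("shared", SHARED_GROUP)):
        for agent in ("GPTBot", "SomeOtherBot"):
            print(label, agent, "/admin/", is_allowed(text, agent, "/admin/"))
```

With separate groups, the named bot is not bound by the `/admin/` rule; with a shared group, it is. The same check can be pointed at your own file before each deployment.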
Some brands (media, publishers, premium sites) want to block model training while allowing real-time reading to remain citable. This posture is coherent: you refuse to feed future models with your content, but you let retrieval bots read your pages when a user asks a question.
# --- robots.txt — block training, allow retrieval ---
# OpenAI: block training, allow retrieval
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Anthropic: block training (ClaudeBot does both, careful)
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
# Perplexity: allow (Perplexity does not do training)
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
# Google Gemini training: block
User-agent: Google-Extended
Disallow: /
# Common Crawl: block (often used for training)
User-agent: CCBot
Disallow: /
User-agent: *
Allow: /
Caution: Anthropic's ClaudeBot is used for both training and retrieval. Blocking it amounts to cutting yourself off from Claude entirely. This is a business trade-off to validate upfront. To better understand Claude's uses, read our guide How to appear in Claude AI.
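The training/retrieval split can be checked mechanically before it goes live. The following is a minimal sketch, assuming the bot families used in this article; `build_robots_txt` and `root_access` are hypothetical helper names, and the check relies on Python's standard `urllib.robotparser`, which only approximates how each real crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Bot families as described in this article (adjust to your own policy).
TRAINING_BOTS = ("GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended", "CCBot")
RETRIEVAL_BOTS = ("ChatGPT-User", "OAI-SearchBot", "PerplexityBot", "Perplexity-User")


def build_robots_txt(blocked, allowed) -> str:
    """Generate one group per bot: Disallow for `blocked`, Allow for `allowed`."""
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def root_access(robots_txt: str, agents) -> dict:
    """Map each agent name to True/False: may it fetch the site root?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, "/") for agent in agents}


if __name__ == "__main__":
    robots = build_robots_txt(TRAINING_BOTS, RETRIEVAL_BOTS)
    for agent, ok in root_access(robots, TRAINING_BOTS + RETRIEVAL_BOTS).items():
        print(f"{agent:16} {'allowed' if ok else 'blocked'}")
```

Keeping the two families in named tuples makes the business trade-off explicit: moving ClaudeBot from one tuple to the other is a one-line change that the check then confirms.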
The "block everything" block is technically simple, but it amounts to choosing total invisibility in AI engines. We advise against it except in very specific cases (internal sites, intranets, content under strict NDA).
# --- To avoid: total invisibility in AIs ---
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
The choice between blocking and allowing AI crawlers is not neutral. According to Adobe Analytics, retail traffic from AI sources rose 693% year-over-year during the 2025 holiday season (Digital Commerce 360, January 2026), and these visitors convert 31% better than visitors from other sources. Blocking amounts to closing the fastest-growing acquisition channel of the moment.
For the vast majority of companies (B2B, e-commerce, SaaS, services, SMBs), the logic is clear: allow. Three main reasons justify it.
Some cases justify targeted blocking. They are a minority but real.
Our field reading. For the brands we support, the "block or allow" decision is made in less than an hour when the question is properly framed. The practical rule: if your GEO strategy aims to be cited by AIs, you allow. If it aims to monetize rare content, you negotiate and block in the meantime. But the "I block by default without thinking" option is almost always a technical accident, not a strategic choice.
To frame this choice within a global roadmap, see our complete GEO audit method: complete guide to evaluating your AI visibility. And if you want to discuss it with a human, our team offers a 12-week GEO support program.
llms.txt is a protocol proposed in September 2024 by Jeremy Howard (Answer.AI) to give LLMs a structured and hierarchical map of a site's content. Where robots.txt says what bots can visit, llms.txt says what they should read in priority. The format is being adopted quickly: according to the public llmstxt.directory tracker, more than 2,000 sites had published an llms.txt by the end of 2025, and the pace accelerated in the first quarter of 2026.
The llms.txt file, placed at the root of the site (https://yoursite.com/llms.txt), follows a lightweight Markdown format designed to be readable by both a human and an LLM. It lists key pages as links, each followed by a short description.
# PingPrime.ai
> PingPrime is a Belgian consulting firm specializing in GEO (Generative Engine
> Optimization). We help brands become citable
> by ChatGPT, Perplexity, Claude and Google AI Overviews.
## Main documentation
- [What is GEO](https://www.pingprime.ai/blog-presse/qu-est-ce-que-le-geo): complete guide to Generative Engine Optimization
- [GEO glossary](https://www.pingprime.ai/glossaire-geo): reference vocabulary
- [Optimization for AI engines](https://www.pingprime.ai/blog-presse/optimisation-moteurs-ia): 2026 pillar guide
## Practical guides
- [Answer-First page](https://www.pingprime.ai/blog-presse/page-answer-first-ia): structuring a page to be cited
- [Schema Markup for GEO](https://www.pingprime.ai/blog-presse/schema-markup-geo-guide): structured markup
- [Robots.txt and AI crawlers](https://www.pingprime.ai/blog-presse/robots-txt-crawlers-ia): 2026 configuration
## Studies and data
- [7 in 10 Belgians use AI](https://www.pingprime.ai/blog-presse/7-belges-10-ia-generative-etude): Semactic + PingPrime study
- [State of GEO in 2026](https://www.pingprime.ai/blog-presse/etat-geo-2026-tendances): trends and figures
## Optional
- [About](https://www.pingprime.ai/a-propos): team and founders
- [Case studies](https://www.pingprime.ai/etude-de-cas): client results
The syntax is simple: an H1 title (the site name), a short description written as a blockquote, thematic H2 sections, and bulleted links with a short description. The ## Optional section lists secondary pages an LLM can ignore if its context is limited.
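To make the structure concrete, here is a minimal sketch of a reader for this layout. It is an illustration under assumptions, not an official parser: `parse_llms_txt`, `priority_links` and the sample content are hypothetical, and the regular expression only handles the simple `- [title](url): note` form shown above.

```python
import re

# One bulleted link, with an optional ": description" suffix.
LINK_RE = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$")

SAMPLE = """\
# Example Site
> One-paragraph summary of what the site covers,
> split over two lines.

## Guides
- [Getting started](https://example.com/start): first steps
- [Reference](https://example.com/reference)

## Optional
- [About](https://example.com/about): team page
"""


def parse_llms_txt(text: str) -> dict:
    """Split an llms.txt document into title, summary and per-section links."""
    doc = {"title": None, "summary": [], "sections": {}}
    current = None  # name of the H2 section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif line.startswith(">") and current is None:
            doc["summary"].append(line.lstrip("> ").strip())
        else:
            match = LINK_RE.match(line)
            if match and current is not None:
                doc["sections"][current].append(match.groupdict())
    doc["summary"] = " ".join(doc["summary"])
    return doc


def priority_links(doc: dict) -> list:
    """Return every link except those under the 'Optional' section."""
    return [
        link
        for name, links in doc["sections"].items()
        if name.lower() != "optional"
        for link in links
    ]


if __name__ == "__main__":
    parsed = parse_llms_txt(SAMPLE)
    print(parsed["title"], "|", parsed["summary"])
    for link in priority_links(parsed):
        print(link["title"], "->", link["url"])
```

A reader like this is also a convenient lint step: if a section comes back empty or a link has no description, the file probably needs another pass.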
The llms-full.txt (or llms.full.txt) format goes further: it directly includes the complete content of key pages, in Markdown, in a single file. This is useful for LLMs that cannot follow links, or for agents that want to ingest a site in one request. Anthropic, Cloudflare, Mintlify, Stripe and many technical SaaS vendors already use it.
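As a rough illustration of the single-file idea, the sketch below concatenates already-converted Markdown pages into one document. The layout it produces (an H2 per page followed by a `Source:` line) is an assumption made for this example, not a documented requirement of the format, and `build_llms_full` is a hypothetical helper.

```python
def build_llms_full(site_name: str, summary: str, pages) -> str:
    """Assemble one Markdown file from (title, url, markdown_body) tuples."""
    parts = [f"# {site_name}", "", f"> {summary}", ""]
    for title, url, body in pages:
        # Each page becomes its own section, with its URL kept for attribution.
        parts += [f"## {title}", f"Source: {url}", "", body.strip(), ""]
    return "\n".join(parts).rstrip() + "\n"


if __name__ == "__main__":
    pages = [
        ("Guide", "https://example.com/guide", "Body text of the guide.\n"),
        ("FAQ", "https://example.com/faq", "Body text of the FAQ.\n"),
    ]
    print(build_llms_full("Example Site", "Short site summary.", pages))
```

In practice the page bodies would come from your CMS export or a Markdown conversion step, which is outside the scope of this sketch.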
Our recommendation at PingPrime: yes, but without expecting miracles in the short term. The protocol is not (yet) officially supported by OpenAI, Google or Anthropic. There is therefore no guarantee that it influences AI citations today. But three reasons make it a reasonable investment.
Do not replace your robots.txt with an llms.txt. The two play complementary roles: robots.txt manages rights, llms.txt suggests priorities. See our guide to optimizing for AI engines to integrate these files into a complete GEO strategy.
Verification happens in three steps: test the robots.txt, read server logs, and monitor actual citations. According to BrightEdge, 48% of Google queries trigger an AI Overview on average at the end of 2025, with peaks at 88% in health and 82% in B2B Tech (BrightEdge, 2025). Without monitoring AI bot visits on your pages, you are flying blind.
Before anything else, verify what your robots.txt actually allows. Three free tools are enough.
curl https://yoursite.com/robots.txt to retrieve the actual file served by your server (sometimes different from what the CMS says).
Server logs are the ground truth. They record every visit from AI bots with their user-agent. On Apache or Nginx, filter logs on the following strings:
# Search for visits from main AI bots in Nginx
# (Google-Extended is a robots.txt token only: Google crawls with its regular user agents, so it never shows up in logs)
grep -E "GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|PerplexityBot|Perplexity-User|CCBot" /var/log/nginx/access.log
# Count hits per bot in the current log file (concatenate rotated logs to cover the last 30 days)
awk '/GPTBot/ {gpt++} /ClaudeBot/ {claude++} /PerplexityBot/ {perp++} END {print "GPTBot:", gpt+0, "ClaudeBot:", claude+0, "PerplexityBot:", perp+0}' /var/log/nginx/access.log
On a healthy site, you should see regular visits from GPTBot and CCBot (training, weekly), more frequent visits from ChatGPT-User and Perplexity-User (retrieval, triggered by user queries), and visits from PerplexityBot (continuous indexing). The total absence of a bot for 30 days is a red flag.
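The same check can be scripted so that absent bots are flagged automatically. This is a minimal sketch, assuming the common "combined" access-log layout in which the user agent is the last quoted field; the sample lines are made up for illustration (simplified user-agent strings, documentation IP ranges), and `count_bot_hits` and `missing_bots` are hypothetical helper names.

```python
import re
from collections import Counter

AI_BOTS = (
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "anthropic-ai", "PerplexityBot", "Perplexity-User", "CCBot",
)

# In the combined log format, the user agent is the last quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

# Made-up, simplified log lines used only to exercise the functions.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "ExampleFetcher/1.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Jan/2026:10:05:00 +0000] "GET /pricing/ HTTP/1.1" 200 734 "-" "ExampleFetcher/1.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2026:11:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "ExampleFetcher/1.0 (compatible; Perplexity-User/1.0)"',
    '192.0.2.9 - - [10/Jan/2026:12:00:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]


def count_bot_hits(lines) -> Counter:
    """Count requests per AI bot, keeping a zero entry for bots never seen."""
    hits = Counter({bot: 0 for bot in AI_BOTS})
    for line in lines:
        match = UA_RE.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[bot] += 1
                break
    return hits


def missing_bots(hits: Counter) -> list:
    """List the bots with no recorded visit in the analysed period."""
    return [bot for bot in AI_BOTS if hits[bot] == 0]


if __name__ == "__main__":
    hits = count_bot_hits(SAMPLE_LINES)
    for bot in AI_BOTS:
        print(f"{bot:16} {hits[bot]}")
    print("No visits from:", ", ".join(missing_bots(hits)))
```

To apply it to real data, pass an open file object covering the last 30 days instead of `SAMPLE_LINES`; if your log format is customised, the user-agent extraction will need adjusting.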
Bot presence is necessary but not sufficient. You also need to verify whether your pages are actually cited in AI answers. Several approaches exist for this.
chatgpt.com, perplexity.ai, claude.ai, copilot.microsoft.com, gemini.google.com.
Our field observation. Based on our audits, the lag between "my robots.txt allows the bots" and "my pages are actually cited by AIs" is several weeks, sometimes more. Count 4 to 8 weeks after a robots.txt overhaul to observe the first citations in Perplexity, and 8 to 16 weeks for ChatGPT. Steering must therefore be patient and monthly, not weekly.
To set up a complete tracking approach, see our complete guide to monitoring AI citations and our free tools page that brings together several steering resources.
Blocking GPTBot does not reduce your visibility in ChatGPT immediately, but it does in the medium term. GPTBot is used to train future versions of GPT. Blocking it prevents your content from being absorbed into the next generation of the model. For real-time retrieval, the user agents that matter are ChatGPT-User and OAI-SearchBot. With 800M weekly users (OpenAI, October 2025), blocking those two cuts you off from the live citation flow.
Robots.txt alone does not protect your content: it is a voluntary standard, not a law. Major players (OpenAI, Anthropic, Google, Perplexity) publicly state that they respect it, but other scrapers or open source models can ignore it. For real legal protection, you need to combine robots.txt, legal notices (Terms of Service prohibiting AI scraping) and possibly technical measures (rate limiting, WAF, Cloudflare AI Labyrinth). According to Bain & Company, 60% of searches end without a click in 2025, which makes the defensive posture increasingly costly.
Googlebot indexes the web for classic Google search (and feeds AI Overviews via this index). Google-Extended is a distinct robots.txt token rather than a separate crawler: it controls whether content fetched by Google's regular crawlers may be used for training Gemini and improving AI features. Blocking Google-Extended does not affect your Google SEO or your appearance in AI Overviews in the short term, but limits Gemini's learning about your brand. To understand the Gemini and AI Mode stakes, see our dossier Google AI Mode: what it changes for your visibility.
Sitemap.xml and llms.txt are complementary, so keep both. Sitemap.xml lists all URLs for search engines, with no value hierarchy. llms.txt selects priority pages for LLMs, with a natural language description. For a brand seeking to be cited, llms.txt acts as an "FAQ for LLMs." On the Answer-First format that maximizes your chances of citation, see our guide Structuring an Answer-First page to be cited by AI.
In 2026, configuring your robots.txt correctly is no longer just a sysadmin task; it is a strategic marketing decision. Blocking GPTBot, ClaudeBot or PerplexityBot means choosing invisibility in channels that now account for 800M weekly ChatGPT users, 780M monthly Perplexity queries and nearly 50% of Google queries triggering an AI Overview. Conversely, allowing these bots while publishing a clear llms.txt positions you as a candidate source for citation.
First concrete action to take this week: retrieve your current robots.txt (curl https://yoursite.com/robots.txt), verify that no major AI crawler is accidentally blocked, and publish a minimal llms.txt listing your 10 to 15 strategic pages.
To go further, two resources: our complete GEO audit guide covering robots.txt, llms.txt, schema and editorial structure, and our pillar guide to optimization for AI engines. If you want to audit your configuration with our team, contact PingPrime.