Robots.txt Generator - Block AI Crawlers & Configure SEO

How to Use the Robots.txt Generator

Start by choosing whether to allow all crawlers by default. Add any paths you want to disallow, such as admin panels, private directories, or API endpoints. Enter your sitemap URL so crawlers can find your full site structure. Use the AI crawler toggles to allow or block individual AI training crawlers. Optionally enable Googlebot-specific rules and set a crawl delay. Click Generate to produce the robots.txt content, then copy it into a file named robots.txt and upload that file to your website root.
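The generation step can be pictured roughly as follows. This is a minimal sketch, not the tool's actual implementation: the function name build_robots_txt and its parameters are hypothetical, chosen only to mirror the form options described above.

```python
def build_robots_txt(disallow_paths, sitemap_url=None, blocked_agents=(), crawl_delay=None):
    """Assemble robots.txt text from a few generator-style options (hypothetical helper)."""
    lines = ["User-agent: *"]
    if disallow_paths:
        lines += [f"Disallow: {path}" for path in disallow_paths]
    else:
        # An empty Disallow value means nothing is blocked.
        lines.append("Disallow:")
    if crawl_delay is not None:
        lines.append(f"Crawl-delay: {crawl_delay}")

    # One group per blocked crawler, separated from the previous group by a blank line.
    for agent in blocked_agents:
        lines += ["", f"User-agent: {agent}", "Disallow: /"]

    # The Sitemap line stands outside the groups and applies to all crawlers.
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    disallow_paths=["/admin/", "/api/"],
    sitemap_url="https://example.com/sitemap.xml",
    blocked_agents=["GPTBot", "CCBot"],
)
print(robots_txt)
```

The printed text is what you would save as robots.txt in the site root.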

Understanding Robots.txt

The robots.txt file is one of the oldest and most fundamental web standards for controlling how automated programs interact with your website. Created in 1994 and formalized as RFC 9309 in 2022, the Robots Exclusion Protocol provides a simple text-based format that crawlers read before accessing your site. Major search engines and the crawlers run by large AI companies check for this file, although compliance is voluntary and many scraping tools ignore it.

How Robots.txt Works

When a well-behaved crawler visits your website, it requests the /robots.txt file before fetching other URLs, and it typically caches the result for a while. The file contains groups of rules, each introduced by one or more User-agent lines (the crawler’s identity) and followed by Allow and Disallow directives for specific URL paths. If no robots.txt exists, crawlers assume they can access everything. If the file exists but no group matches a particular crawler, either by name or through the * wildcard, that crawler also assumes full access.
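The default-access behavior can be checked with Python's standard urllib.robotparser module. This is a small sketch: ExampleBot is a made-up crawler name, example.com is a placeholder, and parse_rules is a helper defined here for illustration.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Parse robots.txt text held in a string (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


url = "https://example.com/page"

# A file whose only group targets GPTBot says nothing about other crawlers.
only_gptbot = parse_rules("User-agent: GPTBot\nDisallow: /\n")
gptbot_allowed = only_gptbot.can_fetch("GPTBot", url)     # False
other_allowed = only_gptbot.can_fetch("ExampleBot", url)  # True

# A file with no rules at all leaves everything open.
empty_allowed = parse_rules("").can_fetch("ExampleBot", url)  # True
```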

Syntax and Structure

Each block (or group) in a robots.txt file starts with a User-agent line identifying which crawler the rules apply to. The wildcard asterisk matches all crawlers. A crawler follows only the group that matches it most specifically: if a group names the crawler, the crawler uses that group and ignores the wildcard group. Following the User-agent line, Disallow directives specify paths the crawler should not access, while Allow directives explicitly permit access to specific paths within a disallowed directory.

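Path matching is prefix-based. A Disallow of /admin blocks /admin, /admin/login, /admin/settings, and any other URL whose path starts with /admin, which also includes /administrator. Use /admin/ with a trailing slash to limit the rule to that directory. The slash alone (/) represents the entire site, so Disallow: / blocks everything. Major crawlers such as Googlebot and Bingbot also support the * wildcard and the $ end-of-path anchor inside paths.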
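Prefix matching is easy to verify with Python's standard urllib.robotparser module. The sketch below assumes a made-up crawler name (ExampleBot) and the placeholder host example.com.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("User-agent: *\nDisallow: /admin\n".splitlines())

paths = ["/admin", "/admin/login", "/administrator", "/blog/admin", "/"]

# Map each path to True if the rule blocks it, False otherwise.
blocked = {
    path: not parser.can_fetch("ExampleBot", "https://example.com" + path)
    for path in paths
}

for path, is_blocked in blocked.items():
    print(f"{path:15} {'blocked' if is_blocked else 'allowed'}")
```

The first three paths are blocked because they begin with /admin; /blog/admin and / are allowed because the prefix must match from the start of the path.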

Blocking AI Crawlers

The rapid growth of AI language models has made robots.txt more relevant than ever. AI companies send crawlers across the web to gather training data, and many website owners want to control whether their content is used for this purpose.

Known AI Crawler User Agents

GPTBot is OpenAI’s web crawler used to gather training data for GPT models. ClaudeBot is Anthropic’s crawler for Claude’s training data. CCBot crawls for Common Crawl, a massive open web archive used by many AI companies. Google-Extended is not a separate crawler but a control token: Google’s existing crawlers fetch the pages, and rules for Google-Extended tell Google whether that content may be used for its AI models, without affecting how Googlebot indexes the site for search. Bytespider is ByteDance’s crawler, and PerplexityBot crawls for the Perplexity AI search engine.

Blocking these crawlers individually gives you fine-grained control. You might choose to block AI training crawlers while keeping search engine crawlers fully enabled, or you might allow some AI companies access while blocking others.
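As an illustration, the sketch below blocks three AI crawlers while leaving every other crawler unrestricted, then checks the result with Python's standard urllib.robotparser module. Which crawlers to block is an arbitrary choice made for this example, and example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/articles/hello"
agents = ["Googlebot", "GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"]

# True means the crawler may fetch the URL under these rules.
access = {agent: parser.can_fetch(agent, url) for agent in agents}
print(access)
```

Several User-agent lines can share one set of rules, which keeps the file short when many crawlers get the same treatment. PerplexityBot is not named in the file, so it falls under the wildcard group and remains allowed.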

SEO Considerations

Your robots.txt file directly affects how search engines crawl your site, and therefore what they are able to index. Accidentally blocking important pages stops search engines from reading their content, which can drop those pages from search results or leave them listed as a bare URL with no description. Conversely, allowing crawlers to access low-value pages wastes your crawl budget, which is the number of pages search engines will crawl in a given time period.

Crawl Budget Optimization

Large websites benefit from strategic robots.txt configuration. Block faceted navigation URLs, internal search result pages, tag archives with duplicate content, print-friendly page versions, and staging or development directories. This focuses crawler attention on your most important content.

Common Paths to Disallow

Typical paths to block include /admin, /wp-admin, /cgi-bin, /tmp, and any directories containing duplicate content or internal tools. For WordPress sites, blocking /wp-admin while allowing /wp-admin/admin-ajax.php is a common pattern that keeps crawlers out of admin pages while still letting them fetch the AJAX endpoint that some themes and plugins rely on for front-end content.
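The WordPress pattern can be sketched and checked with Python's standard urllib.robotparser module. ExampleBot and example.com are placeholders. The Allow line is listed first on purpose: Python's parser applies the first rule that matches, while Google applies the most specific (longest) matching rule, and this ordering gives the same answer under both approaches.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://example.com"
ajax_ok = parser.can_fetch("ExampleBot", base + "/wp-admin/admin-ajax.php")  # True
admin_ok = parser.can_fetch("ExampleBot", base + "/wp-admin/options.php")    # False
post_ok = parser.can_fetch("ExampleBot", base + "/2024/hello-world/")        # True
```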

Robots.txt and Meta Tags Working Together

The robots.txt file controls access at the URL path level, while the robots meta tag (configured per-page in your HTML head) controls indexing at the page level. Use the Meta Tag Generator to set per-page robots directives that complement your robots.txt rules. For example, you might allow crawling of a page in robots.txt but set a noindex meta tag to prevent it from appearing in search results. A crawler can only see a noindex tag on a page it is allowed to fetch, so do not also block that page in robots.txt.

Ensure your website also has a proper privacy policy that discloses your use of analytics and tracking, since the same crawlers you configure in robots.txt may also be relevant to your data collection disclosures.

Frequently Asked Questions

What is a robots.txt file and where does it go?

A robots.txt file is a plain text file placed at the root of your website (e.g., example.com/robots.txt) that tells web crawlers which pages or sections they may or may not access. It follows the Robots Exclusion Protocol, a standard that nearly all legitimate crawlers respect. The file must be named exactly 'robots.txt' and placed in the root directory. Each host, including each subdomain, needs its own file.

Can robots.txt block AI crawlers like GPTBot and ClaudeBot?

Yes, you can block AI company crawlers by adding specific User-agent rules to your robots.txt file. For example, adding 'User-agent: GPTBot' followed by 'Disallow: /' will tell OpenAI's crawler not to crawl your site. Similarly, ClaudeBot (Anthropic), CCBot (Common Crawl), Google-Extended (Google AI), Bytespider (ByteDance), and PerplexityBot can all be blocked individually.

Does robots.txt actually prevent crawling?

Robots.txt is a voluntary standard, not a security mechanism. Well-behaved crawlers from major companies (Google, Bing, OpenAI, Anthropic) respect robots.txt directives. However, malicious bots may ignore it entirely. If you need to truly prevent access to content, use server-side authentication or access controls instead. Robots.txt is best understood as a request to crawlers, not an enforceable restriction.

Should I add a sitemap to my robots.txt?

Yes, including a Sitemap directive in your robots.txt is a widely recommended SEO practice. It points crawlers directly to your XML sitemap, which lists all the pages you want indexed. This helps search engines discover your content more efficiently, especially for large sites or pages that are not well-linked internally. The sitemap URL must be the full absolute URL.
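As a quick illustration, Python's standard urllib.robotparser module (version 3.8 or later) can read the Sitemap directive back out of a robots.txt file. The URL and paths below are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['https://example.com/sitemap.xml']
```

The Sitemap line is independent of the User-agent groups, so it can appear anywhere in the file.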

What is a crawl delay and should I use one?

The Crawl-delay directive tells crawlers to wait a specified number of seconds between requests. This can reduce server load from aggressive crawling. However, Google does not support the Crawl-delay directive: Googlebot adjusts its crawl rate automatically, and the crawl rate setting that used to be available in Google Search Console was retired in early 2024. Bing and some other crawlers do respect the directive, but support varies, so check each crawler's documentation. Use it sparingly, as excessive delays can prevent timely indexing of new content.
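From a crawler's point of view, reading the directive might look like the sketch below, which uses Python's standard urllib.robotparser module. The delay value, the paths, and the crawler name ExampleBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 5
Disallow: /tmp/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

bing_delay = parser.crawl_delay("bingbot")      # 5
other_delay = parser.crawl_delay("ExampleBot")  # None: its group sets no delay

# A polite crawler would pause this many seconds between requests.
seconds_to_wait = other_delay or 0
```

Because the delay is set per group, it applies only to the crawlers that the group names.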