How to Generate a robots.txt File
Create a properly formatted robots.txt file for your website with our free Robots.txt Generator. Control how search engines crawl your site.
Steps
Add your sitemap URL
Enter the full URL of your XML sitemap (e.g., https://example.com/sitemap.xml). Including the sitemap directive in robots.txt ensures search engine crawlers discover it even if it has not been submitted via Search Console.
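For example, a minimal robots.txt that allows everything and advertises the sitemap (the URL is a placeholder for your own):

```
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
```

The Sitemap directive is independent of user-agent groups and can appear anywhere in the file; you can list several Sitemap lines if you have more than one sitemap.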
Define rules for Googlebot
Add Allow and Disallow rules for Googlebot specifically. Use Disallow: /admin/ to block the admin panel, Disallow: /private/ for private sections, and Disallow: /*? to block all URLs containing a query string (useful for preventing crawling of internal search results pages). Note that every rule path must start with a /, so a bare ?* is not valid syntax.
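Assembled as a file, the Googlebot group described above would look like this (the paths are examples):

```
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /*?
```

In the last rule, * matches any sequence of characters, so /*? matches any URL whose path-and-query contains a ?.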
Add rules for other user agents
Add separate sections for other major crawlers: Bingbot, DuckDuckBot, Yandex, and AhrefsBot. You can also add rules for AI training crawlers like GPTBot and Google-Extended if you want to block them from training AI models on your content.
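A sketch of those extra groups; blocking AhrefsBot and the AI crawlers outright is shown only as an example, and which crawlers you block is a policy decision for your site:

```
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Yandex
Disallow: /admin/
Disallow: /private/

User-agent: AhrefsBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

A group may list several User-agent lines that share one set of rules, and each blank-line-separated group is evaluated independently.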
Set crawl delay (optional)
Add a Crawl-delay directive to slow down aggressive crawlers if your server cannot handle the default crawl rate. Note that Googlebot ignores Crawl-delay — use Google Search Console to set Googlebot's crawl rate instead.
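For crawlers that honour it (Bing has historically been one), the delay is expressed in seconds between successive requests; the value 10 here is just an example:

```
User-agent: Bingbot
Crawl-delay: 10
```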
Test and deploy
Test your robots.txt rules before deploying. In Google Search Console, the robots.txt report shows how Google fetched and parsed your file (it replaced the older standalone robots.txt Tester). Upload the file to the root of your domain (https://yourdomain.com/robots.txt); it must be served from that exact root path to be recognised, as a robots.txt file in a subdirectory is ignored.
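You can also sanity-check rules locally with Python's standard-library robots.txt parser before uploading. The rules in the string below are illustrative placeholders; one caveat is that urllib.robotparser applies rules in file order (first match wins) rather than Google's longest-match rule, so place Allow lines before the overlapping Disallow lines when testing this way.

```python
# Sketch: validate robots.txt rules locally with Python's stdlib parser.
# The rules below are illustrative placeholders, not a recommendation.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /api/public/
Disallow: /api/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check which URLs a generic crawler ("*") may fetch under these rules.
print(parser.can_fetch("*", "https://example.com/api/public/data"))  # True
print(parser.can_fetch("*", "https://example.com/api/internal"))     # False
print(parser.can_fetch("*", "https://example.com/admin/login"))      # False
print(parser.can_fetch("*", "https://example.com/blog/post"))        # True
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```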
robots.txt Syntax and Common Patterns
A robots.txt file consists of groups of directives. Each group starts with a User-agent line specifying which crawler the rules apply to (* means all crawlers), followed by Allow and Disallow directives with URL paths. Paths are prefix-matched: Disallow: /admin/ blocks all URLs starting with /admin/, Disallow: / blocks everything, and Disallow: with an empty value allows everything. The Allow directive overrides Disallow for more specific paths, so you can pair Disallow: /api/ with Allow: /api/public/. Common sections to disallow include /admin/, /login/, /wp-admin/ (WordPress), /private/, /checkout/, /cart/, /*.pdf$ (to save crawl budget on PDFs), and /search? (to block internal search results pages that create duplicate content).
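Putting those pieces together, a sketch of a typical file using the patterns above (adjust the paths to your site):

```
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /wp-admin/
Disallow: /private/
Disallow: /checkout/
Disallow: /cart/
Disallow: /*.pdf$
Disallow: /search?
Allow: /api/public/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml
```

In the wildcard patterns, * matches any sequence of characters and $ anchors the match to the end of the URL, so /*.pdf$ blocks URLs ending in .pdf but not /report.pdf?page=2.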
Crawl Budget and Why robots.txt Matters for Large Sites
Crawl budget is the number of pages Googlebot crawls on your site within a given time period. For small sites (under a few thousand pages), crawl budget is rarely a concern — Google crawls all pages promptly. For large sites with hundreds of thousands of URLs, managing crawl budget becomes important: you want Google spending its crawl time on your most valuable pages rather than on thin parameter pages, duplicate content, pagination, or filtered views. Proper robots.txt configuration, combined with a clean XML sitemap, ensures crawlers prioritise your best content. Signs of crawl budget waste include duplicate pages from URL parameters (sort, filter, tracking parameters), session ID URLs, infinite scroll or calendar navigation, and search results pages.
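As an illustration, rules targeting the crawl-waste patterns above might look like the sketch below; the parameter names sort, filter, and sessionid are hypothetical, so substitute the parameters your site actually generates:

```
User-agent: *
# Parameterized duplicates of listing pages
Disallow: /*sort=
Disallow: /*filter=
# Session ID URLs
Disallow: /*sessionid=
# Internal search results pages
Disallow: /search?
```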
Frequently Asked Questions
Does blocking a URL in robots.txt keep it out of Google's search results?
Blocking a URL in robots.txt prevents Googlebot from crawling it, but does NOT prevent it from appearing in search results. Google can still index a URL it cannot crawl if other pages link to it — it just won't know the content. To completely prevent a page from appearing in search results, use a noindex meta tag in the page's <head>: <meta name='robots' content='noindex'>. Use robots.txt to prevent crawling of low-value pages; use noindex to prevent indexing.
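A minimal example of the noindex approach, placed in the page's HTML rather than in robots.txt:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Tells compliant crawlers to keep this page out of their index -->
  <meta name="robots" content="noindex">
  <title>Private page</title>
</head>
<body>...</body>
</html>
```

For the tag to work, the page must remain crawlable, so do not also Disallow it in robots.txt.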
What is the difference between Disallow and noindex?
Disallow in robots.txt instructs crawlers not to visit a URL — they cannot read the content, so they cannot index it from crawling. However, they can still show the URL in results if they learn about it from links. noindex is a directive in the page's HTML that tells crawlers: you may visit and read this page, but do not add it to your index. For most cases where you want a page out of search results, noindex is the correct tool; use Disallow for pages where crawling wastes crawl budget (duplicate content, filtered views, paginated parameters). Never combine the two on the same page: if the URL is disallowed, crawlers never see the noindex tag.
Should I block AI training crawlers like GPTBot?
This is your choice, and blocking them is increasingly common. AI training crawlers such as OpenAI's GPTBot and Google's Google-Extended respect robots.txt disallow rules. To block them, add User-agent: GPTBot with Disallow: /, and User-agent: Google-Extended with Disallow: /. Note that not all AI crawlers respect robots.txt — less reputable scrapers may ignore it, so blocking only affects crawlers that voluntarily comply.
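Written out as complete robots.txt groups, those rules look like this:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```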