A robots.txt file tells search engine crawlers which parts of your website they are allowed to access. Configured correctly, it protects server resources and keeps low-value pages out of crawl queues. Configured incorrectly, it can accidentally block Googlebot from your entire site and wipe your rankings overnight. This guide shows you exactly how to configure a robots.txt file safely, with real examples, syntax rules, and the most common mistakes to avoid.
Whether you are setting up a new site, reviewing an inherited codebase, or recovering from a sudden traffic drop, understanding your robots.txt file is one of the highest-leverage technical SEO tasks you can perform.
robots.txt controls crawling, not indexing. A page blocked by robots.txt can still appear in Google's index if other sites link to it — Google just cannot read its content. To prevent indexing, use a noindex meta tag instead.
What Is a Robots.txt File and How Does It Work?
A robots.txt file is a plain text file placed at the root of your domain — https://yourdomain.com/robots.txt. It follows the Robots Exclusion Protocol, a standard that well-behaved crawlers like Googlebot, Bingbot, and others read before accessing any other part of your site.
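Because the file always sits at the root of a host, its location can be derived from any page URL on that host. A minimal Python sketch of that rule (the helper name `robots_url` is illustrative, not part of any library):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yourdomain.com/blog/post-1?ref=home"))
# https://yourdomain.com/robots.txt
print(robots_url("https://shop.yourdomain.com/cart/"))
# https://shop.yourdomain.com/robots.txt
```

Note that each subdomain is a separate host, so a file at yourdomain.com does not cover shop.yourdomain.com.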
How crawlers use robots.txt
Before Googlebot crawls any page on your domain, it fetches and caches your robots.txt file. It then checks every URL it intends to visit against the rules in that file. If a URL is disallowed, Googlebot skips it and moves on — it will not crawl the page, though it may still know the page exists from links.
What robots.txt can and cannot do
| robots.txt CAN do this | robots.txt CANNOT do this |
|---|---|
| Block Googlebot from crawling a URL | Remove a page from Google's index |
| Block specific crawlers by user-agent | Prevent a page from appearing in search if it has inbound links |
| Set your sitemap location | Control what anchor text shows in results |
| Keep crawlers out of admin, staging, and duplicate paths | Block malicious bots (they ignore robots.txt) |
| Manage crawl budget on large sites | Replace authentication for sensitive content |
Robots.txt Syntax Rules You Must Know
Understanding the syntax is essential before making any changes. Robots.txt uses a small set of directives, and even a single formatting mistake can produce unexpected results.
The four core directives
- User-agent: specifies which crawler the following rules apply to. Use `*` for all crawlers.
- Disallow: specifies a path the crawler should not access. An empty Disallow value means allow everything.
- Allow: overrides a Disallow rule for a specific path. Supported by Google and Bing and defined in RFC 9309, though some older crawlers may ignore it.
- Sitemap: specifies the full URL of your XML sitemap. Recommended on all robots.txt files.
Syntax rules that trip people up
- Each directive goes on its own line; you cannot combine them
- Rules are case-sensitive: `/Admin/` and `/admin/` are treated as different paths
- Lines beginning with `#` are comments and are ignored by crawlers
- Each group of rules must start with a `User-agent` line
- A blank line separates rule groups for different user-agents
- Wildcards: `*` matches any sequence of characters; `$` anchors to end of URL
    User-agent: *
    # Block all URLs with ?sessionid= parameter
    Disallow: /*?sessionid=
    # Block all .pdf files site-wide
    Disallow: /*.pdf$
    # Block all URLs containing /print/ (patterns must start with /)
    Disallow: /print/
    Disallow: /*/print/
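To reason about what a wildcard pattern will match, it can help to translate it into a regular expression. The sketch below approximates the documented behaviour of `*` and `$`; it is not Google's actual matcher, and the helper names `pattern_to_regex` and `matches` are illustrative.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then rejoin the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # re.match only matches from the start of the string, like a path prefix.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/files/guide.pdf"))              # True
print(matches("/*.pdf$", "/files/guide.pdf?download=1"))   # False
print(matches("/*?sessionid=", "/shop/item?sessionid=abc"))  # True
print(matches("/admin/", "/Admin/users"))                  # False (case-sensitive)
```

The second result shows why `$` matters: without it, `/*.pdf` would also match the URL carrying a query string.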
How to Configure Robots.txt Safely: Step by Step
Here is a reliable, safe process for configuring a robots.txt file that protects your site without accidentally blocking search crawlers.
Step 1 — Start with the safest possible base
If you are not sure what to block, start with an open configuration and add restrictions only where needed. An empty or fully open robots.txt is far safer than one with untested Disallow rules.
    User-agent: *
    Disallow:

    Sitemap: https://yourdomain.com/sitemap.xml
An empty Disallow: value means allow everything. This is the safest baseline — it tells crawlers they are welcome everywhere and points them to your sitemap.
Step 2 — Identify what to block
Think carefully about which paths should be protected. Good candidates for Disallow rules are paths that are never useful for searchers to find, consume crawl budget without benefit, or expose sensitive functionality.
- CMS admin areas (`/wp-admin/`, `/admin/`, `/dashboard/`)
- Login and registration pages (`/login/`, `/register/`)
- Shopping cart and checkout paths (`/cart/`, `/checkout/`)
- Internal search results pages (`/search?`, `/?s=`)
- Staging or development subdirectories
- Duplicate print-friendly pages (`/print/`)
- User account pages (`/my-account/`, `/profile/`)
Step 3 — Write and test your rules before publishing
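Never publish robots.txt changes without testing them first. Run the draft through a robots.txt validator against a list of URLs that must stay crawlable. After deployment, use the robots.txt report and the URL Inspection tool in Google Search Console to confirm that none of those URLs are accidentally blocked by your new rules.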
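One way to test a draft before it goes live is a small script that checks it against URLs that must remain crawlable. The sketch below uses Python's standard-library parser; the draft rules, the URL list, and the helper name `find_blocked` are all hypothetical. Be aware that this parser uses plain prefix matching and takes the first matching rule, so it does not reproduce `*` and `$` wildcards or Google's longest-match precedence.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft: "/blog" was meant to block one page but matches the whole blog.
DRAFT = """\
User-agent: *
Disallow: /cart/
Disallow: /blog
"""

# Hypothetical URLs that must stay crawlable.
MUST_CRAWL = [
    "https://yourdomain.com/",
    "https://yourdomain.com/blog/first-post/",
    "https://yourdomain.com/products/widget/",
]


def find_blocked(robots_txt: str, urls: list, user_agent: str = "Googlebot") -> list:
    """Return the URLs from `urls` that the draft rules would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


blocked = find_blocked(DRAFT, MUST_CRAWL)
print(blocked)  # ['https://yourdomain.com/blog/first-post/']
```

An empty list is the result to aim for before publishing.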
Step 4 — Reference your sitemap
Always include your sitemap URL in your robots.txt file. This helps search engines find your sitemap even before it is submitted via Search Console, and is good practice for every site.
You can include multiple Sitemap lines if you maintain separate sitemaps for different content types; a sitemap index file needs only one line. Each Sitemap entry goes on its own line and must be a full URL, for example:
    Sitemap: https://yourdomain.com/sitemap-posts.xml
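To confirm which sitemaps a robots.txt file declares, the standard-library parser exposes them through `site_maps()` (Python 3.8 or later). This is a minimal sketch; the second sitemap URL and the helper name `sitemap_urls` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file; sitemap-pages.xml is an assumed second sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap-posts.xml
Sitemap: https://yourdomain.com/sitemap-pages.xml
"""


def sitemap_urls(robots_txt: str) -> list:
    """Return every Sitemap URL declared in the file (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


declared = sitemap_urls(ROBOTS_TXT)
print(declared)

# Sitemap values should be absolute URLs, so flag anything that is not.
relative = [url for url in declared if not url.startswith(("http://", "https://"))]
print("Relative sitemap URLs:", relative)
```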
Real-World robots.txt Examples by Site Type
Different types of websites have different crawl budget concerns and protection needs. Here are safe, production-ready configurations for the most common scenarios.
WordPress blog or content site
    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-login.php
    Disallow: /?s=
    Allow: /wp-admin/admin-ajax.php
    # /wp-includes/ stays crawlable: it serves core CSS and JS needed to render pages

    Sitemap: https://yourdomain.com/sitemap.xml
E-commerce store
    User-agent: *
    Disallow: /cart/
    Disallow: /checkout/
    Disallow: /my-account/
    Disallow: /order-confirmation/
    # Wildcard so the rule applies on any page, not only the home page
    Disallow: /*?add-to-cart=
    Disallow: /admin/

    Sitemap: https://yourdomain.com/sitemap_index.xml
Large site with crawl budget concerns
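    # Rules for every crawler that has no group of its own
    User-agent: *
    Disallow: /admin/
    Disallow: /internal-search/
    Disallow: /*?sort=
    Disallow: /*?filter=
    Disallow: /*?page=
    Disallow: /tag/
    Disallow: /author/

    # Googlebot follows only its own group, so the shared rules are repeated here
    User-agent: Googlebot
    Disallow: /admin/
    Disallow: /internal-search/
    Disallow: /*?sort=
    Disallow: /*?filter=
    Disallow: /*?page=
    Disallow: /tag/
    Disallow: /author/
    Disallow: /staging/

    Sitemap: https://yourdomain.com/sitemap_index.xml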
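One subtlety with per-crawler groups: a crawler that finds a group naming it follows that group alone and ignores the `User-agent: *` rules. The sketch below illustrates this with Python's standard-library parser and a hypothetical file in which the Googlebot group lists only `/staging/`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the Googlebot group does not repeat the /admin/ rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_admin = parser.can_fetch("Googlebot", "https://yourdomain.com/admin/")
bingbot_admin = parser.can_fetch("Bingbot", "https://yourdomain.com/admin/")

print("Googlebot may crawl /admin/:", googlebot_admin)  # True
print("Bingbot may crawl /admin/:", bingbot_admin)      # False
```

Any rule that should apply to a named crawler therefore has to appear inside that crawler's own group.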
The Most Dangerous robots.txt Mistakes to Avoid
These are the mistakes that cause the most serious SEO damage — some of them have wiped entire sites from Google's index for days or weeks before being caught.
    User-agent: *
    Disallow: /
This blocks every crawler from every page on your entire site. It is commonly left in place after development and not noticed until rankings collapse. Always check for this before going live.
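A pre-launch check for this pattern can be automated. The following is a deliberately conservative sketch: it flags any `Disallow: /` inside a group that includes `User-agent: *` and does not consider whether an `Allow` rule might partly override it. The function name `has_sitewide_block` is illustrative.

```python
def has_sitewide_block(robots_txt: str) -> bool:
    """Return True if a group covering all crawlers contains 'Disallow: /'."""
    agents = []        # user-agents of the group currently being read
    in_rules = False   # True once the current group has at least one rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


print(has_sitewide_block("User-agent: *\nDisallow: /\n"))        # True
print(has_sitewide_block("User-agent: *\nDisallow: /admin/\n"))  # False
```

Running a check like this in a deployment pipeline could fail the build when a development-only file is about to reach production.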
Other mistakes that cause serious crawl damage
| Mistake | What It Does | Fix |
|---|---|---|
| `Disallow: /` | Blocks your entire site | Change to `Disallow:` (empty) to allow all |
| Blocking CSS and JS files | Prevents Google from rendering pages properly | Allow all static assets unless you have a specific reason not to |
| Blocking a folder that contains your blog posts | Entire blog section disappears from Google | Audit folder paths before adding Disallow rules |
| Case mismatch (`/Admin/` vs `/admin/`) | Rule does not apply to the path you intend | Match the exact case used in your actual URLs |
| Disallowing pages to keep them out of the index | Pages cannot be crawled but may still appear as URL-only stubs | Use noindex tags for index control, not robots.txt |
| Missing User-agent line before rules | Rules that belong to no group are ignored by crawlers | Every rule group must start with a User-agent line |
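Some of the structural mistakes in the table can be caught mechanically. The sketch below is a small, hypothetical linter limited to the four core directives covered in this guide; it will report other directives that some crawlers accept (such as Crawl-delay) as unrecognised, so treat its output as hints rather than verdicts.

```python
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap"}


def lint(robots_txt: str) -> list:
    """Return (line_number, message) pairs for likely structural mistakes."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_DIRECTIVES:
            problems.append((number, f"unrecognised directive '{field}'"))
        elif field == "user-agent":
            seen_agent = True
        elif field in ("allow", "disallow"):
            if not seen_agent:
                problems.append((number, "rule appears before any User-agent line"))
            if value and not value.startswith(("/", "*")):
                problems.append((number, "path should start with '/'"))
    return problems


SAMPLE = "Disallow: /admin/\nUser-agent: *\nDisallow: admin/\n"
for number, message in lint(SAMPLE):
    print(f"line {number}: {message}")
```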
Using robots.txt to Manage Crawl Budget
For sites with tens of thousands of pages, crawl budget becomes a real concern. Google limits how much it crawls each site based on how quickly the server responds and how much demand there is for the site's content. Wasting crawl budget on low-value pages means important pages get crawled less frequently.
Pages that consume crawl budget without adding value
- Faceted navigation with URL parameters (`?color=red&size=large`)
- Infinite pagination (`/page/2/`, `/page/3/` beyond a few pages)
- Internal search result pages
- Session IDs appended to URLs
- Duplicate category and tag archive pages
- Printer-friendly page versions
- Development or staging subdirectories left live
Crawl budget is most relevant for sites above 10,000 indexable URLs. For smaller sites, Google will typically crawl everything it can find regardless of robots.txt optimisation. Focus on fixing errors and improving page speed first.
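To see which query parameters actually generate URL variants on a site, one option is to count parameter names across a URL export, for example from server logs or a site crawl. This is a rough sketch with a hypothetical URL list and an illustrative helper name, `parameter_counts`.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Hypothetical sample of crawled URLs.
URLS = [
    "https://yourdomain.com/shoes?color=red&size=large",
    "https://yourdomain.com/shoes?color=blue",
    "https://yourdomain.com/shoes?sessionid=abc123",
    "https://yourdomain.com/shoes",
]


def parameter_counts(urls: list) -> Counter:
    """Count how many times each query parameter name appears across the URLs."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        for name, _value in parse_qsl(query, keep_blank_values=True):
            counts[name] += 1
    return counts


for name, count in parameter_counts(URLS).most_common():
    print(f"{name}: {count}")
```

Parameters that dominate the counts without changing page content are the usual candidates for a Disallow pattern.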
How to Test and Validate Your robots.txt File
Never assume your robots.txt is working as intended — always verify it. Even a small typo can produce unexpected results that may not be obvious until you notice a traffic drop weeks later.
Testing methods
- Google Search Console robots.txt report: go to Settings > robots.txt to see which robots.txt files Google found, when each was last fetched, and any warnings or errors. This report replaced the older robots.txt Tester and does not test individual URLs.
- URL Inspection tool: inspect individual URLs in Search Console to see if Googlebot can access them. A blocked URL is reported as "Blocked by robots.txt".
- Fetch your file directly: visit `yourdomain.com/robots.txt` in a browser to confirm the file exists and reads as expected.
- Third-party validators: tools like Merkle's robots.txt tester allow you to paste your file and test specific URLs against it without needing Search Console access.
After making changes, use the SEOGuy SEO Analyzer to quickly inspect a URL and confirm it is crawlable and properly accessible to search engines.
Targeting Specific Crawlers
You can write different rules for different crawlers by using specific user-agent names instead of the wildcard *. This is useful when you want to block AI training bots but allow Googlebot, or give one crawler more permissive access than others.
Common crawler user-agent names
| Crawler | User-Agent Name | Use Case |
|---|---|---|
| Google Search | Googlebot | Primary search indexing |
| Google Images | Googlebot-Image | Image search indexing |
| Bing Search | Bingbot | Microsoft Bing indexing |
| OpenAI GPT | GPTBot | AI training data collection |
| Anthropic Claude | ClaudeBot | AI training data collection |
| Common Crawl | CCBot | Open web archive crawling |
    # Allow all standard search crawlers
    User-agent: *
    Disallow: /admin/
    Allow: /

    # Block AI training bots
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    Sitemap: https://yourdomain.com/sitemap.xml
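When the list of blocked crawlers changes often, generating the file from data can reduce copy-and-paste errors. The sketch below is one possible approach, with an illustrative helper named `render_robots`. It relies on the fact that several User-agent lines may share a single group of rules.

```python
def render_robots(groups: list, sitemap: str = "") -> str:
    """Render (user_agents, rules) pairs as robots.txt text.

    Each rule is a (directive, path) tuple such as ("Disallow", "/admin/").
    """
    blocks = []
    for agents, rules in groups:
        lines = [f"User-agent: {agent}" for agent in agents]
        lines += [f"{directive}: {path}" for directive, path in rules]
        blocks.append("\n".join(lines))
    text = "\n\n".join(blocks)  # a blank line separates groups
    if sitemap:
        text += f"\n\nSitemap: {sitemap}"
    return text + "\n"


robots_txt = render_robots(
    [
        (["*"], [("Disallow", "/admin/")]),
        (["GPTBot", "ClaudeBot", "CCBot"], [("Disallow", "/")]),
    ],
    sitemap="https://yourdomain.com/sitemap.xml",
)
print(robots_txt)
```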
robots.txt vs noindex: When to Use Each
One of the most common points of confusion in technical SEO is when to use robots.txt and when to use a noindex tag. They serve different purposes and using the wrong one creates problems.
| Situation | Use |
|---|---|
| Page should never appear in Google search results | noindex tag |
| Page wastes crawl budget and has no value | robots.txt Disallow |
| Admin or login page — should not be crawled at all | robots.txt Disallow |
| Page should not be indexed but canonical is important | noindex tag |
| Blocking a page in robots.txt AND adding noindex | Conflict: avoid |
| Keeping a page private from all users | Authentication/password protection |
Do not add a noindex tag to a page that is also blocked by robots.txt. If Google cannot crawl the page, it cannot read the noindex directive. The page may then appear in Google's index as a URL-only stub with no description — exactly what you were trying to prevent.
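To audit for this conflict, you first need to know which pages carry a noindex meta tag. The sketch below detects one in an HTML string using only the standard library. The class and function names are illustrative, and it checks meta tags only, not an `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Record whether a robots meta tag containing 'noindex' is present."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        if name in ("robots", "googlebot") and "noindex" in content:
            self.noindex = True


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```

A page for which `has_noindex` returns True and which is also disallowed in robots.txt is in the conflicting state described above: the tag exists, but a compliant crawler will never fetch the page to read it.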
Generate a Safe robots.txt in Seconds
Use the SEOGuy Robots.txt Generator to build a correctly structured, safe robots.txt file for your site — no risk of accidentally blocking Googlebot. Choose your site type, select what to block, and download.
Key Takeaways
- robots.txt controls crawling, not indexing — blocked pages can still appear in search if linked from elsewhere
- Start with an open configuration and add Disallow rules only for paths you have audited carefully
- Never publish `Disallow: /` on a live site; it blocks every crawler from your entire website
- Always include your sitemap URL in your robots.txt file
- Test your robots.txt in Google Search Console before relying on it in production
- Rules are case-sensitive — match the exact path casing used in your real URLs
- Use noindex tags to remove pages from Google's index; use robots.txt only to manage crawl access
- Never combine robots.txt Disallow with noindex — the noindex tag cannot be read if the page is blocked
- For large sites, block low-value parameter URLs and duplicate path patterns to protect crawl budget
- You can target specific crawlers by name — useful for blocking AI training bots while allowing Googlebot
- Validate your file after every change and recheck after any CMS update or migration
Configuring a robots.txt file safely comes down to one principle: default to open, restrict deliberately, and always verify your changes before they go live. A well-maintained robots.txt is a powerful tool for directing crawler attention where it matters most — and an untested one is one of the fastest ways to lose your rankings.