Robots.txt Configuration Guide: Control How Search Engines Crawl Your Site
Robots.txt is your first line of communication with search engines. Learn how to configure it properly to improve SEO and control crawling.
What Is Robots.txt?
Robots.txt is a text file that tells search engine crawlers which pages or sections of your website they can or cannot access. It's part of the Robots Exclusion Protocol (REP) and serves as the first file crawlers look for when visiting your site.
How Robots.txt Works
- Crawler arrives at your domain
- Checks for robots.txt at yourdomain.com/robots.txt
- Reads the rules for its specific user-agent
- Follows or ignores based on directives
- Proceeds to crawl allowed pages
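The lookup sequence above can be simulated with Python's standard-library `urllib.robotparser`. The rules are parsed inline here to keep the sketch self-contained; in practice you would point the parser at your live file with `set_url()` and `read()`:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse rules inline; in production, set_url("https://example.com/robots.txt")
# followed by read() fetches the real file.
rp.parse("""\
User-agent: *
Disallow: /admin/
""".splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
```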
Important Limitations
Robots.txt is NOT:
- A security mechanism (doesn't block access)
- A guarantee pages won't be indexed
- Required (crawlers will index without it)
- Retroactive (existing indexed pages stay)
Robots.txt IS:
- A suggestion that well-behaved crawlers honor
- A crawl efficiency tool
- A way to steer crawlers toward your important content
Robots.txt Syntax
Basic Structure
User-agent: [crawler name]
Disallow: [URL path to block]
Allow: [URL path to allow]
Sitemap: [sitemap URL]
Core Directives
User-agent: Specifies which crawler the rules apply to.
User-agent: Googlebot
User-agent: Bingbot
User-agent: * # All crawlers
Disallow: Blocks access to specified paths.
Disallow: /private/
Disallow: /admin/
Disallow: /temp.html
Allow: Explicitly allows access (overrides Disallow).
Disallow: /directory/
Allow: /directory/public/
Sitemap: Points to your XML sitemap location.
Sitemap: https://example.com/sitemap.xml
Pattern Matching
Wildcards:
| Symbol | Meaning | Example |
|---|---|---|
| * | Match any sequence of characters | Disallow: /*.pdf |
| $ | End of URL | Disallow: /*.php$ |
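Robots.txt itself ships no reference matcher for these wildcards, but their behavior can be approximated by translating a pattern into a regular expression. This is a minimal sketch for experimentation; real crawlers such as Googlebot implement their own matching:

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern using '*' and '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as "any sequence".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

# /*.pdf$ matches any URL path ending in .pdf
print(bool(pattern_to_regex("/*.pdf$").match("/docs/report.pdf")))      # True
print(bool(pattern_to_regex("/*.pdf$").match("/docs/report.pdf?v=2")))  # False
```

Note how the `$` anchor makes the second URL escape the rule: its path does not end in `.pdf`, which is exactly why query-parameter variants need their own patterns.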
# Block all PDF files
Disallow: /*.pdf$
# Block URLs with query parameters
Disallow: /*?*
# Block all .php files
Disallow: /*.php$
# Block specific parameter
Disallow: /*?sessionid=
Common Robots.txt Configurations
Allow All Crawlers
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Block All Crawlers
User-agent: *
Disallow: /
Warning: This blocks all crawling. Use only for development or staging sites.
Block Specific Directories
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Disallow: /cgi-bin/
Sitemap: https://example.com/sitemap.xml
WordPress Configuration
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-json/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /*?s=
Disallow: /*?p=
Disallow: /tag/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
E-commerce Configuration
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
Disallow: /compare/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Disallow: /search/
Disallow: /review/
Allow: /products/
Allow: /categories/
Sitemap: https://shop.com/sitemap-index.xml
Block Specific Bots
# Block aggressive crawlers
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12bot
Disallow: /
# Allow main search engines
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /private/
Major Search Engine Crawlers
Crawler Names
| Search Engine | User-agent |
|---|---|
| Google | Googlebot |
| Google Images | Googlebot-Image |
| Google News | Googlebot-News |
| Bing | Bingbot |
| Yahoo | Slurp |
| DuckDuckGo | DuckDuckBot |
| Baidu | Baiduspider |
| Yandex | YandexBot |
Specific Crawler Rules
# Different rules for different crawlers
User-agent: Googlebot
Disallow: /google-specific/
User-agent: Bingbot
Disallow: /bing-specific/
User-agent: *
Disallow: /general-block/
SEO Best Practices
Do's
1. Always include a sitemap reference:
Sitemap: https://example.com/sitemap.xml
2. Block low-value content:
- Duplicate pages
- Parameter-generated pages
- Print versions
- Internal search results
3. Optimize crawl budget:
- Block utility pages
- Prevent crawling of filtered/sorted pages
- Limit parameter crawling
4. Test before deploying:
- Use Google Search Console robots.txt tester
- Verify no important pages are blocked
Don'ts
1. Don't block important resources:
# BAD - blocks CSS/JS needed for rendering
Disallow: /css/
Disallow: /js/
2. Don't use for security:
# BAD - sensitive URLs are now public
Disallow: /secret-admin-panel/
Disallow: /hidden-backup-files/
3. Don't block your sitemap:
# BAD
Disallow: /sitemap.xml
4. Don't create syntax errors:
# BAD - missing colon
User-agent *
Disallow /admin
Troubleshooting Common Issues
Problem: Pages Still Being Indexed
Possible Causes:
- External links to the page
- Previous indexing before robots.txt update
- Robots.txt rule not matching URL pattern
Solutions:
- Add a noindex meta tag to the page (and allow crawling so the tag can be seen)
- Request URL removal in Search Console
- Verify robots.txt rules with the tester
Problem: Important Pages Not Indexed
Check:
- Robots.txt not blocking the pages
- No noindex tags on pages
- Pages included in sitemap
- Pages have internal links
Problem: CSS/JS Not Loading in Search Results
Cause: Blocking resource files
Fix:
Allow: /css/
Allow: /js/
Allow: /images/
Problem: Crawl Budget Wasted
Signs:
- Server logs show excessive crawling
- Unimportant pages crawled frequently
- Important pages updated slowly
# Block parameter variations
Disallow: /*?sort=
Disallow: /*?order=
Disallow: /*?filter=
Disallow: /*?ref=
Disallow: /*?session=
Testing Robots.txt
Google Search Console
- Go to Search Console
- Navigate to robots.txt Tester
- Test specific URLs
- Check for errors and warnings
Manual Testing
Check robots.txt is accessible:
https://yourdomain.com/robots.txt
Verify:
- File returns 200 status
- Correct syntax used
- No typos in paths
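Beyond browser checks, the "missing colon" class of syntax error can be caught offline with a small script. This is a toy linter, not a full validator, and the directive list is a common subset rather than an exhaustive one:

```python
def lint_robots(text: str) -> "list[tuple[int, str]]":
    """Flag directive-looking lines that are missing the required colon."""
    known = ("user-agent", "disallow", "allow", "sitemap", "crawl-delay", "host")
    problems = []
    for n, line in enumerate(text.splitlines(), 1):
        stripped = line.split("#", 1)[0].strip()  # ignore comments and blanks
        if not stripped or ":" in stripped:
            continue
        if stripped.split()[0].lower() in known:
            problems.append((n, line))
    return problems

bad = "User-agent *\nDisallow /admin"
print(lint_robots(bad))  # flags both lines
```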
Testing Tools
- Google Search Console robots.txt Tester
- Bing Webmaster Tools
- Online robots.txt validators
- Browser direct access
Advanced Configurations
Crawl-delay Directive
Note: Not supported by Google, but works for Bing/Yandex.
User-agent: Bingbot
Crawl-delay: 10
User-agent: YandexBot
Crawl-delay: 5
Multiple Sitemaps
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-images.xml
Host Directive
For Yandex (preferred domain):
Host: https://www.example.com
Robots.txt vs. Other Methods
Comparison
| Method | Purpose | Crawling | Indexing |
|---|---|---|---|
| Robots.txt | Crawl control | Blocks | May still index |
| Meta robots | Page-level | Allows | Controls indexing |
| X-Robots-Tag | Header-level | Allows | Controls indexing |
| Password | Access control | Blocks | Blocks |
When to Use Each
Robots.txt:
- Large sections of site
- Crawl budget optimization
- Utility directories
Meta robots / X-Robots-Tag:
- Individual pages
- Duplicate content
- Thin content pages
Password protection:
- Truly private content
- Member-only areas
- Staging environments
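For the page-level methods in the table, indexing is controlled in the HTTP response itself rather than in robots.txt. A minimal sketch of building such headers (the function name is illustrative, not from any framework):

```python
def page_headers(indexable: bool) -> "dict[str, str]":
    """Build response headers for a page; robots.txt cannot do this,
    since it only controls crawling, not indexing."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if not indexable:
        # X-Robots-Tag also works for non-HTML files (PDFs, images),
        # where a <meta name="robots"> tag is impossible.
        headers["X-Robots-Tag"] = "noindex, nofollow"
    return headers

print(page_headers(indexable=False)["X-Robots-Tag"])  # noindex, nofollow
```

Remember that a crawler can only see this header if the URL is not blocked in robots.txt.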
Using the Robots.txt Generator
ToolPop's Robots.txt Generator helps you:
- Create custom rules with easy interface
- Select common patterns for your platform
- Add multiple user-agents and rules
- Include sitemap references
- Generate valid syntax automatically
Steps to Generate
- Choose your website type
- Select directories to block
- Add custom rules if needed
- Include sitemap URL
- Generate and download
Conclusion
A well-configured robots.txt file:
- Guides crawlers to your important content
- Protects crawl budget from waste
- Prevents indexing of low-value pages
- Communicates sitemap location
- Improves SEO efficiency
Try Our Free Tools
Put these tips into practice with our free online tools. No signup required.
Explore Tools