
Robots.txt Configuration Guide: Control How Search Engines Crawl Your Site

Robots.txt is your first line of communication with search engines. Learn how to configure it properly to improve SEO and control crawling.

ToolPop Team · March 8, 2025 · 14 min read

What Is Robots.txt?

Robots.txt is a text file that tells search engine crawlers which pages or sections of your website they can or cannot access. It's part of the Robots Exclusion Protocol (REP) and serves as the first file crawlers look for when visiting your site.

How Robots.txt Works

1. The crawler arrives at your domain
2. It checks for robots.txt at yourdomain.com/robots.txt
3. It reads the rules for its specific user-agent
4. It follows (or ignores) the directives
5. It proceeds to crawl the allowed pages
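The question a crawler answers at step 4 — "may this user-agent fetch this path?" — can be sketched with Python's standard-library parser. One caveat in the sketch: `urllib.robotparser` applies rules in file order (first match wins) rather than Google's longest-rule-wins precedence, so the Allow exception is listed before the broader Disallow. The rules and paths are illustrative.

```python
from urllib import robotparser

# Illustrative rules only. urllib.robotparser applies rules in file
# order (first match wins), unlike Google's longest-rule-wins
# matching, so the Allow exception precedes the Disallow it carves
# an exception out of.
rules = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/private/press/release.html"))  # True
print(rp.can_fetch("Googlebot", "/private/notes.html"))          # False
```

Against a live site you would call `rp.set_url("https://yourdomain.com/robots.txt")` and `rp.read()` instead of parsing a literal string.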

Important Limitations

Robots.txt is NOT:

  • A security mechanism (it doesn't actually block access to the URLs)
  • A guarantee that pages won't be indexed (externally linked URLs can still appear in results)
  • Required (crawlers will crawl and index your site without it)
  • Retroactive (pages that are already indexed stay in the index)

Robots.txt IS:

  • A suggestion that well-behaved crawlers honor
  • A crawl-efficiency tool
  • A way to prioritize your important content

Robots.txt Syntax

Basic Structure

User-agent: [crawler name]
Disallow: [URL path to block]
Allow: [URL path to allow]
Sitemap: [sitemap URL]

Core Directives

User-agent: Specifies which crawler the rules apply to.

User-agent: Googlebot
User-agent: Bingbot
User-agent: *  # All crawlers

Disallow: Blocks access to specified paths.

Disallow: /private/
Disallow: /admin/
Disallow: /temp.html

Allow: Explicitly allows access within an otherwise disallowed path (for Google, the more specific rule wins).

Disallow: /directory/
Allow: /directory/public/

Sitemap: Points to your XML sitemap location.

Sitemap: https://example.com/sitemap.xml

Pattern Matching

Wildcards:

| Symbol | Meaning | Example |
|--------|---------|---------|
| * | Matches any sequence of characters | Disallow: /*.pdf |
| $ | Anchors the end of the URL | Disallow: /*.php$ |

Examples:

# Block all PDF files
Disallow: /*.pdf$

# Block URLs with query parameters
Disallow: /*?*

# Block all .php files
Disallow: /*.php$

# Block specific parameter
Disallow: /*?sessionid=
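Python's stdlib robots parser ignores these wildcards, so if you want to sanity-check pattern rules yourself, a small translator to regular expressions does the job. This is a sketch of Google-style matching; `robots_pattern_to_regex` is my own name, not a library function.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex (sketch of
    Google-style matching): '*' matches any run of characters, a
    trailing '$' anchors the end of the URL, everything else is
    literal, and patterns without '$' match as prefixes."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    if not anchored:
        regex += ".*"  # prefix match when there is no end anchor
    return re.compile("^" + regex + "$")

print(bool(robots_pattern_to_regex("/*.pdf$").match("/docs/file.pdf")))      # True
print(bool(robots_pattern_to_regex("/*.pdf$").match("/docs/file.pdf?x=1")))  # False
print(bool(robots_pattern_to_regex("/*?sessionid=").match("/page?sessionid=abc")))  # True
```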

Common Robots.txt Configurations

Allow All Crawlers

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Block All Crawlers

User-agent: *
Disallow: /

Warning: This blocks all crawling and effectively prevents new indexing. Use only for development or staging sites.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Disallow: /cgi-bin/

Sitemap: https://example.com/sitemap.xml

WordPress Configuration

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-json/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /*?s=
Disallow: /*?p=
Disallow: /tag/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml

E-commerce Configuration

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
Disallow: /compare/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Disallow: /search/
Disallow: /review/
Allow: /products/
Allow: /categories/

Sitemap: https://shop.com/sitemap-index.xml

Block Specific Bots

# Block aggressive crawlers
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /

# Allow main search engines
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /private/

Major Search Engine Crawlers

Crawler Names

| Search Engine | User-agent |
|---------------|------------|
| Google | Googlebot |
| Google Images | Googlebot-Image |
| Google News | Googlebot-News |
| Bing | Bingbot |
| Yahoo | Slurp |
| DuckDuckGo | DuckDuckBot |
| Baidu | Baiduspider |
| Yandex | YandexBot |

Specific Crawler Rules

# Different rules for different crawlers
User-agent: Googlebot
Disallow: /google-specific/

User-agent: Bingbot
Disallow: /bing-specific/

User-agent: *
Disallow: /general-block/
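A subtlety worth remembering: a crawler obeys only the single group that matches it best and ignores the rest, so in the example above Googlebot follows the Googlebot group and skips the `*` rules entirely. A hypothetical helper sketches that selection (`select_group` is my own name; real parsers match on user-agent tokens, not full strings).

```python
def select_group(groups, user_agent):
    """Pick the one rule group a crawler obeys: the most specifically
    matching named User-agent group, falling back to '*' only when no
    named group matches. Hypothetical sketch, not a full parser."""
    ua = user_agent.lower()
    # Longest matching token wins, e.g. 'googlebot-image' beats 'googlebot'.
    candidates = [name for name in groups if name != "*" and name.lower() in ua]
    if candidates:
        return groups[max(candidates, key=len)]
    return groups.get("*", [])

groups = {
    "Googlebot": ["Disallow: /google-specific/"],
    "Bingbot": ["Disallow: /bing-specific/"],
    "*": ["Disallow: /general-block/"],
}
print(select_group(groups, "Googlebot/2.1"))  # only the Googlebot rules
print(select_group(groups, "DuckDuckBot"))    # falls back to the '*' group
```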

SEO Best Practices

Do's

1. Always include a sitemap reference:

Sitemap: https://example.com/sitemap.xml

2. Block low-value content:

  • Duplicate pages
  • Parameter-generated pages
  • Print versions
  • Internal search results

3. Protect crawl budget:

  • Block utility pages
  • Prevent crawling of filtered/sorted pages
  • Limit parameter crawling

4. Test before deploying:

  • Use Google Search Console's robots.txt report
  • Verify that no important pages are blocked

Don'ts

1. Don't block important resources:

# BAD - blocks CSS/JS needed for rendering
Disallow: /css/
Disallow: /js/

2. Don't use for security:

# BAD - sensitive URLs are now public
Disallow: /secret-admin-panel/
Disallow: /hidden-backup-files/

3. Don't block your sitemap:

# BAD
Disallow: /sitemap.xml

4. Don't create syntax errors:

# BAD - missing colon
User-agent *
Disallow /admin
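Mistakes like the missing colons above are easy to catch mechanically. A rough lint pass flags any non-comment line that isn't `Directive: value` with a known directive name. This is an illustrative sketch, not a full validator, and `lint_robots_txt` is my own name.

```python
import re

VALID_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay", "host"}

def lint_robots_txt(text):
    """Flag lines that are not blank, not comments, and not
    'Directive: value' with a recognized directive name."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), 1):
        stripped = line.split("#", 1)[0].strip()  # drop comments
        if not stripped:
            continue
        m = re.match(r"([A-Za-z-]+)\s*:", stripped)
        if not m or m.group(1).lower() not in VALID_DIRECTIVES:
            problems.append(f"line {lineno}: {line!r}")
    return problems

good = "User-agent: *\nDisallow: /admin/\n"
bad = "User-agent *\nDisallow /admin\n"  # missing colons

print(lint_robots_txt(good))       # []
print(len(lint_robots_txt(bad)))   # 2
```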

Troubleshooting Common Issues

Problem: Pages Still Being Indexed

Possible Causes:

  • External links to the page
  • Indexing that happened before the robots.txt update
  • A robots.txt rule not matching the URL pattern

Solutions:

  • Add a noindex meta tag to the page (the crawler must be able to fetch the page to see the tag, so don't also block it in robots.txt)
  • Request URL removal in Search Console
  • Verify your robots.txt rules with a tester

Problem: Important Pages Not Indexed

Check:

  • Robots.txt not blocking the pages
  • No noindex tags on pages
  • Pages included in sitemap
  • Pages have internal links

Problem: CSS/JS Not Loading in Search Results

Cause: Blocking resource files

Fix:

Allow: /css/
Allow: /js/
Allow: /images/

Problem: Crawl Budget Wasted

Signs:

  • Server logs show excessive crawling
  • Unimportant pages crawled frequently
  • Important pages updated slowly

Solution:

# Block parameter variations
Disallow: /*?sort=
Disallow: /*?order=
Disallow: /*?filter=
Disallow: /*?ref=
Disallow: /*?session=
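To decide which parameters deserve a rule, count them in the URLs a crawler actually requests. A sketch with made-up URLs follows; in practice you would extract the request paths from access-log lines whose user-agent is a search engine bot.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Made-up crawler-requested URLs; in practice, pull these from your
# server's access log, filtered to search engine user-agents.
crawled_urls = [
    "/products?sort=price",
    "/products?sort=name",
    "/products?filter=red",
    "/products/widget",
    "/products?sort=price&page=2",
]

# Count how often each query parameter is crawled; frequent ones are
# candidates for a 'Disallow: /*?param=' rule.
param_hits = Counter()
for url in crawled_urls:
    for param in parse_qs(urlparse(url).query):
        param_hits[param] += 1

print(param_hits.most_common())
```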

Testing Robots.txt

Google Search Console

  • Open Search Console
  • Go to Settings → robots.txt to view the robots.txt report
  • Check the fetch status and any errors or warnings
  • Use the URL Inspection tool to test whether specific URLs are blocked

Manual Testing

Check robots.txt is accessible:

https://yourdomain.com/robots.txt

Verify:

  • File returns 200 status
  • Correct syntax used
  • No typos in paths
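These checks can be automated. So the sketch below runs anywhere, it serves an illustrative robots.txt from a throwaway local server; in practice you would point `urlopen` at https://yourdomain.com/robots.txt and assert on the status code.

```python
import http.server
import threading
import urllib.request

# Serve a small illustrative robots.txt locally so the check is
# self-contained; replace the URL with your real domain in practice.
class RobotsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            body = b"User-agent: *\nDisallow: /admin/\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep output quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), RobotsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/robots.txt"
with urllib.request.urlopen(url) as resp:
    status = resp.status
    text = resp.read().decode()
server.shutdown()

print(status)               # 200
print("Disallow" in text)   # True
```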

Testing Tools

  • Google Search Console robots.txt Tester
  • Bing Webmaster Tools
  • Online robots.txt validators
  • Browser direct access

Advanced Configurations

Crawl-delay Directive

Note: Not supported by Google, but works for Bing/Yandex.

User-agent: Bingbot
Crawl-delay: 10

User-agent: YandexBot
Crawl-delay: 5
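Python's stdlib parser exposes these values, which is handy if you run your own polite crawler. An illustrative sketch:

```python
from urllib import robotparser

# Illustrative rules: only Bingbot is given a delay.
rules = """\
User-agent: Bingbot
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.crawl_delay("Bingbot"))    # 10
print(rp.crawl_delay("Googlebot"))  # None (no matching group)
```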

Multiple Sitemaps

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-images.xml

Host Directive

Historically used by Yandex to declare a preferred domain, though Yandex has since deprecated it in favor of 301 redirects:

Host: https://www.example.com

Robots.txt vs. Other Methods

Comparison

| Method | Purpose | Crawling | Indexing |
|--------|---------|----------|----------|
| Robots.txt | Crawl control | Blocks | May still index |
| Meta robots | Page-level control | Allows | Controls indexing |
| X-Robots-Tag | Header-level control | Allows | Controls indexing |
| Password | Access control | Blocks | Blocks |

When to Use Each

Robots.txt:

  • Large sections of a site
  • Crawl budget optimization
  • Utility directories

Meta robots noindex:

  • Individual pages
  • Duplicate content
  • Thin content pages

Password Protection:

  • Truly private content
  • Member-only areas
  • Staging environments

Using the Robots.txt Generator

ToolPop's Robots.txt Generator helps you:

  • Create custom rules with easy interface
  • Select common patterns for your platform
  • Add multiple user-agents and rules
  • Include sitemap references
  • Generate valid syntax automatically

Steps to Generate

  • Choose your website type
  • Select directories to block
  • Add custom rules if needed
  • Include sitemap URL
  • Generate and download

Conclusion

A well-configured robots.txt file:

  • Guides crawlers to your important content
  • Protects crawl budget from waste
  • Prevents indexing of low-value pages
  • Communicates sitemap location
  • Improves SEO efficiency

Use ToolPop's free Robots.txt Generator to create a properly configured file for your website. Test it thoroughly before deployment, and monitor your search console for any crawling issues!

Tags
robots.txt · search engine crawling · SEO · web crawlers · site indexing · crawl control · sitemap