Robots.txt Configuration Guide: Control How Search Engines Crawl Your Site
Robots.txt is your first line of communication with search engines. Learn how to configure it properly to improve SEO and control crawling.
What Is Robots.txt?
Robots.txt is a text file that tells search engine crawlers which pages or sections of your website they can or cannot access. It's part of the Robots Exclusion Protocol (REP) and serves as the first file crawlers look for when visiting your site.
How Robots.txt Works
- Crawler arrives at your domain
- Checks for robots.txt at yourdomain.com/robots.txt
- Reads the rules for its specific user-agent
- Follows or ignores based on directives
- Proceeds to crawl allowed pages
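The lookup sequence above can be simulated with Python's standard-library `urllib.robotparser`. The rules are parsed inline here to keep the sketch self-contained; in practice you would point the parser at your live file with `set_url()` and `read()`:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse rules inline; in production, set_url("https://example.com/robots.txt")
# followed by read() fetches the real file.
rp.parse("""\
User-agent: *
Disallow: /admin/
""".splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
```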
Important Limitations
Robots.txt is NOT:
- A security mechanism (doesn't block access)
- A guarantee pages won't be indexed
- Required (crawlers will index without it)
- Retroactive (existing indexed pages stay)
Robots.txt IS:
- A suggestion that well-behaved crawlers honor
- A crawl efficiency tool
- A way to steer crawlers toward your important content
Robots.txt Syntax
Basic Structure
User-agent: [crawler name]
Disallow: [URL path to block]
Allow: [URL path to allow]
Sitemap: [sitemap URL]
Core Directives
User-agent: Specifies which crawler the rules apply to.
User-agent: Googlebot
User-agent: Bingbot
User-agent: * # All crawlers
Disallow: Blocks access to specified paths.
Disallow: /private/
Disallow: /admin/
Disallow: /temp.html
Allow: Explicitly allows access (overrides Disallow).
Disallow: /directory/
Allow: /directory/public/
Sitemap: Points to your XML sitemap location.
Sitemap: https://example.com/sitemap.xml
Pattern Matching
Wildcards:
| Symbol | Meaning | Example |
|---|---|---|
| * | Match any sequence of characters | Disallow: /*.pdf |
| $ | End of URL | Disallow: /*.php$ |
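Robots.txt itself ships no reference matcher for these wildcards, but their behavior can be approximated by translating a pattern into a regular expression. This is a minimal sketch for experimentation; real crawlers such as Googlebot implement their own matching:

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern using '*' and '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as "any sequence".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

# /*.pdf$ matches any URL path ending in .pdf
print(bool(pattern_to_regex("/*.pdf$").match("/docs/report.pdf")))      # True
print(bool(pattern_to_regex("/*.pdf$").match("/docs/report.pdf?v=2")))  # False
```

Note how the `$` anchor makes the second URL escape the rule: its path does not end in `.pdf`, which is exactly why query-parameter variants need their own patterns.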
# Block all PDF files
Disallow: /*.pdf$
# Block URLs with query parameters
Disallow: /*?*
# Block all .php files
Disallow: /*.php$
# Block specific parameter
Disallow: /*?sessionid=
Common Robots.txt Configurations
Allow All Crawlers
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Block All Crawlers
User-agent: *
Disallow: /
Warning: This blocks all crawling. Use only for development or staging sites.
Block Specific Directories
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Disallow: /cgi-bin/
Sitemap: https://example.com/sitemap.xml
WordPress Configuration
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-json/
Disallow: /trackback/
Disallow: /xmlrpc.php
Disallow: /*?s=
Disallow: /*?p=
Disallow: /tag/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
E-commerce Configuration
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
Disallow: /compare/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Disallow: /search/
Disallow: /review/
Allow: /products/
Allow: /categories/
Sitemap: https://shop.com/sitemap-index.xml
Block Specific Bots
# Block aggressive crawlers
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12bot
Disallow: /
# Allow main search engines
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /private/
Major Search Engine Crawlers
Crawler Names
| Search Engine | User-agent |
|---|---|
| Google | Googlebot |
| Google Images | Googlebot-Image |
| Google News | Googlebot-News |
| Bing | Bingbot |
| Yahoo | Slurp |
| DuckDuckGo | DuckDuckBot |
| Baidu | Baiduspider |
| Yandex | YandexBot |
Specific Crawler Rules
# Different rules for different crawlers
User-agent: Googlebot
Disallow: /google-specific/
User-agent: Bingbot
Disallow: /bing-specific/
User-agent: *
Disallow: /general-block/
SEO Best Practices
Do's
1. Always include a sitemap reference:
Sitemap: https://example.com/sitemap.xml
2. Block low-value content:
- Duplicate pages
- Parameter-generated pages
- Print versions
- Internal search results
3. Optimize crawl budget:
- Block utility pages
- Prevent crawling of filtered/sorted pages
- Limit parameter crawling
4. Test before deploying:
- Use Google Search Console robots.txt tester
- Verify no important pages are blocked
Don'ts
1. Don't block important resources:
# BAD - blocks CSS/JS needed for rendering
Disallow: /css/
Disallow: /js/
2. Don't use for security:
# BAD - sensitive URLs are now public
Disallow: /secret-admin-panel/
Disallow: /hidden-backup-files/
3. Don't block your sitemap:
# BAD
Disallow: /sitemap.xml
4. Don't create syntax errors:
# BAD - missing colon
User-agent *
Disallow /admin
Troubleshooting Common Issues
Problem: Pages Still Being Indexed
Possible Causes:
- External links to the page
- Previous indexing before robots.txt update
- Robots.txt rule not matching URL pattern
Solutions:
- Add a noindex meta tag to the page (and allow crawling so the tag can be seen)
- Request URL removal in Search Console
- Verify robots.txt rules with the tester
Problem: Important Pages Not Indexed
Check:
- Robots.txt not blocking the pages
- No noindex tags on pages
- Pages included in sitemap
- Pages have internal links
Problem: CSS/JS Not Loading in Search Results
Cause: Blocking resource files
Fix:
Allow: /css/
Allow: /js/
Allow: /images/
Problem: Crawl Budget Wasted
Signs:
- Server logs show excessive crawling
- Unimportant pages crawled frequently
- Important pages updated slowly
# Block parameter variations
Disallow: /*?sort=
Disallow: /*?order=
Disallow: /*?filter=
Disallow: /*?ref=
Disallow: /*?session=
Testing Robots.txt
Google Search Console
- Go to Search Console
- Navigate to robots.txt Tester
- Test specific URLs
- Check for errors and warnings
Manual Testing
Check robots.txt is accessible:
https://yourdomain.com/robots.txt
Verify:
- File returns 200 status
- Correct syntax used
- No typos in paths
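Beyond browser checks, the "missing colon" class of syntax error can be caught offline with a small script. This is a toy linter, not a full validator, and the directive list is a common subset rather than an exhaustive one:

```python
def lint_robots(text: str) -> "list[tuple[int, str]]":
    """Flag directive-looking lines that are missing the required colon."""
    known = ("user-agent", "disallow", "allow", "sitemap", "crawl-delay", "host")
    problems = []
    for n, line in enumerate(text.splitlines(), 1):
        stripped = line.split("#", 1)[0].strip()  # ignore comments and blanks
        if not stripped or ":" in stripped:
            continue
        if stripped.split()[0].lower() in known:
            problems.append((n, line))
    return problems

bad = "User-agent *\nDisallow /admin"
print(lint_robots(bad))  # flags both lines
```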
Testing Tools
- Google Search Console robots.txt Tester
- Bing Webmaster Tools
- Online robots.txt validators
- Browser direct access
Advanced Configurations
Crawl-delay Directive
Note: Not supported by Google, but works for Bing/Yandex.
User-agent: Bingbot
Crawl-delay: 10
User-agent: YandexBot
Crawl-delay: 5
Multiple Sitemaps
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-images.xml
Host Directive
For Yandex (preferred domain):
Host: https://www.example.com
Robots.txt vs. Other Methods
Comparison
| Method | Purpose | Crawling | Indexing |
|---|---|---|---|
| Robots.txt | Crawl control | Blocks | May still index |
| Meta robots | Page-level | Allows | Controls indexing |
| X-Robots-Tag | Header-level | Allows | Controls indexing |
| Password | Access control | Blocks | Blocks |
When to Use Each
Robots.txt:
- Large sections of site
- Crawl budget optimization
- Utility directories
Meta robots / X-Robots-Tag:
- Individual pages
- Duplicate content
- Thin content pages
Password protection:
- Truly private content
- Member-only areas
- Staging environments
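For the page-level methods in the table, indexing is controlled in the HTTP response itself rather than in robots.txt. A minimal sketch of building such headers (the function name is illustrative, not from any framework):

```python
def page_headers(indexable: bool) -> "dict[str, str]":
    """Build response headers for a page; robots.txt cannot do this,
    since it only controls crawling, not indexing."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if not indexable:
        # X-Robots-Tag also works for non-HTML files (PDFs, images),
        # where a <meta name="robots"> tag is impossible.
        headers["X-Robots-Tag"] = "noindex, nofollow"
    return headers

print(page_headers(indexable=False)["X-Robots-Tag"])  # noindex, nofollow
```

Remember that a crawler can only see this header if the URL is not blocked in robots.txt.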
Using the Robots.txt Generator
ToolPop's Robots.txt Generator helps you:
- Create custom rules with easy interface
- Select common patterns for your platform
- Add multiple user-agents and rules
- Include sitemap references
- Generate valid syntax automatically
Steps to Generate
- Choose your website type
- Select directories to block
- Add custom rules if needed
- Include sitemap URL
- Generate and download
Conclusion
A well-configured robots.txt file:
- Guides crawlers to your important content
- Protects crawl budget from waste
- Prevents indexing of low-value pages
- Communicates sitemap location
- Improves SEO efficiency
Try Our Free Tools
Put these tips into practice with our free online tools. No signup required.
Explore Tools