nginx: How To Set A Connection Limit For Search Engine Bots Gone Wild (Especially Bingbot)

Submitted by falko (Contact Author) (Forums) on Mon, 2013-05-27 17:24. :: Web Server | nginx

Version 1.0
Author: Falko Timme <ft [at] falkotimme [dot] com>
Last edited 04/18/2013

As a server administrator you might know this problem: you have done everything to optimize your server and it's working really well, and then a search engine bot (like Bingbot) comes along and hits all your vhosts at the same time with hundreds of connections, driving your server load up. Of course, you don't want to block the bot completely (unless you don't care about that particular search engine), so you can use robots.txt and/or nginx to limit how fast a search engine bot may crawl your server.

I do not issue any guarantee that this will work for you!

 

1 Preliminary Note

I'm focusing on Bingbot in this tutorial because this bot opens an excessive number of connections each time it visits a web site (I haven't noticed this with any other search engine bot). Of course, the first thing you should do is limit the crawl rate in the Bing webmaster tools. If that doesn't help, or if you don't have access to the Bing webmaster tools for all vhosts on your server, read on.

 

2 Using robots.txt

Bingbot understands the Crawl-delay directive (Googlebot doesn't, so don't use this for Googlebot!), so you can add this to your robots.txt file (see http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2012/05/03/to-crawl-or-not-to-crawl-that-is-bingbot-s-question.aspx):

User-Agent: bingbot
Crawl-delay: 1

Because you've created a separate section for Bingbot in your robots.txt, the Allow/Disallow directives under User-Agent: * no longer apply to Bingbot (a bot follows only the most specific section that matches it), so make sure to repeat your Allow/Disallow directives in the Bingbot section, e.g. like this:

User-Agent: *
Disallow: /cache/
Disallow: /engine/
Disallow: /files/
Disallow: /templates/
Disallow: /uploads/
Disallow: /newsletter/
Disallow: /kontaktformular/
Disallow: /widerrufsrecht/
Disallow: /datenschutz-und-sicherheit
Disallow: /agb/
Disallow: /shopware.php/sViewport,admin
Disallow: /shopware.php/sViewport,note
Disallow: /shopware.php/sViewport,basket
Disallow: /shopware.php/sViewport,rma
Disallow: /shopware.php/sViewport,support
Disallow: /shopware.php/sViewport,ticket
Disallow: /shopware.php/sViewport,newsletter
Disallow: /shopware.php/sViewport,tellafriend
Sitemap: http://www.example.com/sitemap.xml

User-Agent: bingbot
Crawl-delay: 1
Disallow: /cache/
Disallow: /engine/
Disallow: /files/
Disallow: /templates/
Disallow: /uploads/
Disallow: /newsletter/
Disallow: /kontaktformular/
Disallow: /widerrufsrecht/
Disallow: /datenschutz-und-sicherheit
Disallow: /agb/
Disallow: /shopware.php/sViewport,admin
Disallow: /shopware.php/sViewport,note
Disallow: /shopware.php/sViewport,basket
Disallow: /shopware.php/sViewport,rma
Disallow: /shopware.php/sViewport,support
Disallow: /shopware.php/sViewport,ticket
Disallow: /shopware.php/sViewport,newsletter
Disallow: /shopware.php/sViewport,tellafriend
Sitemap: http://www.example.com/sitemap.xml
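
To see why the directives have to be repeated, here is a minimal sketch using the robots.txt parser from Python's standard library. It only shows how that parser resolves the two sections; Bingbot's own parser may differ in details. The shortened rule sets, the helper name load and the test URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def load(text):
    """Parse robots.txt content given as a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Bingbot section with Crawl-delay only: the Disallow rule is NOT repeated.
WITHOUT_REPEAT = """User-Agent: *
Disallow: /cache/

User-Agent: bingbot
Crawl-delay: 1
"""

# Bingbot section that repeats the Disallow rule.
WITH_REPEAT = """User-Agent: *
Disallow: /cache/

User-Agent: bingbot
Crawl-delay: 1
Disallow: /cache/
"""

URL = "http://www.example.com/cache/page.html"  # illustrative URL

without_repeat = load(WITHOUT_REPEAT)
with_repeat = load(WITH_REPEAT)

# Is Bingbot allowed to fetch the URL under each variant?
print("without repeat:", without_repeat.can_fetch("bingbot", URL))
print("with repeat:   ", with_repeat.can_fetch("bingbot", URL))
# Crawl delay that the parser reports for Bingbot.
print("crawl delay:   ", with_repeat.crawl_delay("bingbot"))
```

With this parser, the first variant reports the /cache/ URL as allowed for Bingbot and the second reports it as disallowed, which is the behaviour the repeated directives guard against.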

 

3 Using nginx To Control Bot Connections

We can use the HttpGeoModule and the HttpLimitReqModule (together with a map block) to control how many requests search engine bots may send to your nginx server. Open /etc/nginx/nginx.conf...

vi /etc/nginx/nginx.conf

... and add this to your http {} container (before the part where your vhost configuration files are included/defined):

[...]
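geo $isabot {
    default 0;
    # Bingbot
    157.55.32.0/24 1;
    157.56.229.0/24 1;
    157.56.93.0/24 1;
    157.55.33.0/24 1;
}
# Normal clients get an empty key; requests with an empty key are not limited.
map $isabot $limited_ip_key {
    0 '';
    1 $binary_remote_addr;
}
# 5 MB shared memory zone, 2 requests per second per matching client IP.
limit_req_zone $limited_ip_key zone=isabot:5m rate=2r/s;
# Queue up to 200 excess requests before rejecting further ones.
limit_req zone=isabot burst=200;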
[...]

This will limit Bingbot's crawl rate to two requests per second for each client IP address in the listed subnets. Requests that exceed this rate are delayed and queued, up to a burst of 200; once the queue is full, nginx returns a 503 error to Bingbot.
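
The following Python sketch models this behaviour for a single client IP so you can get a feeling for the numbers. It is a simplified leaky-bucket model written for illustration, not nginx's actual implementation; the function name simulate and the sample arrival times are assumptions.

```python
def simulate(arrivals, rate=2.0, burst=200):
    """Simplified leaky-bucket model of a rate limit with a burst queue.

    arrivals: request arrival times in seconds, in ascending order.
    Returns a list of (arrival_time, delay), where delay is the number of
    seconds the request is held back, or None if it is rejected (503).
    """
    excess = 0.0   # requests currently "ahead of" the allowed rate
    last = None    # arrival time of the last accepted request
    results = []
    for t in arrivals:
        if last is None:
            candidate = 0.0
        else:
            # The bucket drains at `rate` per second; each request adds 1.
            candidate = max(0.0, excess - rate * (t - last) + 1.0)
        if candidate > burst:
            results.append((t, None))  # queue full: request is rejected
            continue
        excess, last = candidate, t
        results.append((t, candidate / rate))
    return results


# 300 requests arriving at the same instant from one IP address.
spike = simulate([0.0] * 300)
accepted = [delay for _, delay in spike if delay is not None]
rejected = [delay for _, delay in spike if delay is None]
print("accepted:", len(accepted), "rejected:", len(rejected))
print("longest delay in seconds:", max(accepted))

# 10 requests arriving exactly at the allowed rate (one every 0.5 s).
steady = simulate([i * 0.5 for i in range(10)])
print("delays at the allowed rate:", [delay for _, delay in steady])
```

In this model a burst of 200 at 2r/s means the last queued request waits roughly 100 seconds, so a smaller burst value may be worth considering if you would rather have excess requests rejected quickly than held open for a long time.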

Of course, you are free to add more IPs or subnets to the geo container.
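
Before adding further ranges, it can help to check which client addresses from your access log the geo block would actually match. This is a minimal sketch using Python's ipaddress module; the subnet list is copied from the configuration above, while the function name is_limited and the sample addresses are assumptions.

```python
import ipaddress

# Subnets copied from the geo block above.
LIMITED_SUBNETS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "157.55.32.0/24",
        "157.56.229.0/24",
        "157.56.93.0/24",
        "157.55.33.0/24",
    )
]


def is_limited(addr):
    """Return True if addr falls into one of the rate-limited subnets."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in LIMITED_SUBNETS)


# Sample addresses (illustrative); replace them with IPs from your log.
for addr in ("157.55.32.10", "157.55.34.1", "8.8.8.8"):
    print(addr, "limited" if is_limited(addr) else "not limited")
```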

Don't forget to reload nginx:

/etc/init.d/nginx reload

 

About The Author

Falko Timme is the owner of Timme Hosting (ultra-fast nginx web hosting). He is the lead maintainer of HowtoForge (since 2005) and one of the core developers of ISPConfig (since 2000). He has also contributed to the O'Reilly book "Linux System Administration".

