nginx: How To Set A Connection Limit For Search Engine Bots Gone Wild (Especially Bingbot)
As a server administrator, you might know this problem: you have done everything to optimize your server, it's working really well, and then a stupid search engine bot (like Bingbot) comes along and hits all your vhosts at the same time with hundreds of connections, driving your server load up. Of course, you don't want to block the bot completely (unless you don't care about that particular search engine), so instead you can use robots.txt and/or nginx to control how a search engine bot connects to your server.
1 Preliminary Note
I'm focusing on Bingbot in this tutorial because this bot creates an excessive number of connections each time it visits a web site (I haven't noticed this with any other search engine bot). Of course, the first thing you should do is limit the crawl rate in the Bing Webmaster Tools. If that doesn't help, or you don't have access to Bing Webmaster Tools for all vhosts on your server, read on.
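To check whether Bingbot is really the one hammering your server, you can count its requests per client IP in the access log. Here's a minimal sketch, assuming a standard combined-format log at /var/log/nginx/access.log (adjust the path to match your vhosts); the IPs it reports are also the ones you'd later put into the geo block in chapter 3:

# count Bingbot requests per client IP, busiest IPs first
grep -i bingbot /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head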
2 Using robots.txt
Bingbot understands the Crawl-delay directive (Googlebot doesn't, so don't use this for Googlebot!), so you can use it in your robots.txt file (see http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2012/05/03/to-crawl-or-not-to-crawl-that-is-bingbot-s-question.aspx):
User-Agent: bingbot
Crawl-delay: 1
Because you've created an extra section for Bingbot in your robots.txt, the Allow/Disallow directives in the User-Agent: * section no longer apply to Bingbot, so make sure to repeat them in the Bingbot section, e.g. like this:
User-Agent: *
Disallow: /cache/
Disallow: /engine/
Disallow: /files/
Disallow: /templates/
Disallow: /uploads/
Disallow: /newsletter/
Disallow: /kontaktformular/
Disallow: /widerrufsrecht/
Disallow: /datenschutz-und-sicherheit
Disallow: /agb/
Disallow: /shopware.php/sViewport,admin
Disallow: /shopware.php/sViewport,note
Disallow: /shopware.php/sViewport,basket
Disallow: /shopware.php/sViewport,rma
Disallow: /shopware.php/sViewport,support
Disallow: /shopware.php/sViewport,ticket
Disallow: /shopware.php/sViewport,newsletter
Disallow: /shopware.php/sViewport,tellafriend
Sitemap: http://www.example.com/sitemap.xml

User-Agent: bingbot
Crawl-delay: 1
Disallow: /cache/
Disallow: /engine/
Disallow: /files/
Disallow: /templates/
Disallow: /uploads/
Disallow: /newsletter/
Disallow: /kontaktformular/
Disallow: /widerrufsrecht/
Disallow: /datenschutz-und-sicherheit
Disallow: /agb/
Disallow: /shopware.php/sViewport,admin
Disallow: /shopware.php/sViewport,note
Disallow: /shopware.php/sViewport,basket
Disallow: /shopware.php/sViewport,rma
Disallow: /shopware.php/sViewport,support
Disallow: /shopware.php/sViewport,ticket
Disallow: /shopware.php/sViewport,newsletter
Disallow: /shopware.php/sViewport,tellafriend
Sitemap: http://www.example.com/sitemap.xml
3 Using nginx To Control Bot Connections
We can use nginx's HttpGeoModule and HttpLimitReqModule to control the connections that search engine bots open to the server. Open /etc/nginx/nginx.conf...
vi /etc/nginx/nginx.conf
... and add this to your http {} container (before the part where your vhost configuration files are included/defined):
[...]
geo $isabot {
    default 0;
    # Bingbot
    157.55.32.0/24 1;
    157.56.229.0/24 1;
    157.56.93.0/24 1;
    157.55.33.0/24 1;
}

map $isabot $limited_ip_key {
    0 '';
    1 $binary_remote_addr;
}

limit_req_zone $limited_ip_key zone=isabot:5m rate=2r/s;
limit_req zone=isabot burst=200;
[...]
This limits Bingbot's crawl rate to two requests per second; excess requests are delayed and queued in the burst of 200 (so a sudden burst of 200 requests would be drained at 2r/s, i.e. over roughly 100 seconds). Once the burst queue is full, nginx answers further requests from Bingbot with a 503 error.
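By the way, the limit_req directive doesn't have to live in the http {} container; if you only want to throttle bots on a single vhost, you can place it in that vhost's server {} or location {} block instead, reusing the zone defined above. A minimal sketch (server_name and paths are just placeholders):

server {
    listen 80;
    server_name www.example.com;
    root /var/www/example.com;

    location / {
        # reuses the isabot zone defined in the http {} container
        limit_req zone=isabot burst=200;
        try_files $uri $uri/ =404;
    }
}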
Of course, you are free to add more IPs or subnets to the geo container.
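For example, to cover another subnet you see crawling your sites (203.0.113.0/24 below is only a placeholder; substitute a range you actually find in your logs):

geo $isabot {
    default 0;
    # Bingbot
    157.55.32.0/24 1;
    157.56.229.0/24 1;
    157.56.93.0/24 1;
    157.55.33.0/24 1;
    # placeholder - replace with a subnet from your own logs
    203.0.113.0/24 1;
}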
Don't forget to reload nginx:
/etc/init.d/nginx reload
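If you want to verify that the limit actually kicks in, you can temporarily add your own IP to the geo block, reload, and fire off a burst of requests. A rough sketch, assuming ab from apache2-utils is installed and your vhost answers on http://www.example.com/:

# check the configuration for syntax errors before reloading
nginx -t

# fire 300 requests, 50 in parallel - well above rate=2r/s
ab -n 300 -c 50 http://www.example.com/

# nginx logs delayed/rejected requests to the error log
tail -n 20 /var/log/nginx/error.log

Don't forget to remove your own IP from the geo block (and reload again) when you're done testing.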
4 Links
- nginx: http://nginx.org/
- nginx Wiki: http://wiki.nginx.org/