HowtoForge Forums | HowtoForge - Linux Howtos and Tutorials


blinky 10th November 2012 15:31

blocking bots
 
What's the best way to block bots from searching your website?

I have created a robots.txt file which looks like this:
Code:

User-agent: *
Disallow: /
Disallow: /cgi-bin/

I have included the following in my index.html file:
Code:

<meta name="robots" content="NOINDEX, NOFOLLOW">

And I have also included an .htaccess file in my root which looks like this:
Code:

SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
Order Deny,Allow
Deny from env=bad_bot

Yet I'm still seeing entries in Apache's access.log:
Code:

178.154.164.251 - - [10/Nov/2012:04:33:14 -0500] "GET /robots.txt HTTP/1.1" 200 324 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
178.154.164.251 - - [10/Nov/2012:04:33:14 -0500] "GET /phpbb/search.php?search_id=active_topics&sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3735 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
178.154.164.251 - - [10/Nov/2012:04:33:17 -0500] "GET /phpbb/ucp.php?mode=login&sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3513 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
66.249.76.173 - - [10/Nov/2012:06:05:11 -0500] "GET /robots.txt HTTP/1.1" 200 368 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
178.154.164.251 - - [10/Nov/2012:06:32:14 -0500] "GET /phpbb/index.php?sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3908 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
123.125.71.74 - - [10/Nov/2012:06:35:02 -0500] "GET /robots.txt HTTP/1.1" 200 331 "-" "Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2"

I have even included the IP address 178.154.164.251 in the inbound filter list on my router. (The fact that I still see that address listed in my Apache logs suggests, at least to me, that Yandex isn't coming from that address.)


Thoughts anyone?
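
A possible reason the .htaccess rule above has no effect: the pattern "^Yandex*" is anchored to the start of the User-Agent header, while the logged header starts with "Mozilla/5.0". The sketch below only imitates that match in Python; it assumes Python's re module behaves like Apache's regex engine for a pattern this simple, and the unanchored pattern is a hypothetical alternative, not something confirmed in the thread.

```python
import re

# User-Agent string copied from the access.log excerpt in the post above.
YANDEX_UA = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

# Pattern from the .htaccess rule; re.IGNORECASE mirrors SetEnvIfNoCase.
ANCHORED = re.compile(r"^Yandex*", re.IGNORECASE)

# Hypothetical alternative: the same word without the leading ^ anchor.
UNANCHORED = re.compile(r"Yandex", re.IGNORECASE)


def is_flagged(pattern, user_agent):
    """Return True if the pattern is found anywhere in the header value."""
    return pattern.search(user_agent) is not None


print(is_flagged(ANCHORED, YANDEX_UA))    # False: the header starts with "Mozilla"
print(is_flagged(UNANCHORED, YANDEX_UA))  # True: "Yandex" appears inside "YandexBot"
```

If that assumption holds, bad_bot is never set for these requests, so the Deny line has nothing to act on.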

falko 11th November 2012 13:28

Try
Code:

User-agent: Yandex
Disallow: /

in your robots.txt. See http://help.yandex.com/webmaster/?id=1113851

blinky 11th November 2012 20:05

Quote:

Originally Posted by falko (Post 288141)
Try
Code:

User-agent: Yandex
Disallow: /

in your robots.txt. See http://help.yandex.com/webmaster/?id=1113851

I had tried that, but it didn't seem to make any difference. Adding a series of IP address blocks (a bit overboard) seems to have worked.

The:
Code:

User-agent: *
Disallow: /

seems to have stopped the vast majority of activity I'm not interested in having.

I believe I had a more serious problem, though, which I'll address in a separate thread.

Sheesh, I'm getting more traffic than a free bordello beside a Naval dock!

falko 12th November 2012 14:13

Quote:

Originally Posted by blinky (Post 288149)
The:
Code:

User-agent: *
Disallow: /

seems to have stopped the vast majority of activity I'm not interested in having.

You should be aware that this will also block the Google and BING bots...
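
To illustrate the difference between the two rule sets quoted in this thread, here is a minimal sketch using Python's standard urllib.robotparser. It shows how that one parser interprets the rules; real crawlers may match user-agent names differently, and the can_crawl helper and the bot names passed to it are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets discussed in the thread.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_YANDEX = "User-agent: Yandex\nDisallow: /\n"


def can_crawl(robots_txt, user_agent, path="/"):
    """Ask Python's robots.txt parser whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


for bot in ("YandexBot", "Googlebot", "bingbot"):
    # Columns: bot name, allowed under "User-agent: *", allowed under "User-agent: Yandex"
    print(bot, can_crawl(BLOCK_ALL, bot), can_crawl(BLOCK_YANDEX, bot))
```

Under this parser the wildcard rule refuses all three bots, while the Yandex-only rule refuses YandexBot and leaves the other two allowed.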

blinky 12th November 2012 16:29

Quote:

Originally Posted by falko (Post 288176)
You should be aware that this will also block the Google and BING bots...

Yes, I'm aware that it should stop ALL bots. And it does if they observe the rules in robots.txt. But if they don't, they'll keep knocking away with the zeal of a vacuum cleaner salesman pounding on my front door.

Quick question you might know the answer to...

When I see an entry in my Apache logfile that says "GET /robots.txt", does that mean the robot has tried to do a search and has then received my robots.txt file? I guess what I'm really asking is: must a robot search at least once from 999.999.999.999 to receive the robots.txt file, after which searches from 999.999.999.999 will stop?

webguyz 12th November 2012 17:52

Quote:

Originally Posted by blinky (Post 288197)
....they'll keep knocking away with the zeal of a vacuum cleaner salesman pounding on my front door...


Do vacuum cleaner salesmen who make house calls still exist? :D

falko 13th November 2012 17:49

Quote:

Originally Posted by blinky (Post 288197)
When I see an entry in my Apache logfile that says "GET /robots.txt", does that mean the robot has tried to do a search and has then received my robots.txt file?

Bots explicitly request that file to learn what URLs they are allowed to index.
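
As a rough sketch of that order of operations, the snippet below models a rule-abiding bot: it requests /robots.txt first, then checks every other URL against the rules before requesting it. Nothing is fetched over the network; the crawl function and the URL list are hypothetical, and real crawlers may cache robots.txt or simply ignore it.

```python
from urllib.robotparser import RobotFileParser


def crawl(robots_txt, user_agent, urls):
    """Return the requests a rule-abiding bot would make, in order."""
    # The rules file is requested first; this is the "GET /robots.txt" log entry.
    requests = ["/robots.txt"]
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for url in urls:
        # Only URLs the rules allow are requested afterwards.
        if parser.can_fetch(user_agent, url):
            requests.append(url)
    return requests


PAGES = ["/phpbb/index.php", "/phpbb/search.php"]

print(crawl("User-agent: *\nDisallow: /\n", "YandexBot", PAGES))
# ['/robots.txt']
print(crawl("", "YandexBot", PAGES))
# ['/robots.txt', '/phpbb/index.php', '/phpbb/search.php']
```

So a "GET /robots.txt" entry on its own means the bot asked for the rules; whether further requests follow depends on whether the bot honours them.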


All times are GMT +2.