blocking bots

Discussion in 'Server Operation' started by blinky, Nov 10, 2012.

  1. blinky

    blinky Member

    What's the best way to block bots from searching your website?

    I have created a robots.txt file which looks like this:
    Code:
    User-agent: *
    Disallow: /
    Disallow: /cgi-bin/
    
    I have included the following in my index.html file:
    Code:
    <meta name="robots" content="NOINDEX, NOFOLLOW">
    
    And I have also included an .htaccess file in my root which looks like this:
    Code:
    SetEnvIfNoCase User-Agent "^Yandex*" bad_bot
    Order Deny,Allow
    Deny from env=bad_bot
    
    Yet I'm still seeing entries in Apache's access.log:
    Code:
    178.154.164.251 - - [10/Nov/2012:04:33:14 -0500] "GET /robots.txt HTTP/1.1" 200 324 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
    178.154.164.251 - - [10/Nov/2012:04:33:14 -0500] "GET /phpbb/search.php?search_id=active_topics&sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3735 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
    178.154.164.251 - - [10/Nov/2012:04:33:17 -0500] "GET /phpbb/ucp.php?mode=login&sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3513 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
    66.249.76.173 - - [10/Nov/2012:06:05:11 -0500] "GET /robots.txt HTTP/1.1" 200 368 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    178.154.164.251 - - [10/Nov/2012:06:32:14 -0500] "GET /phpbb/index.php?sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3908 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
    123.125.71.74 - - [10/Nov/2012:06:35:02 -0500] "GET /robots.txt HTTP/1.1" 200 331 "-" "Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2"
    
    I have even included the IP address 178.154.164.251 in the inbound filter list on my router. (The fact that I still see that address listed in my Apache logs suggests, at least to me, that Yandex isn't coming from that address.)


    Thoughts anyone?
     
    Last edited: Nov 10, 2012
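
A likely reason the .htaccess rule never fires is the pattern itself: "^" anchors the match to the start of the User-Agent header, but the logged agent starts with "Mozilla/5.0" and only mentions YandexBot later. The sketch below checks this with Python's re module standing in for Apache's regular-expression engine (an assumption, though both treat "^" and "*" the same way here). The names ANCHORED, UNANCHORED and is_flagged are made up for this illustration.

```python
import re

# User-Agent string copied from the access.log excerpt in the post above.
LOGGED_UA = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

# Pattern from the .htaccess rule. "^" anchors the match to the start of the
# header and "x*" means "zero or more x", so this only matches agents whose
# string begins with "Yande". IGNORECASE mirrors SetEnvIfNoCase.
ANCHORED = re.compile(r"^Yandex*", re.IGNORECASE)

# Hypothetical replacement: the same token without the anchor or the "*".
UNANCHORED = re.compile(r"Yandex", re.IGNORECASE)


def is_flagged(pattern, user_agent):
    """Return True if the pattern is found in the User-Agent string."""
    return pattern.search(user_agent) is not None


if __name__ == "__main__":
    print("anchored:  ", is_flagged(ANCHORED, LOGGED_UA))
    print("unanchored:", is_flagged(UNANCHORED, LOGGED_UA))
```

If the same holds in Apache, changing the first line of the rule to SetEnvIfNoCase User-Agent "Yandex" bad_bot should be enough. Note that denied requests still show up in access.log, only with status 403 instead of 200.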
  2. falko

    falko Super Moderator

  3. blinky

    blinky Member

    I had tried that but it didn't seem to make any difference. Adding a series of IP address blocks (a bit overboard) seems to have worked.

    The:
    Code:
    User-agent: *
    Disallow: /
    
    seems to have stopped the vast majority of activity I'm not interested in having.

    I believe I had a more serious problem, though, which I'll address in a separate thread.

    Sheesh, I'm getting more traffic than a free bordello beside a Naval dock!
     
  4. falko

    falko Super Moderator

    You should be aware that this will also block the Google and Bing bots...
     
  5. blinky

    blinky Member

    Yes, I'm aware that it should stop ALL bots. And it does if they observe the rules in robots.txt. But if they don't, they'll keep knocking away with the zeal of a vacuum cleaner salesman pounding on my front door.

    Quick question you might know the answer to...

    When I see an entry in my Apache log file that says "GET /robots.txt", does that mean the robot has tried to do a search and then has received my robots.txt file? I guess what I'm really asking is: must a robot search at least once from 999.999.999.999 to receive the robots.txt file, after which searches from 999.999.999.999 will stop?
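
On the point about bots that ignore robots.txt: one rough way to spot them is to scan access.log for clients that fetched /robots.txt and then kept requesting other URLs. The sketch below does that for lines in the same format as the excerpt in the first post. The function name crawled_after_robots is made up for this illustration, and grouping by IP address assumes each address is a single client, which is not always true.

```python
import re
from collections import defaultdict

# Matches lines shaped like the access.log excerpt in the first post.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

# Three lines copied from the excerpt in the first post.
SAMPLE = [
    '178.154.164.251 - - [10/Nov/2012:04:33:14 -0500] "GET /robots.txt HTTP/1.1" 200 324 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"',
    '178.154.164.251 - - [10/Nov/2012:04:33:14 -0500] "GET /phpbb/search.php?search_id=active_topics&sid=3a033d745efebc4ace615dd64e8f63f7 HTTP/1.1" 200 3735 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"',
    '66.249.76.173 - - [10/Nov/2012:06:05:11 -0500] "GET /robots.txt HTTP/1.1" 200 368 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]


def crawled_after_robots(lines):
    """Map each client IP to the paths it requested after fetching /robots.txt."""
    seen_robots = set()
    later_paths = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected format
        ip, path = match["ip"], match["path"]
        if path == "/robots.txt":
            seen_robots.add(ip)
        elif ip in seen_robots:
            later_paths[ip].append(path)
    return dict(later_paths)


if __name__ == "__main__":
    for ip, paths in crawled_after_robots(SAMPLE).items():
        print(ip, len(paths), "request(s) after robots.txt")
```

With the sample lines, only 178.154.164.251 is reported, since 66.249.76.173 fetched robots.txt and nothing else.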
     
  6. webguyz

    webguyz HowtoForge Supporter

    Do vacuum cleaner salesmen who make house calls still exist? :D
     
  7. falko

    falko Super Moderator

    Bots explicitly request that file to learn what URLs they are allowed to index.
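
To illustrate what a well-behaved bot does with the file once it has it, here is a minimal sketch using Python's standard urllib.robotparser. It parses the rules quoted in the first post and asks whether a given path may be fetched. The helper names and the /cgi-bin/test.cgi path are made up for this illustration, and real crawlers may interpret the rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted in the first post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
Disallow: /cgi-bin/
"""


def build_parser(robots_text):
    """Parse robots.txt content without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, user_agent, path):
    """Return True if the parsed rules let this agent request the path."""
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    # The second path is a hypothetical example, not taken from the logs.
    for path in ("/phpbb/index.php", "/cgi-bin/test.cgi"):
        print(path, may_fetch(rules, "YandexBot", path))
```

A compliant bot fetches robots.txt first and then skips every URL the check rejects, so no earlier "search" is needed to trigger the download. Nothing forces a bot to run this check, which is why the file only stops bots that choose to honour it.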
     
