Jul 20, 2008

Htaccess To Block Bots - How do you block bots and spiders?


jlhaslip
I am curious about how to block bots and/or spiders from accessing my subdomain.
I believe that I may have one using a terrible amount of bandwidth.
Is this possible, and how do I stop it?

guangdian
Bots would be
QUOTE
using a terrible amount of Bandwidth
???
That's the first time I've heard of that.
Is it really so? How did you find out that they waste so much of your bandwidth?
By the way, I don't think banning the bots is a clever thing to do. If Google's search results don't contain your site, no new visitors will find it.

jlhaslip
The site I have uploaded is not yet public. I also host a forum which is private. So far this month, about one third of my bandwidth is showing as being used by enquiries to/from the USA.
Nothing against the Americans, of course, but the membership of the forum using my hosting account here at the trap consists entirely of Canadian members. And since the only place that I have announced the site is here at trap17, it really is not public.
The 'wasted/lost' bandwidth this month alone is over 180 megs out of my allowed 512 megs, so I am trying to put a stop to it. I suspect someone is hijacking my bandwidth. I don't know how or why, but it would be nice to put a stop to it...

Any ideas?
(I've placed IP bans on a couple of them. I'll check again tomorrow to see if they continue.)
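
Before banning addresses, it can help to see which clients actually account for the traffic. The following is only a minimal sketch in Python, assuming the host exposes an access log in the Apache common/combined format; the function name bytes_per_ip and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the start of an Apache common/combined log line:
# client IP, identity, user, [timestamp], "request", status, bytes sent.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} (\d+|-)')


def bytes_per_ip(lines):
    """Total the bytes sent to each client IP (hypothetical helper)."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ip, size = match.groups()
        if size != "-":  # "-" means no response body was sent
            totals[ip] += int(size)
    return totals


# Made-up sample lines standing in for a real access log.
sample = [
    '203.0.113.5 - - [20/Jul/2008:10:00:00 +0000] "GET /forum/ HTTP/1.1" 200 5120',
    '203.0.113.5 - - [20/Jul/2008:10:00:05 +0000] "GET /forum/a.png HTTP/1.1" 200 2048',
    '198.51.100.7 - - [20/Jul/2008:10:01:00 +0000] "GET / HTTP/1.1" 304 -',
]

totals = bytes_per_ip(sample)
for ip, total in totals.most_common():
    print(ip, total)
```

On a real log, the heaviest addresses at the top of this list are the ones worth looking up before deciding whether they are bots or legitimate members.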

Dooga
Perhaps you're just editing too much? lol! It happens to me...

Lyon2
If you want my advice, don't use htaccess; instead, use the well-known "robots.txt".

If you have a problem building robots.txt, no problem, there's a great tool for that, named:

RoboGen (free edition)
http://www.rietta.com/downloads/robogen_le_setup.exe

This is a fantastic little tool that knows all the most popular spiders, bots, etc., so you can allow or deny them access to the pages of your website. It's pretty easy to learn and work with, but if you run into any problems, just email me privately.

jlhaslip
Well, as it turns out, this was all "much ado about nothing".

I banned the suspected IP addresses, used the Robot link above, and then one of the forum users couldn't sign in. It turns out her satellite connection gets redirected through a US-based facility, so the bandwidth usage was legitimate.

Just another lesson to be had here.

Dragonfly
Well, this really interests me too. Google doesn't list my site in any of its search results. However, Googlebot is present on this particular site almost twenty-four hours a day, and I have noticed that about 100 MB of bandwidth has been wasted this way.

I've read the input from some members: the best way to control the bots is with robots rules. We can even implement this from a meta tag without using any other method. A visit once a month would do. I need to see the exact code to insert in the meta tag.

I hope more experts will add their input here.
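
For the meta tag route, the usual form is a robots meta element in the page head, such as `<meta name="robots" content="noindex, nofollow">`. The sketch below, in Python, reads that tag from a made-up page roughly the way a well-behaved crawler might. The page markup and the class name RobotsMetaParser are assumptions for illustration, and only crawlers that choose to honor the tag will act on it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots"> tags (illustrative)."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives += [
                part.strip().lower() for part in content.split(",") if part.strip()
            ]


# Made-up page: the meta line is what would be pasted into a real <head>.
page = """<html><head>
<title>Private forum</title>
<meta name="robots" content="noindex, nofollow">
</head><body></body></html>"""

meta = RobotsMetaParser()
meta.feed(page)
print(meta.directives)
```

The meta tag works page by page, so it has to be present in every page that should stay out of the index.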

Lyon2
Dragonfly, yes, you don't have to create a robots.txt file and put it in the root directory of your website. But if you choose to use the meta tag instead, you won't have many options for protecting your website's directories, and believe me, your site will be extremely vulnerable to aggressive spiders and bots.

There are spiders made by skilled hackers that scan entire websites for vulnerable stuff, and if you don't have a well-configured robots.txt file, one day you'll end up searching Google for some keywords and your website's passwords will turn up in the results, as happens to many people.

But forget about these particular spiders and bots made by skilled hackers, and let's talk about the Google spider, or bot.

Perhaps you have absolutely no idea of the REAL POWER OF GOOGLE!

Google can find passwords, usernames, CGI black holes, sensitive data, vulnerable data, database usernames and passwords, and millions of private things that most web designers don't even imagine. Why? Because they don't care about security and don't create a robots.txt file and/or htaccess files (for Linux servers), and because the search engines only speak one language, and that is robots.txt (allow and/or disallow access) and htaccess (also allow and/or disallow access).

If you really want to know all the best techniques for finding this secret stuff with Google, check the website below, whose main goal is to help web designers protect their websites from "Google hackers" or "Google hacking".

You have to register at:
http://johnny.ihackstuff.com

Then visit the Google hacking database of queries at:
http://johnny.ihackstuff.com/index.php?module=prodreviews

Now, getting back to robots.txt, a normal robots.txt looks like this:


Disallowing a directory for all spiders:

# Your website title -- http://yourwebsitedomainname.domain
# Robot Exclusion File -- robots.txt
# Author: your name
# Last Updated: The date

User-agent: *
Disallow: /dd


This robots.txt code tells every search engine not to index the "dd" directory, but it is just an example.
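
To see what the rule above actually covers, it can be checked with the robots.txt parser in Python's standard library. This is a minimal sketch: the crawler name "ExampleBot" and the test paths are made up, and the helper allowed is only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, without the comment header.
rules = """\
User-agent: *
Disallow: /dd
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if a compliant crawler may fetch the path (illustrative)."""
    return parser.can_fetch(agent, path)


for path in ["/index.html", "/dd/secret.html", "/ddextra.html"]:
    print(path, allowed(path))
```

Because "Disallow: /dd" is matched as a path prefix, it covers "/ddextra.html" as well as everything under "/dd/". Writing "Disallow: /dd/" limits the rule to the directory. Either way, the file is only a request that compliant crawlers follow; it does not block access by itself.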

This code was also created with RoboGen LE:

RoboGen (free edition)
http://www.rietta.com/downloads/robogen_le_setup.exe

If you want to create and edit htaccess files, there's a very useful, free tool that is pretty easy to work with:

HTAccessible
http://www.tlhouse.co.uk/HTAccessible.shtml

(It's constantly being updated with more easy one-click functions to protect your directories and files with htaccess files. Remember that htaccess files are for Linux servers.)

One more thing: always create the robots.txt file and put it in the root directory of your website, with the configuration for all of your private and public directories. Alternatively, put a robots.txt file in the directory that you want to protect, containing the configuration for that directory only, which I don't recommend.

Also, if you choose to configure all your directories in one robots.txt file, remember to include the code below for the most important and common directories of any website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /scripts/
Disallow: /your private directory 1/
Disallow: /your private directory 2/
Disallow: /your private directory 3 and so on/

I'm sure you understand that if you don't tell the search engines not to index the cgi-bin, images and scripts directories of your website (using the robots.txt file), your sensitive website data will end up in the results of other people's searches on Google, Yahoo and/or AltaVista, which are the most powerful search engines on the web.

So, to protect your cgi-bin directory, which is one of the main targets for hackers (website defacers, script kiddies, crackers), you always have to include the disallow line for this directory.

The images directory is optional. If you don't want the Google Images spider to index your images because you have put a lot of work into them, I advise you to disallow access to it too.

The scripts directory is also a main target, especially if you have PHP and CGI scripts, so if you want to protect your work and your script configuration, disallow access to this one as well.
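
The multi-directory file described above can also be generated and checked from a list of directory names. A minimal sketch in Python, where build_robots_txt and the crawler name "ExampleBot" are hypothetical names used only for illustration:

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(directories, agent="*"):
    """Build robots.txt text that disallows each directory (illustrative)."""
    lines = [f"User-agent: {agent}"]
    for directory in directories:
        # Normalize "cgi-bin", "/cgi-bin" or "cgi-bin/" to "/cgi-bin/".
        path = "/" + directory.strip("/") + "/"
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(["cgi-bin", "images", "scripts"])
print(robots_txt)

# Check the generated rules the way a compliant crawler would read them.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("ExampleBot", "/cgi-bin/form.pl"))
print(parser.can_fetch("ExampleBot", "/index.html"))
```

Two caveats apply to any file like this. Compliant crawlers request only the robots.txt at the site root, and the file itself is public, so it also tells anyone who reads it where the private directories are. Directories that must stay private still need real access control, for example password protection through htaccess.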

And there are many more sensitive directories that you should, no, you must protect, which could be, for example:

- email;
- newsletter;
- mailing lists;
- spreadsheets (Excel data);
- and much more.

These directories depend on what your website contains and what it has to offer. For example, if you sell templates, ebooks or video tutorials, you'll also have to protect those directories, or you'll end up giving all of your work to website scanners, which, by the way, happens all the time to web design beginners with no experience.


One more extremely important thing: if you usually work with CGI or Perl, especially CGI, which is a little bit different from Perl, be very careful with the scripts that you use on your websites, because there are tons of high-quality programs for scanning CGI websites and the CGI scripts in them, for example:

Cgi Scan
http://217.125.24.22/h/cgiscan.zip

Run the above tool on your website to see if it has CGI black holes, which are very much appreciated by "website defacers, script kiddies and crackers".

To finish, if you want to learn much more about robots, spiders, search engines and especially the Google search engine, tell me and I'll send you some high-quality ebooks about it.

There's so much to tell and not much time to actually tell it!
