> Htaccess To Block Bots, How do you block bots and spiders?
jlhaslip
post Oct 20 2005, 03:18 AM
Post #1


A computer once beat me at chess, but it was no match for me at kick boxing.

Group: [MODERATOR]
Posts: 3,874
Joined: 24-July 05
From: In Trouble Again... still?
Member No.: 9,787
Spam Patrol



I am curious about how to block bots and/or spiders from accessing my subdomain.
I believe that I may have one using a terrible amount of bandwidth.
Is this possible, and how do I stop it?
 
guangdian
post Oct 21 2005, 01:02 AM
Post #2


Trap Grand Marshal Member
***********

Group: Members
Posts: 1,183
Joined: 24-September 04
Member No.: 1,245



Bots would be
QUOTE
using a terrible amount of Bandwidth
???
That's the first time I've heard of that.
Is it really so? How did you work out that they are wasting so much of your bandwidth?
By the way, I don't think banning the bots is a clever thing to do. If your site doesn't appear in Google's search results, no new visitors will find it.
 
jlhaslip
post Oct 21 2005, 02:22 AM
Post #3





The site I have uploaded is not yet public. I also host a forum which is private. So far this month, about one third of my bandwidth is showing as being used by enquiries to/from the USA.
Nothing against the Americans, of course, but the membership of the forum using my hosting account here at the Trap consists entirely of Canadian members. And since the only place that I have announced the site is here at Trap17, it really is not public.
The 'wasted/lost' bandwidth this month alone is over 180 megs out of my allowed 512 megs, so I am trying to put a stop to it. I suspect someone is hijacking my bandwidth. I don't know how or why, but it would be nice to put a stop to it...

Any Ideas???
(I've placed IP bans on a couple of them. I'll check again tomorrow to see if they continue.)
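
One way to check where the bandwidth is going, assuming the host exposes a raw access log in the Apache common or combined format (the thread does not say whether it does), is to total the bytes served per client IP. A minimal Python sketch; the sample log lines and the name bytes_per_ip are made up for illustration:

```python
import re
from collections import Counter

# Matches the start of an Apache common/combined log line:
# client IP, two ignored fields, [timestamp], "request", status, bytes sent.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} (\d+|-)')

# Hypothetical sample data; real lines would come from the server's access log.
SAMPLE_LOG = [
    '203.0.113.5 - - [20/Oct/2005:03:18:00 +0000] "GET /forum/index.php HTTP/1.1" 200 51200',
    '203.0.113.5 - - [20/Oct/2005:03:18:02 +0000] "GET /forum/style.css HTTP/1.1" 200 1024',
    '198.51.100.7 - - [20/Oct/2005:03:19:10 +0000] "GET /robots.txt HTTP/1.0" 404 -',
    '198.51.100.7 - - [20/Oct/2005:03:19:11 +0000] "GET / HTTP/1.0" 200 2048',
]


def bytes_per_ip(lines):
    """Sum the bytes sent to each client IP; lines that do not parse are skipped."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        ip, size = match.groups()
        if size != "-":  # "-" means no response body was sent
            totals[ip] += int(size)
    return totals


if __name__ == "__main__":
    # Heaviest consumers first.
    for ip, total in bytes_per_ip(SAMPLE_LOG).most_common():
        print(ip, total)
```

The heaviest addresses can then be looked up before deciding whether an IP ban is justified.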
 
Dooga
post Oct 21 2005, 03:19 AM
Post #4


Moderator

Group: [MODERATOR]
Posts: 1,327
Joined: 26-December 04
From: Canada
Member No.: 2,940



Perhaps you're just editing too much? lol! It happens to me...
 
Lyon2
post Oct 21 2005, 04:27 PM
Post #5


The Ethical Hacker
***********

Group: [HOSTED]
Posts: 1,144
Joined: 27-May 05
From: Portugal (Europe)
Member No.: 7,566



If you want my advice, don't use htaccess; instead, use the well-known "robots.txt".

If you have a problem building robots.txt, no problem, there's a great tool for that, named:

RoboGen (free edition)
http://www.rietta.com/downloads/robogen_le_setup.exe

This is a fantastic little tool that lists all the most popular spiders, bots, etc., so you can allow or deny them access to the pages of your website. It's pretty easy to learn and work with, but if you encounter any problems, just send me a private email.
 
jlhaslip
post Oct 24 2005, 06:32 PM
Post #6





Well, as it turns out, this was all "much ado about nothing".

I banned the suspected IP addresses, used the Robot link above, and then one of the forum users couldn't sign in. It turns out her satellite connection gets redirected through a US-based facility, so the bandwidth usage was legitimate.

Just another lesson to be had here.
 
Dragonfly
post Oct 24 2005, 08:59 PM
Post #7


Privileged Member
*********

Group: Members
Posts: 702
Joined: 17-February 05
Member No.: 3,817



Well, this really interests me too. Google doesn't list my site in any of its search results. However, Googlebot is on this particular site almost twenty-four hours a day. And I also noticed that about 100 MB of bandwidth has been wasted this way.

From what other members have posted, the best way to control the bots is the robots mechanism. We can even implement this with a meta tag, without using any other method. A visit once a month would be enough. I need to see the exact code to put in the meta tag.

I hope more experts will add their input here.
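
For the meta tag asked about here, the usual form is a robots meta tag placed in the page's head. A small Python sketch that builds it; the function name robots_meta_tag is made up for illustration, each search engine decides how it honors the tag, and the tag controls indexing and link following, not how often a spider visits:

```python
def robots_meta_tag(index: bool = False, follow: bool = False) -> str:
    """Build a robots meta tag; the default asks spiders not to index or follow."""
    content = ", ".join([
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ])
    return f'<meta name="robots" content="{content}">'


if __name__ == "__main__":
    # Paste the printed tag inside the <head> of each page it should apply to.
    print(robots_meta_tag())
```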
 
Lyon2
post Oct 24 2005, 11:43 PM
Post #8





Dragonfly, yes, you don't have to create a robots.txt file and put it in the root directory of your website. But if you choose to use the meta tag instead, you won't have many options for protecting your website's directories, and believe me, your site will be extremely vulnerable to aggressive spiders and bots.

There are spiders made by skilled hackers to scan entire websites for vulnerable stuff, and if you don't have a well-configured robots.txt file, one day you'll search Google for some keywords and your website's passwords will turn up in the results, as happens to many people.

But forget about these particular spiders and bots made by skilled hackers, and let's talk about the Google spider, or bot.

Perhaps you have absolutely no idea of the REAL POWER OF GOOGLE!

Google can find passwords, usernames, CGI black holes, sensitive data, vulnerable data, database usernames and passwords, and millions of private things that most web designers don't even imagine. Why? Because they don't care about security and don't create the robots.txt file and/or the htaccess files (for Apache servers, typically on Linux). The search engines only speak one language, and that is robots.txt (which asks well-behaved spiders to stay out) and htaccess (which actually allows or denies access on the server).

If you really want to know all the best techniques for finding this secret stuff with Google, check the website below, whose main goal is to help web designers protect their websites from "Google hackers" or "Google hacking".

You have to register at:
http://johnny.ihackstuff.com

Then visit the Google hacking database of queries at:
http://johnny.ihackstuff.com/index.php?module=prodreviews

Now, getting back to robots.txt, a normal robots.txt looks like this:


Disallowing one directory for all spiders:

# Your website title -- http://yourwebsitedomainname.domain
# Robot Exclusion File -- robots.txt
# Author: your name
# Last Updated: The date

User-agent: *
Disallow: /dd


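This robots.txt code tells every search engine not to index the "dd" directory, but this is just an example. Strictly speaking, "Disallow: /dd" matches every path that begins with /dd; write "Disallow: /dd/" to match only the directory.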
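
To check what a rule like this actually covers, Python's standard urllib.robotparser can evaluate it offline. A minimal sketch; example.com and the helper name is_allowed are placeholders, and it only shows how a well-behaved spider would read the file, since robots.txt is advisory and a hostile bot can simply ignore it:

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, without the comment header.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dd
"""


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if a spider that obeys robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/dd/page.html", "/dd.html"):
        url = "http://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, url))
```

Because rules are matched as path prefixes, /dd.html is treated the same way as anything inside /dd/.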

Also, this code was created with RoboGen LE:

RoboGen (free edition)
http://www.rietta.com/downloads/robogen_le_setup.exe

If you want to create and edit htaccess files, there's a very useful free tool that is pretty easy to work with:

HTAccessible
http://www.tlhouse.co.uk/HTAccessible.shtml

(It's constantly being updated with more easy one-click functions to protect your directories and files with htaccess files. Remember that htaccess files are for Apache servers, which usually means Linux hosting.)
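
For the IP bans and bot blocking this thread started with, the rules that go into an htaccess file can be sketched as below. This assumes Apache with the older Order/Allow/Deny access directives and mod_setenvif (Apache 2.4 prefers Require and keeps the old directives in mod_access_compat); the IP address, the bot name ExampleBot and the function name are made up for illustration:

```python
def htaccess_block_rules(banned_ips, banned_agents):
    """Return .htaccess text that denies the given IPs and User-Agent patterns."""
    lines = []
    for agent in banned_agents:
        # Apache treats the quoted value as a regular expression and, with
        # SetEnvIfNoCase, matches it case-insensitively against User-Agent.
        lines.append(f'SetEnvIfNoCase User-Agent "{agent}" bad_bot')
    # Allow everyone first, then apply the Deny lines as exceptions.
    lines.append("Order Allow,Deny")
    lines.append("Allow from all")
    for ip in banned_ips:
        lines.append(f"Deny from {ip}")
    if banned_agents:
        lines.append("Deny from env=bad_bot")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values; substitute the addresses and bots seen in your own logs.
    print(htaccess_block_rules(["203.0.113.5"], ["ExampleBot"]), end="")
```

The returned text would be saved as .htaccess in the directory to protect. Unlike robots.txt, these rules are enforced by the server rather than left to the spider's good manners.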

One more thing: always create the robots.txt file and put it in the root directory of your website, with the configuration for all of your private and public directories. Putting a robots.txt file inside the directory that you want to protect, with the configuration for that directory only, is something I don't recommend, because spiders only look for robots.txt in the root of the site.

Also, if you configure all your directories in one robots.txt file, remember to include the code below for the most important and common directories of any website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /scripts/
Disallow: /your-private-directory-1/
Disallow: /your-private-directory-2/
Disallow: /your-private-directory-3/
# ...and so on

I'm sure you understand that if you don't tell the search engines not to index the cgi-bin, images and scripts directories of your website (using the robots.txt file), your sensitive website data can end up in the results of other people's searches on Google, Yahoo and/or AltaVista, which are the most powerful search engines on the web.

So, to keep your cgi-bin directory, which is one of the main targets for hackers (website defacers, script kiddies, crackers), out of the search results, always include the Disallow line for this directory.

The images directory is optional. If you don't want the Google Images spider to index your images because you have put a lot of work into them, I advise you to disallow access to it too.

The scripts directory is also a main target, especially if you have PHP and CGI scripts, so if you want to protect your work and your scripts' configuration, disallow access to this one too.

And there are many more sensitive directories that you should, no, you must protect, which could be, for example:

- email;
- newsletter;
- mailing lists;
- spreadsheets (excel data);
- and much more.

These directories depend on what your website has and what it has to offer. For example, if you sell templates, ebooks or video tutorials, you'll also have to protect those directories, or you'll end up giving all of your work to website scanners, which, by the way, happens all the time to web design beginners with no experience.


One more extremely important thing: if you usually work with CGI or Perl, especially CGI, which is a little bit different from Perl, be very careful with the scripts that you use on your websites, because there are tons of high-quality programs for scanning CGI websites and the CGI scripts on them, for example:

Cgi Scan
http://217.125.24.22/h/cgiscan.zip

Run the above tool against your website to see if it has CGI black holes, which are very much appreciated by "website defacers, script kiddies and crackers".

To finish, if you want to learn much more about robots, spiders, search engines and especially the Google search engine, tell me and I'll send you some high-quality ebooks about it.

There's so much to tell and not much time to actually tell it!