Oct 20 2005, 03:18 AM
Post
#1
|
|
|
A computer once beat me at chess, but it was no match for me at kick boxing. Group: [MODERATOR] Posts: 3,874 Joined: 24-July 05 From: In Trouble Again... still? Member No.: 9,787 |
I am curious about how to block bots and/or spiders from accessing my subdomain.
I believe that I may have one using a terrible amount of bandwidth. Is this possible, and how do I stop it? |
|
|
|
Oct 21 2005, 01:02 AM
Post
#2
|
|
|
Trap Grand Marshal Member Group: Members Posts: 1,183 Joined: 24-September 04 Member No.: 1,245 |
bots would
QUOTE: "using a terrible amount of Bandwidth"
That's the first time I've heard of that. Is it really the case? How did you get the result that bots are wasting so much of your bandwidth? By the way, I don't think banning the bots is a clever thing to do. If Google's search results don't contain your site, no new guests will come to view it. |
|
|
|
Oct 21 2005, 02:22 AM
Post
#3
|
|
|
A computer once beat me at chess, but it was no match for me at kick boxing. Group: [MODERATOR] Posts: 3,874 Joined: 24-July 05 From: In Trouble Again... still? Member No.: 9,787 |
The site I have uploaded is not yet public. I also host a forum which is private. So far this month, about one third of my bandwidth is showing as being used by enquiries to/from the USA.
Nothing against the Americans, of course, but the membership of the forum using my hosting account here at the trap consists entirely of Canadian members. And since the only place that I have announced the site is here at the trap17, it really is not public. The 'wasted/lost' bandwidth this month alone is over 180 megs out of my allowed 512 megs, so I am trying to put a stop to it. I suspect someone is hijacking my bandwidth. I don't know how or why, but it would be nice to stop it. Any ideas? (I've placed IP bans on a couple of them. I'll check again tomorrow to see if they continue.) |
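Before banning anything, it can help to see which addresses are actually using the bandwidth. The following is only a rough sketch: it assumes the host exposes raw access logs in the usual Apache common/combined layout (IP first, response size after the status code), which may not be true for every hosting account. The function name `bytes_per_ip` and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed log layout (Apache common/combined format):
# IP ident user [timestamp] "request" status bytes ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} (\d+|-)')


def bytes_per_ip(lines):
    """Sum the response bytes served to each client IP."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        ip, size = match.groups()
        if size != "-":  # "-" means no body was sent
            totals[ip] += int(size)
    return totals


# Hypothetical sample lines, using documentation-only IP ranges.
sample_log = [
    '203.0.113.5 - - [21/Oct/2005:02:00:00 +0000] "GET /forum/ HTTP/1.1" 200 5120',
    '198.51.100.7 - - [21/Oct/2005:02:00:05 +0000] "GET /index.html HTTP/1.1" 200 2048',
    '203.0.113.5 - - [21/Oct/2005:02:01:00 +0000] "GET /forum/topic1 HTTP/1.1" 200 10240',
    '198.51.100.7 - - [21/Oct/2005:02:02:00 +0000] "GET /missing HTTP/1.1" 404 -',
]

totals = bytes_per_ip(sample_log)
for ip, total in totals.most_common():
    print(ip, total)
```

With a real log you would pass the open file instead of `sample_log`. The heaviest addresses come first, so it is easier to judge whether a ban is justified or whether the traffic belongs to a legitimate member.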
|
|
|
Oct 21 2005, 03:19 AM
Post
#4
|
|
|
Moderator Group: [MODERATOR] Posts: 1,327 Joined: 26-December 04 From: Canada Member No.: 2,940 |
Perhaps you're just editing too much? lol! It happens to me...
|
|
|
|
Oct 21 2005, 04:27 PM
Post
#5
|
|
|
The Ethical Hacker Group: [HOSTED] Posts: 1,144 Joined: 27-May 05 From: Portugal (Europe) Member No.: 7,566 |
If you want my advice, don't use htaccess; instead, use the well-known "robots.txt".
If you have a problem building robots.txt, no problem, there's a great tool for that, named RoboGen (free edition): http://www.rietta.com/downloads/robogen_le_setup.exe This is a fantastic little tool that lists all the most popular spiders, bots, etc., so you can allow or deny them access to the pages of your website. It's pretty easy to learn and work with, but if you encounter any problems, just email me privately. |
|
|
|
Oct 24 2005, 06:32 PM
Post
#6
|
|
|
A computer once beat me at chess, but it was no match for me at kick boxing. Group: [MODERATOR] Posts: 3,874 Joined: 24-July 05 From: In Trouble Again... still? Member No.: 9,787 |
Well, as it turns out, this was all "much ado about nothing".
I banned the suspected IP addresses, used the RoboGen link above, and then one of the forum users couldn't sign in. It turns out her satellite connection gets redirected through a US-based facility, so the bandwidth usage was legitimate. Just another lesson learned. |
|
|
|
Oct 24 2005, 08:59 PM
Post
#7
|
|
|
Privileged Member Group: Members Posts: 702 Joined: 17-February 05 Member No.: 3,817 |
Well, this really interests me too. Google doesn't list my site in any of its search results. However, Googlebot is present on this particular site almost twenty-four hours a day, and I noticed that about 100MB of bandwidth has been wasted this way.
I've read the input from some members: the best way to control the bots is robots exclusion. We can implement this even from a meta tag without using any other method. A visit once a month would do. I need to see the particular code to insert in the meta tag. I hope more experts will add their input here. |
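For the meta tag asked about above, here is a minimal sketch. The tag itself, `<meta name="robots" content="...">` with `index`/`noindex` and `follow`/`nofollow`, is the standard robots meta tag; the helper `robots_meta_tag` is just a hypothetical convenience for producing it. Note that these directives control indexing and link following, not how often a crawler visits, so the "once a month" wish is not something this tag is known to cover.

```python
def robots_meta_tag(index=True, follow=True):
    """Build a robots meta tag for the <head> of a single page."""
    directives = [
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ]
    return '<meta name="robots" content="{}">'.format(", ".join(directives))


# Ask compliant crawlers not to index this page or follow its links.
tag = robots_meta_tag(index=False, follow=False)
print(tag)
```

The tag has to be placed in every page you want to keep out of the index, and a crawler still has to download the page to read it, so on its own it is unlikely to reduce bandwidth use.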
|
|
|
Oct 24 2005, 11:43 PM
Post
#8
|
|
|
The Ethical Hacker Group: [HOSTED] Posts: 1,144 Joined: 27-May 05 From: Portugal (Europe) Member No.: 7,566 |
Dragonfly, yes, you don't have to create a robots.txt file and put it in the root directory of your website. But if you choose to use the meta tag instead, you won't have many options for protecting your website's directories, and believe me, your site will be extremely vulnerable to aggressive spiders and bots.
There are spiders made by skilled hackers that scan entire websites for vulnerable material. If you don't have a well-configured robots.txt file, one day you'll search Google for some keywords and find your website passwords turning up in the results, as happens to many people.

But forget about those particular spiders and bots made by skilled hackers, and let's talk about the Google spider, or bot. Perhaps you have absolutely no idea of the REAL POWER OF GOOGLE! Google can find passwords, usernames, CGI black holes, sensitive data, vulnerable data, database usernames and passwords, and millions of private things that most web designers don't even imagine. Why? Because they don't care about security and don't create the robots.txt file and/or the htaccess files (for Linux servers). The search engines only speak one language, and that is robots.txt (allow and/or disallow access) and htaccess (also allow and/or disallow access).

If you really want to know the best techniques for finding this secret stuff with Google, check the website below, whose main goal is to help web designers protect their websites from "Google hackers" or "Google hacking". You have to register at:
http://johnny.ihackstuff.com
Then visit the Google hacking database of queries at:
http://johnny.ihackstuff.com/index.php?module=prodreviews

Now, getting back to robots.txt, a normal robots.txt looks like this.

Disallowing all the spiders:

    # Your website title -- http://yourwebsitedomainname.domain
    # Robot Exclusion File -- robots.txt
    # Author: your name
    # Last Updated: The date
    User-agent: *
    Disallow: /dd

This robots.txt code stops any search engine from indexing the "dd" directory, but this is just an example.
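To see how a well-behaved crawler would read the "dd" example, here is a small sketch using Python's standard `urllib.robotparser` module. The bot name `ExampleBot` and the page paths are made up for illustration; the domain is the placeholder from the example itself.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example robots.txt, without the comment header.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dd
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://yourwebsitedomainname.domain"

# A compliant crawler asks this question before fetching each URL.
dd_allowed = parser.can_fetch("ExampleBot", base + "/dd/page.html")
index_allowed = parser.can_fetch("ExampleBot", base + "/index.html")

print("dd page allowed:", dd_allowed)
print("index page allowed:", index_allowed)
```

Keep in mind that the check is voluntary: a crawler that ignores robots.txt is not stopped by it. Also, `Disallow: /dd` matches by prefix, so any path that starts with `/dd` is covered, not only the directory.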
The robots.txt example above was created with RoboGen LE:
RoboGen (free edition)
http://www.rietta.com/downloads/robogen_le_setup.exe

If you want to create and edit htaccess files, there's a very useful free tool that is pretty easy to work with:
HTAccessible
http://www.tlhouse.co.uk/HTAccessible.shtml
(It's constantly being updated with more easy one-click functions to protect your directories and files with htaccess files. Remember that htaccess files are for Linux servers.)

One more thing: always create the robots.txt file and put it in the root directory of your website, with the configuration for all of your private and public directories. Alternatively, put a robots.txt file in the directory that you want to protect, containing the configuration for that directory only, which I don't recommend. If you choose to configure all your directories in one robots.txt file, remember to include the code below for the most important and common directories of any website:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /scripts/
    Disallow: /your private directory 1/
    Disallow: /your private directory 2/
    Disallow: /your private directory 3 and so on/

I'm sure you understand that if you don't tell the search engines not to index the cgi-bin, images, and scripts directories of your website (using the robots.txt file), your sensitive website data will end up in the results of other people's searches on Google, Yahoo and/or AltaVista, which are the most powerful search engines on the web.

So, to protect your cgi-bin directory, which is one of the main targets for hackers (website defacers, script kiddies, crackers), you'll always have to add the disallow rule for this directory. The images directory is optional: if you don't want the Google Images spider to index your images because you have put a lot of work into them, I advise you to disallow access to it too.
The scripts directory is also a main target, especially if you have PHP and CGI scripts, so if you want to protect your work and script configuration, disallow access to this one too. And there are many more sensitive directories that you should, no, you must protect, which could be, for example:
- email;
- newsletter;
- mailing lists;
- spreadsheets (Excel data);
- and much more.

These directories depend on what your website has and what it has to offer. For example, if you sell templates, ebooks, or video tutorials, you'll also have to protect those directories, or you'll end up giving all of your work to website scanners, which, by the way, happens all the time to web design beginners with no experience.

One more extremely important thing: if you usually work with CGI or Perl, especially CGI, which is a little bit different from Perl, be very careful with the scripts that you use on your websites, because there are tons of high-quality programs for scanning CGI websites and CGI scripts in websites, for example:
Cgi Scan
http://217.125.24.22/h/cgiscan.zip
Run the above tool on your website to see if it has CGI black holes, which are much appreciated by "website defacers, script kiddies and crackers".

To finish, if you want to learn much more about robots, spiders, search engines and especially the Google search engine, tell me and I'll send you some high-quality ebooks about it. There's so much to tell and not much time to actually tell it! |
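If you would rather not type the Disallow lines by hand, the multi-directory file described in this post can be generated from a list. This is only a sketch: `build_robots_txt` is a hypothetical helper, `ExampleBot` and `example.com` are placeholders, and the result is checked with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(private_dirs, user_agent="*"):
    """Return robots.txt text that disallows each listed directory."""
    lines = ["User-agent: {}".format(user_agent)]
    for directory in private_dirs:
        # Normalise "cgi-bin", "/cgi-bin" and "/cgi-bin/" to "/cgi-bin/".
        lines.append("Disallow: /{}/".format(directory.strip("/")))
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(["cgi-bin", "images", "scripts"])
print(robots_txt)

# Check the generated rules the way a compliant crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
cgi_allowed = parser.can_fetch("ExampleBot", "http://example.com/cgi-bin/form.cgi")
home_allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
print("cgi-bin allowed:", cgi_allowed)
print("home page allowed:", home_allowed)
```

One caution on the design: robots.txt is a public file that only cooperative crawlers obey, so listing a private directory in it also tells anyone who reads the file that the directory exists. For content that must stay private, it is safer to treat robots.txt as a hint for search engines and rely on real access control, such as htaccess password protection, for the actual blocking.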
|
|
|