You want some web crawlers to access your site, Google and Bing for example. But most of them are just nuisances built for data collection, or worse, spying, with the harvested data then made publicly available for all to see or resold to your competitors.
Putting an end to unwanted web crawling
I saw these two files posted on wickedfire and thought they should be shared here too. Both contain a fairly extensive list of undesirable web crawlers that you'd rather not have crawling around your site.
Robots.txt is the "honesty principle" file: it politely asks crawlers not to crawl your site, and any crawler that chooses not to respect it can simply ignore it. Personally I still use it, because there's no harm in doing so.
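To give an idea of the format, here's a minimal sketch of what the robots.txt approach looks like. The bot names below are just illustrative examples, not the actual list from the attached file; each block names a crawler's user-agent and asks it to stay out of the whole site:

```
# Example entries only - the attached robots.txt has the full list
User-agent: BadExampleBot
Disallow: /

User-agent: AnotherScraperBot
Disallow: /
```

`Disallow: /` means "don't crawl anything", but again, a crawler only honors this if it wants to.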
.htaccess is the tougher blockade and actually puts a virtual wall in front of those crawlers, denying them access to the site. It's been uploaded as htaccess.txt, but be sure to rename it to .htaccess before uploading it to the root folder of your site (robots.txt goes in the root folder as well).
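For anyone curious what's inside, the blocking is typically done with Apache's mod_rewrite, matching the request's User-Agent header and returning a 403 Forbidden. This is a minimal sketch with made-up bot names (the attached file carries the real list), and it assumes mod_rewrite is enabled on your server:

```
# Example only - substitute the bot names from the attached htaccess file
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadExampleBot|AnotherScraperBot) [NC]
RewriteRule .* - [F,L]
```

The `[NC]` flag makes the match case-insensitive, and `[F,L]` sends a 403 and stops processing further rules. Unlike robots.txt, the crawler gets no say in the matter, though a determined one can still dodge this by faking its user-agent string.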