Crawler FAQ

A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. (From Wikipedia, the free encyclopedia.)

? Why is Crinx grabbing webpages from my website?

Crinx is the agent software of allcoins.org. It crawls web sites all over the world in order to build a vertical search engine, in our case a numismatic search engine.

? I do not want my website to be crawled, what should I do?

You can put a file named robots.txt in the top-level directory of your web server. It is the standard way to exclude robot programs from retrieving part or all of your web site. For a detailed description of robots.txt, please refer to: http://www.robotstxt.org/wc/norobots.html
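Crawlers look for robots.txt at the root of the host, not next to the page they are fetching. The following is a minimal Python sketch of that lookup, not Crinx's actual code; the function name robots_url and the sample page URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url.

    Hypothetical helper, shown only to illustrate where the file must live.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # with /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page deep inside the site still maps to the file at the top level.
print(robots_url("http://www.mydomain.com/coins/catalog/page.html"))
# prints http://www.mydomain.com/robots.txt
```

If your file is not reachable at the URL this returns for your own pages, a crawler will most likely behave as if you had no robots.txt at all.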

? Why does Crinx try to access some non-existent URLs on my website?

Other pages on the web may contain stale links that point to URLs which no longer exist on your web site. Crinx crawls the web by following the links in the pages it has gathered, so it may request such non-existent URLs.
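To make the link-following step concrete, here is a rough Python sketch of how a crawler might collect the links on a page it has gathered. It is an illustration under our own assumptions, not the Crinx implementation; LinkCollector, extract_links and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# A third-party page that still links to a page you removed long ago.
page = (
    '<p><a href="http://www.mydomain.com/old-page.html">old</a> '
    '<a href="/about.html">about</a></p>'
)
links = extract_links("http://www.example.org/links.html", page)
print(links)
```

The crawler cannot tell from the link alone that old-page.html is gone. It only finds out when your server answers the request with an error, which is why such requests show up in your logs.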

? Why doesn't Crinx obey my robots.txt?

We always suggest verifying that your syntax is correct against the robots exclusion standard. A common source of problems is that the robots.txt file isn't placed in the top directory of the server (e.g., www.mydomain.com/robots.txt); placing the file in a subdirectory has no effect.
Crinx obeys the longest (that is, the most specific) applicable rule. This is the more intuitive practice: it matches what people actually do and what they expect us to do.
For example, consider the following robots.txt file:
User-Agent: *
Disallow: /cgi-bin

It's obvious that the webmaster's intent here is to allow robots to crawl everything except the /cgi-bin directory. Consequently, that's what we do.
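The longest-rule behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Crinx runs. The function is_allowed, the (directive, prefix) rule format, and the use of an Allow directive in the second example are assumptions; this page only shows Disallow rules.

```python
def is_allowed(path, rules):
    """Decide whether path may be crawled under longest-match semantics.

    rules is a list of (directive, prefix) pairs, where directive is
    "allow" or "disallow". Hypothetical format, for illustration only.
    """
    best_len = -1
    allowed = True  # no applicable rule means the path may be crawled
    for directive, prefix in rules:
        # A rule applies when its prefix matches the start of the path.
        # Among applicable rules, the longest prefix wins; on a tie the
        # rule listed first is kept.
        if prefix and path.startswith(prefix) and len(prefix) > best_len:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


# The example file above: everything except /cgi-bin is crawlable.
rules = [("disallow", "/cgi-bin")]
print(is_allowed("/index.html", rules))         # True
print(is_allowed("/cgi-bin/search.pl", rules))  # False

# With overlapping rules, the more specific one decides.
mixed = [("disallow", "/"), ("allow", "/public/")]
print(is_allowed("/public/coins.html", mixed))  # True
print(is_allowed("/private/notes.html", mixed)) # False
```

Under these semantics the order of the lines in robots.txt does not matter, only how specific each rule is.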

To prevent your site from being crawled by Crinx, add the following lines to your robots.txt:

User-agent: Crinx
Disallow: /
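Before publishing such a rule you may want to check how a standards-following parser reads it. The sketch below uses Python's standard urllib.robotparser module; the helper can_fetch, the agent name OtherBot and the sample URL are assumptions for illustration. Note that this module applies the first matching rule rather than the longest one, which makes no difference for a file this simple.

```python
from urllib.robotparser import RobotFileParser

# The two lines suggested above, exactly as they would appear in the file.
ROBOTS_TXT = """\
User-agent: Crinx
Disallow: /
"""


def can_fetch(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if agent may fetch url according to robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "http://www.mydomain.com/coins/index.html"
print(can_fetch("Crinx", url))     # False: Crinx is excluded everywhere
print(can_fetch("OtherBot", url))  # True: no rule applies to other robots
```

The second result shows that these two lines exclude only Crinx. Other robots remain free to crawl the site unless you add rules for them as well.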

Crinx follows robots.txt by filtering out URLs that match the rules stored in its robot exclusion database. Once Crinx has noticed your robots.txt and learned its rules, it will no longer grab the web pages that your robots.txt excludes. If you still have a question, please email