Java Mailing List Archive

nutch-user.lucene

Ignoring Robots.txt

Super Man




I want to crawl a website that denies access to all crawlers. The
website is our own, so there is no problem with crawling it, but the
sysadmin doesn't want to change robots.txt, for fear that once we
allow one crawler, many impersonating crawlers will follow.

Is it possible to configure Nutch to ignore robots.txt? I set the
Protocol.CHECK_ROBOTS property to false in nutch-site.xml, but that
doesn't seem to help.
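For context, the check being bypassed here is conceptually just a boolean gate in the fetch loop: if a configuration flag says to honor robots.txt, the fetcher consults the site's rules; otherwise every URL is treated as fetchable. The following is a minimal, self-contained sketch of that idea only; it is not Nutch's actual code, and the class name and `fetcher.check.robots` property are hypothetical:

```java
import java.util.Map;

public class RobotsGateSketch {

    // Hypothetical flag mirroring the kind of switch the poster is after;
    // Nutch does not necessarily expose a supported property like this.
    static boolean isFetchAllowed(Map<String, String> conf, String url) {
        boolean honorRobots = Boolean.parseBoolean(
                conf.getOrDefault("fetcher.check.robots", "true"));
        if (!honorRobots) {
            return true; // robots.txt ignored: every URL is fetchable
        }
        // Placeholder for a real robots.txt lookup; here we deny everything,
        // which matches a site whose robots.txt says "Disallow: /".
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> off = Map.of("fetcher.check.robots", "false");
        Map<String, String> on  = Map.of("fetcher.check.robots", "true");
        System.out.println(isFetchAllowed(off, "http://example.com/page"));
        System.out.println(isFetchAllowed(on,  "http://example.com/page"));
    }
}
```

The point of the sketch is that when no such flag exists in the shipped configuration, turning the check off usually means patching the class that performs the robots lookup, not just editing nutch-site.xml.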

Any clues?
