Ignoring Robots.txt

Super Man

2009-09-11

Hi,

I want to crawl a website that denies access to all crawlers. The
website is our own, so there is no issue with crawling it, but the
sysadmin doesn't want to change the robots.txt for fear that allowing
one crawler will invite many other crawlers impersonating it.

Is it possible to configure Nutch to ignore robots.txt? I set the
Protocol.CHECK_ROBOTS property to false in nutch-site.xml, but that
doesn't seem to help.
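
For reference, this is roughly the override I added to nutch-site.xml. I
assumed the property key behind the Protocol.CHECK_ROBOTS constant is
protocol.plugin.check.robots; the exact name may differ between Nutch
versions.

  <property>
    <name>protocol.plugin.check.robots</name>
    <value>false</value>
    <description>
      Whether the fetcher should honour robots.txt before fetching a page.
      Set to false here only because the site being crawled is our own.
    </description>
  </property>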

Any clues?

Thanks,
Zee