nutch crawl issue

matthew a. grisius


I am using the Nutch nightly build nutch-2010-04-27_04-00-28.

I am trying to crawl a single HTML file generated by javadoc with
bin/nutch crawl, but no links are followed. I verified this with
bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1:
only the single seed document I specified is processed.

I searched and reviewed the nutch-user archive and tried several
different settings, but none of them appears to have any effect.

I then downloaded maven-2.2.1 so that I could mvn install tika and
produce tika-app-0.7.jar, in order to extract information about the
javadoc HTML file from the command line. I am not familiar with tika,
but the command-line version doesn't return any metadata, e.g. no
'src=' links from the HTML 'frame' tags. Perhaps I'm using it
incorrectly. I am also not sure how nutch uses tika, so this may not be
related.
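
To make the expectation concrete, here is a minimal sketch of the links
I would expect to be found in a frameset page. It is plain Python using
only the standard library, not Nutch or tika code, and it says nothing
about how either of them parses HTML. The sample markup and its file
names are assumptions that only resemble a javadoc index.html; the
class and function names are made up for illustration.

```python
from html.parser import HTMLParser

# Illustrative frameset, similar in shape to a javadoc index.html.
# The file names are assumptions, not copied from the actual file.
SAMPLE = """
<html>
<frameset cols="20%,80%">
  <frame src="allclasses-frame.html" name="packageFrame">
  <frame src="overview-summary.html" name="classFrame">
  <noframes>
    <a href="overview-summary.html">Non-frame version</a>
  </noframes>
</frameset>
</html>
"""


class FrameLinkParser(HTMLParser):
    """Collects 'src' values of frame/iframe tags and 'href' values of a tags."""

    def __init__(self):
        super().__init__()
        self.frame_srcs = []
        self.anchor_hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        attrs = dict(attrs)
        if tag in ("frame", "iframe") and attrs.get("src"):
            self.frame_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.anchor_hrefs.append(attrs["href"])


def extract_links(html):
    """Return (frame_srcs, anchor_hrefs) found in the given HTML string."""
    parser = FrameLinkParser()
    parser.feed(html)
    parser.close()
    return parser.frame_srcs, parser.anchor_hrefs


if __name__ == "__main__":
    frames, anchors = extract_links(SAMPLE)
    print("frame src:", frames)
    print("a href:", anchors)
```

If only the 'a href' values were treated as outlinks and the 'frame
src' values were ignored, a crawl of such a page would see very few
links. Whether that is what happens in my case is a guess.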

Has anyone crawled javadoc files, or does anyone have any suggestions?
Thanks.


