Re: Scheduler questions, 1.1 nightly build.

Phil Barnett

2010-04-22

I should add that what I really want to do is toss all previous crawl
information and reindex everything every night. It's just a few servers and
very low impact. My crawl on 1.0 takes about 10 minutes.
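A rough sketch of what such a nightly "throw everything away and recrawl" job could look like. The seed directory name `urls`, the depth and topN values, and the `fresh_crawl` function name are assumptions for illustration only; the install path is the one mentioned in this thread.

```shell
#!/bin/sh
# Sketch only: wipe the previous crawl data and run a one-shot crawl.
# NUTCH_HOME, CRAWL_DIR and SEED_DIR are assumed defaults; adjust to your setup.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch-1.1}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
SEED_DIR="${SEED_DIR:-urls}"

fresh_crawl() {
    # ${CRAWL_DIR:?} aborts if the variable is empty, so rm -rf never sees "".
    rm -rf -- "${CRAWL_DIR:?}"
    # The crawl command creates the crawldb, segments and indexes under CRAWL_DIR.
    # -depth 3 and -topN 50 are placeholder values, not recommendations.
    "$NUTCH_HOME/bin/nutch" crawl "$SEED_DIR" -dir "$CRAWL_DIR" -depth 3 -topN 50
}
```

Calling `fresh_crawl` from a script run by cron once a night would give the behaviour described above, with no crawl state carried over between runs.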

On Thu, Apr 22, 2010 at 4:59 AM, Phil Barnett <philb@(protected)> wrote:

> I'm having a problem where shouldFetch is rejecting everything. I have
> deleted the crawl directory and started the entire crawl from scratch with:
>
> rm -rf crawl
> mkdir crawl
> mkdir segments
>
> I'm absolutely baffled by how this scheduler works.
>
> Is there documentation?
>
> Is the fetchtime saved somewhere other than the crawl database?
>
> I have tried lowering
>
> db.default.fetch.interval to 0
> db.fetch.interval.default to many lower values
> db.fetch.interval.max to different levels.
>
> With those changed, it crawls the top page over and over again. If I make
> them a little larger, it rejects the top page.
>
> I'd really like to see how the Tika parser works, but I can't get any web
> pages into the crawl database.
>
> The last thing I tried was to remove the entire /opt/nutch-1.1 directory
> and start from scratch. It made no difference.
>
> Is this a bug or am I doing something stupid?
>
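One way to check what the crawl database actually holds for a page is to ask `readdb` for a single URL; the record it prints includes the stored fetch time and retry interval. This is only a sketch: the `show_fetch_time` function name and the default paths are assumptions, and it does not by itself show whether fetch times are kept anywhere else.

```shell
#!/bin/sh
# Sketch only: print the crawldb record for one URL.
# NUTCH_HOME and CRAWL_DIR are assumed defaults; adjust to your setup.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch-1.1}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

show_fetch_time() {
    # $1 is the URL to look up, written exactly as it was injected.
    "$NUTCH_HOME/bin/nutch" readdb "$CRAWL_DIR/crawldb" -url "$1"
}
```

Running it against the seed URL right after the inject step, and again after a fetch round, shows how the scheduler moved the fetch time between the two.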