Issues in recrawling

arpit khurdiya



I am new to the world of Nutch. I am trying to crawl local file
systems on a LAN using Nutch 1.0 and then search them using Solr.
Documents are rarely modified, but the recrawl frequency is 1 day
because documents are frequently added and deleted. I have a few
questions regarding recrawling.

1. What is the major difference between the bin/nutch crawl command and
the recrawl script given in the wiki? Is it just that the script merges
the segments? I am mostly curious about the performance implications.

2. Is there any way to tell the Solr index to delete a particular
document when that resource no longer exists after recrawling? I
don't want to create a new Solr index every time I crawl; I want to
update my existing index.

3. As documents are rarely modified, I want them to be fetched only
when they have been modified. But once interval.default is exceeded,
a document is fetched regardless of whether it has been modified.
Is there any way to fetch only those documents that are newly added
or those that have been modified?
Thanks a lot.

Arpit Khurdiya