JobTracker gets stuck with DFS problems

Emmanuel de Castro Santana

Hi all,

We are using Nutch to crawl ~500K pages on a 3-node cluster. Each node has a
dual-core processor, 4 GB of RAM and about 100 GB of storage. All nodes run
CentOS.

These 500K pages are spread across several sites, each with between 5K and
200K pages. For each site we start a separate crawl process (using
bin/nutch crawl), and all of them are started almost simultaneously.

We are trying to tune Hadoop's configuration to get a reliable daily crawl.
After crawling for a while we start to see problems, mainly on the
TaskTracker nodes, and most of them relate to HDFS access. We often see
"Bad response 1 for block" and "Filesystem closed", among others. When these
errors become more frequent, the JobTracker gets stuck and we have to run
stop-all. If we lower the maximum number of map and reduce tasks, the process
takes longer to get stuck, but we have not yet found an adequate
configuration.
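
For reference, the per-TaskTracker task limits mentioned above are usually
controlled by two properties. The fragment below is only a sketch: it assumes
the Hadoop 0.20-style property names, the file name is an assumption (older
releases use conf/hadoop-site.xml), and the values are placeholders rather
than the settings of the cluster described here.

```xml
<!-- Illustrative fragment, e.g. for conf/mapred-site.xml (file name assumed). -->
<!-- The values are placeholders, not a recommendation. -->
<configuration>
  <!-- Maximum number of map tasks run at once by a single TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- Maximum number of reduce tasks run at once by a single TaskTracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```

These limits apply per node, so with several crawl processes running at once
all of their jobs compete for the same task slots.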

Given that setup, there are some questions we have been struggling to find an
answer to:

1. What is the most probable cause of the HDFS problems?

2. Is it better to start a single crawl covering all sites, or to keep doing
it the way we do now (i.e. start a separate crawl process for each site)?

3. When it all goes down, is there a way to restart crawling from where the
process stopped?

Thanks in advance

Emmanuel de Castro Santana