
RE: Language specifications

Arkadi Kosmynin


Hi Joshua,

> -----Original Message-----
> From: Joshua J Pavel [mailto:jpavel@(protected)]
> Sent: Friday, 23 April 2010 6:57 AM
> To: nutch-user@(protected)
> Subject: Language specifications
> Alternate question... thanks to everyone who has tried to help me through
> the hadoop/AIX issues with 1.0, but I'm going to need to shelve that for
> just a second while I work on some stuff with 0.9 again.
> I need to support one site that has 3 translations: English, French, and
> Spanish. The language is specified on each page by tags like the following:
> <meta name="language" content="ES"/>
> I would like to have one index but yet restrict my search results based
> upon the "lang=" parameter sent to search.jsp. Is there a way to query
> language specific results only from the index?

Yes, there are at least a couple of ways. The easiest is to use Arch and define a separate area for each language. You can then limit a search to the area that matches the requested language. See Arch here:

If you don't like easy solutions and don't mind some coding, you can add an extra field called "lang" to your documents by writing a couple of custom Nutch plugins: an indexing filter implementing IndexingFilter and a query filter extending RawFieldQueryFilter. For sample code, see how this is done in Arch, which adds several custom fields. The "lang" field is also useful because Nutch checks it when choosing an analyser for the document, so you have to add this field if you want to use a custom analyser. The next release of Arch will probably include it, which will make it possible to filter on language automatically.
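
To make the two halves of that approach concrete, here is a minimal sketch in plain Java, deliberately independent of the Nutch classes. It assumes the language is read from the <meta name="language" .../> tag with a simple regular expression and that the search page appends a "lang:xx" clause to the user's query. The class name LangFieldSketch, both method names, and the list of supported languages are hypothetical. In a real plugin the extracted value would be added to the document by the indexing filter, and the "lang:" clause would be handled by the query filter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Arch.
final class LangFieldSketch {
    // Assumption: the site only publishes English, French and Spanish pages.
    private static final Set<String> SUPPORTED =
            new HashSet<>(Arrays.asList("en", "fr", "es"));

    // Matches <meta name="language" content="XX"> with the name attribute first.
    private static final Pattern LANG_META = Pattern.compile(
            "<meta\\s+name\\s*=\\s*[\"']language[\"']\\s+content\\s*=\\s*[\"']([A-Za-z-]+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Indexing side: pull the language code out of the page, lower-cased.
    static Optional<String> extractLang(String html) {
        if (html == null) {
            return Optional.empty();
        }
        Matcher m = LANG_META.matcher(html);
        return m.find()
                ? Optional.of(m.group(1).toLowerCase(Locale.ROOT))
                : Optional.empty();
    }

    // Search side: add a "lang:" clause only for a recognised "lang=" parameter.
    static String restrictToLang(String userQuery, String langParam) {
        if (langParam == null) {
            return userQuery;
        }
        String lang = langParam.trim().toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(lang) ? userQuery + " lang:" + lang : userQuery;
    }
}
```

Checking the parameter against a fixed list keeps arbitrary request values out of the query string, and lower-casing on both sides means a page tagged "ES" matches a request for "lang=es".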

> And, a bonus question (sorry to put it in the same thread):
> Is there a way to access database information from the Nutch bean? I'd
> like to be able to display (for healthcheck reasons) the total number of
> documents in the index.

I guess there are several ways. You can follow the calls from the Nutch bean in a debugger and see how it works. But the easiest way (though possibly not the fastest one) is just to submit a trivial query that will match all your documents. For example, if you are indexing, try a query like "host:mysite". This should do for health checks.
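
A health check built on that idea could look roughly like the sketch below. It is plain Java and does not call Nutch directly: HitCounter is a hypothetical interface standing in for whatever code runs the query through the Nutch bean and returns the total hit count, and IndexHealthCheck and the status strings are likewise made up for illustration.

```java
// Hypothetical stand-in for "run this query and report the total number of hits".
interface HitCounter {
    long totalHits(String query) throws Exception;
}

// Hypothetical helper, not part of Nutch.
final class IndexHealthCheck {
    // Runs a query expected to match every document and summarises the result.
    static String status(HitCounter counter, String matchAllQuery) {
        try {
            long total = counter.totalHits(matchAllQuery);
            return total > 0
                    ? "OK: " + total + " documents"
                    : "WARN: index is empty";
        } catch (Exception e) {
            // A failing search is itself useful health information.
            return "FAIL: " + e.getMessage();
        }
    }
}
```

For example, IndexHealthCheck.status(counter, "host:mysite") would report the number of documents matched by the trivial query, provided that query really does match every page in the index.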



> Thanks again!