Java Mailing List Archive

Parsing .ppt, .xls, .rtf and .doc




Hello everyone,

I'm using Nutch v0.9. I can crawl, fetch and parse HTML and .pdf files.
With .ppt, .xls, .rtf and .doc files the crawl also runs without errors, but
when I use SegmentReader to get the information for each URL, I don't find
any parse text for these formats. I have configured the parse plugins and
enabled them. This is the result that I get when I try with an .xls file:

Any suggestions about what I'm doing wrong? How can I check whether the
plugins are actually parsing?

Thank you in advance
Sent from the Nutch - User mailing list archive at