behemoth
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
Hi @jnioche, please see the patch below, which does the following: makes Annotations serializable, meaning we can export them alongside document text as either binary or textual content....
[https://github.com/ept/warc-hadoop] could be used as a dependency for handling the WARC format in Hadoop. This would be cleaner than having a copy of the lemurproject classes as we currently do.
Hi @jnioche, I'm working on an ES search module as part of using Behemoth in an ongoing project. I'll send you a PR ASAP.
Hi Julien, we will most likely be working with cTAKES [0] over the next while. I would really like to run it on Behemoth. I'll try to work on this and...
The tests create files like /tmp/sfcmt/.foo.crc. These are owned by the first person to run the tests, so they might not be creatable or modifiable by the second person to run...
e.g. document.filter.mimetype.skip. We have a positive one already.
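A minimal sketch of how a negative MIME type filter could sit next to a positive one. This is not Behemoth's implementation: the function name `keep_document` and the parameters `keep_pattern` and `skip_pattern` are hypothetical, and the assumption is that both filters are regular expressions matched against the whole MIME type.

```python
import re


def keep_document(mime_type, keep_pattern=None, skip_pattern=None):
    """Return True if a document with this MIME type should be processed.

    keep_pattern: hypothetical positive filter; when set, the type must match it.
    skip_pattern: hypothetical negative filter; when set, a matching type is dropped.
    """
    if keep_pattern is not None and not re.fullmatch(keep_pattern, mime_type):
        return False
    if skip_pattern is not None and re.fullmatch(skip_pattern, mime_type):
        return False
    return True
```

With `skip_pattern=r"image/.*"`, a document typed `image/png` would be dropped while `text/html` would pass. Applying the skip rule after the keep rule means a type that matches both is excluded, which is one possible design choice among several.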
It could make sense to be able to show (via CorpusReader) only the annotations matching a given regex, defined via a generic parameter.
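The idea can be sketched as follows, under stated assumptions: the `Annotation` class here is a simplified stand-in rather than Behemoth's own type, `filter_annotations` is a hypothetical helper, and the regex is assumed to be matched against the annotation type name.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    # Simplified stand-in for an annotation: a type name plus character offsets.
    type: str
    start: int
    end: int


def filter_annotations(annotations, type_regex):
    """Return only the annotations whose type fully matches type_regex."""
    pattern = re.compile(type_regex)
    return [a for a in annotations if pattern.fullmatch(a.type)]


annotations = [
    Annotation("Token", 0, 5),
    Annotation("Sentence", 0, 42),
    Annotation("TokenGroup", 0, 11),
]
# Keep every annotation whose type starts with "Token".
shown = filter_annotations(annotations, r"Token.*")
```

Here `shown` holds the `Token` and `TokenGroup` annotations and leaves out `Sentence`. Matching the full type name rather than searching within it keeps a pattern such as `Token` from also selecting `TokenGroup` by accident.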
The jobs currently generate a new seqfile. It would be great to have a '-r input' option to replace the input with the output if the job is successful. We'd...
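A rough sketch of the replace-on-success behaviour, assuming paths on a local filesystem rather than HDFS. The function `replace_input_with_output` and the `.bak` backup suffix are hypothetical and do not reflect any existing Behemoth option.

```python
import os
import shutil


def replace_input_with_output(input_path, output_path, succeeded):
    """Swap the job output into the input location, but only if the job succeeded.

    Returns True when the swap happened, False when the input was left untouched.
    """
    if not succeeded:
        return False

    backup = input_path + ".bak"  # hypothetical backup name
    os.rename(input_path, backup)  # set the old input aside first
    try:
        os.rename(output_path, input_path)
    except OSError:
        os.rename(backup, input_path)  # restore the old input on failure
        raise

    # The swap worked, so the old input can be discarded.
    if os.path.isdir(backup):
        shutil.rmtree(backup)
    else:
        os.remove(backup)
    return True
```

Keeping the old input as a backup until the rename has completed means a failed swap leaves the original data in place. A Hadoop version would need the equivalent rename and delete calls on the job's filesystem.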