behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Results 13 behemoth issues

Hi @jnioche, please see the patch below, which does the following: makes Annotations serializable, meaning we can export them alongside document text as either binary or textual content...

[https://github.com/ept/warc-hadoop] could be used as a dependency for handling the WARC format in Hadoop. This would be cleaner than having a copy of the lemurproject classes as we currently do.

Hi @jnioche, I'm working on an ES search module as part of using Behemoth in an ongoing project. I'll send you a PR ASAP.

Hi Julien, we will most likely be working with cTAKES [0] over the next while. I would really like to run it on Behemoth. I'll try to work on this and...

The tests create files like /tmp/sfcmt/.foo.crc. These are owned by the first person to run the tests, so might not be created or manipulated by the second person to run...
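One common way to avoid this kind of collision is to give every test run its own scratch directory instead of a fixed path under /tmp. The sketch below is in Python purely for illustration and is not Behemoth code; the helper name `make_test_dir` and the `sfcmt-` prefix are assumptions made for this example.

```python
import tempfile


def make_test_dir(prefix="sfcmt-"):
    """Create a fresh scratch directory for one test run (hypothetical helper).

    tempfile.mkdtemp picks a unique name on every call and creates the
    directory so that only the creating user can read or write it, which
    means two people running the tests never touch the same files.
    """
    return tempfile.mkdtemp(prefix=prefix)
```

Each caller is responsible for deleting its directory when the tests finish.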

e.g. document.filter.mimetype.skip. We have a positive one already.
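A rough sketch of how a negative filter could sit next to a positive one, written in Python for illustration only. The function and parameter names are hypothetical, and treating the filter values as regular expressions is an assumption; only the property name document.filter.mimetype.skip comes from the issue.

```python
import re


def passes_mimetype_filter(mime_type, keep_pattern=None, skip_pattern=None):
    """Decide whether a document with this MIME type should be kept.

    keep_pattern: assumed regex for the existing positive filter.
    skip_pattern: assumed regex for the proposed negative filter.
    The skip pattern is checked first, so it wins if both match.
    """
    if skip_pattern and re.fullmatch(skip_pattern, mime_type):
        return False
    if keep_pattern:
        return re.fullmatch(keep_pattern, mime_type) is not None
    return True
```

Letting the skip pattern take precedence is one possible design choice; the issue does not say how the two filters should interact.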

It could make sense to have the possibility to show (via CorpusReader) only the annotations matching a given regex defined via a generic parameter.
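The idea can be sketched as a small filter, shown here in Python for illustration. This is not the CorpusReader API: annotations are modelled as plain dicts, and matching the regex against the annotation type is an assumption, since the issue does not say which field the pattern applies to.

```python
import re


def filter_annotations(annotations, pattern):
    """Return only the annotations whose type matches the regex.

    annotations: list of dicts with a "type" key (hypothetical model).
    pattern: regex string, e.g. supplied through a generic parameter.
    """
    regex = re.compile(pattern)
    return [a for a in annotations if regex.search(a["type"])]


# Usage example with made-up annotation types.
sample = [
    {"type": "Token", "start": 0, "end": 5},
    {"type": "Person", "start": 6, "end": 12},
    {"type": "Sentence", "start": 0, "end": 13},
]
selected = filter_annotations(sample, r"^(Token|Person)$")
```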

The jobs currently generate a new seqfile. It would be great to have a '-r input' option to replace the input with the output if the job is successful. We'd...
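A minimal sketch of the replace-on-success behaviour, in Python and on the local filesystem for illustration only. The function name `run_and_replace` and the `replace_input` flag are hypothetical, and a real implementation would go through the Hadoop filesystem rather than `shutil`.

```python
import shutil
from pathlib import Path


def run_and_replace(job, input_path, output_path, replace_input=False):
    """Run a job and optionally swap its output in place of its input.

    job: callable taking (input_path, output_path) and returning True on
         success (a stand-in for a real job).
    The input is only removed after the job has reported success, so a
    failed run leaves the original data untouched.
    """
    succeeded = job(input_path, output_path)
    if succeeded and replace_input:
        inp = Path(input_path)
        if inp.is_dir():
            shutil.rmtree(inp)
        else:
            inp.unlink()
        shutil.move(str(output_path), str(inp))
    return succeeded
```

There is a short window between deleting the input and moving the output in which neither is at the original path; renaming the input to a backup first would be a safer variant.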