behemoth issues

Make Annotations serializable and initial implementation of adding annotations to the Exporter output

2

Hi @jnioche, please see the patch below, which does the following: makes Annotations serializable, meaning we can export them alongside document text as either binary or textual content....

lewismc

Use warc-hadoop library

[https://github.com/ept/warc-hadoop] could be used as a dependency for handling the WARC format in Hadoop. This would be cleaner than having a copy of the lemurproject classes as we currently do.

jnioche

Elasticsearch module

5

Hi @jnioche, I'm working on an Elasticsearch module as part of using Behemoth in an ongoing project. I'll send you a PR ASAP.

lewismc

cTAKES modules for Behemoth

2

Hi Julien, we will most likely be working with cTAKES [0] over the next while. I would really like to run it on Behemoth. I'll try to work on this and...

lewismc

Tests can't be run by more than one person

1

The tests create files like /tmp/sfcmt/.foo.crc. These are owned by the first person to run the tests, so they might not be created or manipulated by the second person to run...

alexmc6
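
One possible way around the shared-path problem described above is to give each test run its own temporary directory rather than a fixed one such as /tmp/sfcmt. The following is only a minimal sketch under that assumption, not the fix adopted by the project: the class name `PerRunTempDir`, the helper `createTestDir`, and the prefix `behemoth-test-` are hypothetical, and only the file name `.foo.crc` is taken from the report.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PerRunTempDir {

    // Hypothetical helper: creates a fresh, uniquely named directory under
    // java.io.tmpdir, owned by whoever runs the tests.
    static Path createTestDir() throws IOException {
        return Files.createTempDirectory("behemoth-test-");
    }

    public static void main(String[] args) throws IOException {
        Path dir = createTestDir();

        // The file name mirrors the one in the report; it now lives in a
        // directory that belongs to the current user.
        Path crc = Files.createFile(dir.resolve(".foo.crc"));
        System.out.println("writable: " + Files.isWritable(crc));

        // Clean up so later runs do not accumulate directories.
        Files.delete(crc);
        Files.delete(dir);
    }
}
```

Because every call returns a different directory, two users running the tests on the same machine would no longer compete for ownership of the same files.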

jnioche

Add negative filter for mimetype

CorpusReader generic parameter for annotations

Add module for OpenNLP components

Switch to new Hadoop API

Options to replace input with output of job
