
TeXoo – A Zoo of Text Extractors

Results 10 TeXoo issues

If you want to train the MentionAnnotator with trigrams as inputs, a `NullPointerException` is thrown in: https://github.com/sebastianarnold/TeXoo/blob/514860d96decdf3ff6613dfcf0d27d9845ddcf60/texoo-entity-recognition/src/main/java/de/datexis/ner/exec/TrainMentionAnnotatorCoNLL.java#L110 This is because `params.embeddingsFile` is `null` (as expected for trigrams, see [Line...

If you call `de.datexis.index.ArticleIndexFactory.loadWikiDataIndex()` with an InternalResource created with `Resource.fromJAR()` as `cacheDir`, a `java.lang.IllegalArgumentException: Prefix string too short` is thrown. Generally, this is a bad idea to do, but...

bug

ObjectSerializer causes `java.lang.IllegalStateException: zip file closed` if called for the first time in a parallel stream. This is caused by the reflections scan of the jar which is not thread-safe....
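
One possible mitigation for this kind of problem is to make sure the expensive scan runs exactly once, before or during the first concurrent access. The sketch below uses the initialization-on-demand holder idiom, where the JVM guarantees that a static initializer runs at most once even if many threads trigger it at the same time. `LazyScanRegistry`, `scan()` and the type names are hypothetical stand-ins for illustration; they are not TeXoo's `ObjectSerializer` API, and the real fix may look different.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical stand-in for a serializer whose first use triggers an
// expensive scan that must not run concurrently (not TeXoo's actual API).
final class LazyScanRegistry {

  // Counts how often the scan ran, only to make the behaviour observable.
  static final AtomicInteger SCAN_COUNT = new AtomicInteger();

  // Holder idiom: the JVM executes this static initializer at most once,
  // under the class-initialization lock, even on concurrent first access.
  private static final class Holder {
    static final List<String> TYPES = scan();
  }

  // Placeholder for the non-thread-safe scan (assumption: it returns type names).
  private static List<String> scan() {
    SCAN_COUNT.incrementAndGet();
    return Arrays.asList("Document", "Sentence", "Token");
  }

  static List<String> types() {
    return Holder.TYPES;
  }

  // Toy "serialization" that depends on the scanned types.
  static String serialize(int id) {
    List<String> types = types();
    return types.get(id % types.size()) + "#" + id;
  }

  public static void main(String[] args) {
    // The first call to serialize() happens inside a parallel stream.
    List<String> out = IntStream.range(0, 1000).parallel()
        .mapToObj(LazyScanRegistry::serialize)
        .collect(Collectors.toList());
    System.out.println(out.size() + " items, scans run: " + SCAN_COUNT.get());
  }
}
```

An even simpler workaround, under the same assumptions, would be to call `types()` once on the main thread before starting the parallel stream, so the scan has finished before any worker thread touches it.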

The optimaize language-detector uses a [quite old guava version](https://github.com/optimaize/language-detector/blob/1a322c462f977b29eca8d3142b816b7111d3fa19/pom.xml#L231) which conflicts with guava 23.0, which is used by [sszuev/fastText_java](https://github.com/sszuev/fastText_java/blob/b7da617478ce7e5c5aba9704828381682f85eec4/pom.xml#L230).

```
Caused by: java.lang.NoSuchMethodError: com.google.common.collect.ImmutableList.copyOf(Ljava/util/Collection;)Lcom/google/common/collect/ImmutableList;
    at com.optimaize.langdetect.profiles.BuiltInLanguages.(BuiltInLanguages.java:92)
    at com.optimaize.langdetect.profiles.LanguageProfileReader.readAllBuiltIn(LanguageProfileReader.java:118)
    at de.datexis.preprocess.DocumentFactory.(DocumentFactory.java:79)
...
```

When training with EarlyStopping, the training seems to get stuck after very few iterations (8-9 batches of 16 examples each). The training appears to be frozen, but no exception is thrown...

`de.datexis.model.Dataset` currently offers no way to remove Documents other than getting the Collection and modifying it directly; is this intended?
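
To illustrate the question, here is a minimal sketch contrasting the two styles: mutating the collection returned by a getter versus an explicit removal method on the container. `SimpleDataset` and its methods are hypothetical and use plain strings as documents; this is not the actual `de.datexis.model.Dataset` class, whose method names and behaviour may differ.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for a dataset container (not de.datexis.model.Dataset).
final class SimpleDataset {

  private final List<String> documents = new ArrayList<>();

  void addDocument(String doc) {
    documents.add(doc);
  }

  // Style described in the issue: callers get the live collection and mutate it.
  Collection<String> getDocuments() {
    return documents;
  }

  // Possible alternative: removal goes through the dataset itself,
  // which leaves room for bookkeeping (e.g. invalidating caches) later.
  boolean removeDocument(String doc) {
    return documents.remove(doc);
  }

  int countDocuments() {
    return documents.size();
  }

  public static void main(String[] args) {
    SimpleDataset ds = new SimpleDataset();
    ds.addDocument("a");
    ds.addDocument("b");
    ds.getDocuments().remove("a"); // workaround: modify the exposed collection
    ds.removeDocument("b");        // explicit API
    System.out.println("documents left: " + ds.countDocuments());
  }
}
```

An explicit method keeps the internal collection an implementation detail, whereas exposing the live collection is simpler but ties callers to its mutability.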

enhancement

- Change serialization in de.datexis.model to serialize entire Document tree
- Introduce wrappers for TASTY Json and reduced Dataset export (e.g. no sentence splitting)

enhancement

This test has no assertions and only logs to the console. We should consider changing this. Additionally, the method name does not clearly express what is tested. https://github.com/sebastianarnold/TeXoo/blob/32f13a1d420d5a2c2407593a68a7cc0e8d5c484c/texoo-entity-linking/src/test/java/de/datexis/nel/NamedEntityAnnotatorTest.java#L43-L55

Not completely decided yet:
- Treat Tags as Annotations
- Make sure Annotations can be added to Spans

enhancement
question

One team created annotations for serializing the de.datexis.model Document model to a database. Please include this in master.

enhancement