gateplugin-LearningFramework issues

Add example pipelines

Initially: Train Topic Model. Ideally this would also make use of the example pipelines from stringannotation (token filtering by stopwords) and corpusstats (token filtering by TF-IDF), but...

johann-petrak

Better messages when users are using PRs incorrectly

Currently, some ways of misusing a PR, either by choosing the wrong PR or by setting the parameters in a way that is not appropriate for a corpus...

johann-petrak

enhancement

Add support for topic models by wrapping gensim

1

This will need an even simpler "corpus representation" for text (list of tokens) only.

johann-petrak
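One way the "list of tokens only" corpus representation mentioned in this issue could look is sketched below. This is only an illustration under assumptions: the one-JSON-array-per-line file layout and the names `TokenCorpus` and `write_token_corpus` are hypothetical and are not taken from the LearningFramework code. The corpus object re-reads the file on every iteration, so it can be passed more than once to a library that makes several passes over the data.

```python
import json


def write_token_corpus(path, documents):
    """Write one document per line, each line a JSON array of token strings.

    Hypothetical file layout, chosen only for this sketch.
    """
    with open(path, "wt", encoding="utf-8") as out:
        for tokens in documents:
            out.write(json.dumps(list(tokens)) + "\n")


class TokenCorpus:
    """Re-iterable stream of documents; each document is a list of tokens."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every iteration so several passes are possible
        # without keeping the whole corpus in memory.
        with open(self.path, "rt", encoding="utf-8") as inp:
            for line in inp:
                line = line.strip()
                if line:
                    yield json.loads(line)
```

Because each document is just a list of strings, token filtering (for example by stopwords) could be applied before writing, without changing the format.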

Adapt script calling parameters for training and application for the dense json engine

The parameters for train.py have been changed to make use from the command line easier; the engine code needs to be adapted to call it correctly. May need to adapt the...

johann-petrak

On-demand caching/reusing of the training set for algorithm/hyperparameter exploration

It would be good to have some way to run the training PR on a cached training set, only changing the training algorithm or hyperparameters. This should work even for...

johann-petrak

Properly implement classification confidence scores

3

This is a bit messy at the moment: make sure we always assign the correct confidence scores to a classification (and if possible, all class labels) if the algorithm returns...

johann-petrak
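As a rough illustration of assigning a confidence to the predicted class and to all class labels, the sketch below assumes the algorithm returns one raw score per label, which the issue text does not confirm. It normalizes the scores with a softmax so that they sum to 1. The function names are hypothetical, and other algorithms may already return probabilities that should be used directly instead.

```python
import math


def label_confidences(scores):
    """Map raw per-label scores to values in [0, 1] that sum to 1 (softmax).

    `scores` is a dict of label -> raw score; this input shape is an
    assumption made for the sketch.
    """
    if not scores:
        raise ValueError("need at least one label score")
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def classify(scores):
    """Return (best label, its confidence, confidences for all labels)."""
    probs = label_confidences(scores)
    best = max(probs, key=probs.get)
    return best, probs[best], probs
```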

Refactor creating class annotation code, move from ModelApplication to SeqEncoder

Ideally, we should have both encoding and decoding code in the SeqEncoder (maybe rename to SeqEncoderDecoder) classes but currently the decoding is in the ModelApplication class. Needs some refactoring and...

johann-petrak

(Temporarily) Merge in some code for mavenization, then factor out again

2

This is for speeding up the mavenization and getting rid of some obstacles quickly: the LearningFramework depends on a couple of libraries which would have to be available on Maven...

johann-petrak

Dense JSON Corpus Representation: consider using gzip compression

Either optionally, or by default write both data and metadata gzip-compressed and make the python library deal with it properly.

johann-petrak

enhancement
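On the Python side, dealing with both compressed and uncompressed files could be as simple as checking for the gzip magic bytes before opening. The following is a minimal sketch using only the standard library. The function names and the one-JSON-object-per-line layout are assumptions for illustration, not the actual dense JSON format or library API.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path, encoding="utf-8"):
    """Open a text file for reading, decompressing if it is gzip data."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding=encoding)
    return open(path, "rt", encoding=encoding)


def write_json_lines(path, rows, compress=True):
    """Write one JSON value per line, gzip-compressed unless disabled."""
    opener = gzip.open if compress else open
    with opener(path, "wt", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row) + "\n")


def read_json_lines(path):
    """Read all JSON lines from a plain or gzip-compressed file."""
    with open_maybe_gzip(path) as inp:
        return [json.loads(line) for line in inp if line.strip()]
```

Detecting the format from the file content rather than the file name means existing uncompressed exports would keep working whether compression becomes optional or the default.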

Support out-of-core exporting and training and alternate corpus representations

7

We must at least support out-of-core exporting, and ideally would also support out-of-core (OOC) training for some engines or algorithms. Maybe for wrapping https://github.com/JohnLangford/vowpal_wabbit and neural networks as well as...

johann-petrak
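The core idea behind out-of-core processing is that instances are streamed from disk and consumed in small batches, so memory use depends on the batch size rather than the corpus size. A minimal sketch of that pattern follows. It assumes a hypothetical export file with one JSON instance per line, and both function names are made up for this example.

```python
import json
from itertools import islice


def iter_instances(path):
    """Lazily yield one parsed instance per non-empty line of the file."""
    with open(path, "rt", encoding="utf-8") as inp:
        for line in inp:
            if line.strip():
                yield json.loads(line)


def iter_batches(instances, batch_size):
    """Group any iterable into lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(instances)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

An engine or algorithm that supports incremental updates could then be fed one batch at a time, for example `for batch in iter_batches(iter_instances(path), 1000): ...`, with only the current batch held in memory.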
