John Giorgi
Need to train models for each major entity class: `PRGE`, `LIVB`, `DISO`, `CHED`. The first three are fairly straightforward. As for the last, there are multiple levels of granularity to...
The PyTorch-Transformers library recently added a new `AutoModel` API, which lets you instantiate any of the many pre-trained transformers that are available (BERT, GPT-2, RoBERTa, etc.). We should switch...
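A minimal sketch of what the switch might look like (the import path assumes the `pytorch_transformers` package, since renamed `transformers`; the checkpoint name is just an example):

```python
from pytorch_transformers import AutoModel, AutoTokenizer

# AutoModel/AutoTokenizer infer the correct architecture from the
# checkpoint name, so swapping BERT for RoBERTa, GPT-2, etc. is a
# one-line change.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
```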
When batching data, Saber truncates or right-pads each sequence to a length of `saber.constants.MAX_SENT_LEN`. Truncating sequences should only happen on the train set, ensuring that we don't drop examples...
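A minimal sketch of the current per-sequence behavior (`MAX_SENT_LEN` stands in for `saber.constants.MAX_SENT_LEN`; the helper name is made up). Under the proposed change, the truncation step would be applied to the train partition only:

```python
MAX_SENT_LEN = 100  # stand-in for saber.constants.MAX_SENT_LEN

def truncate_and_pad(seq, max_len=MAX_SENT_LEN, pad_value=0):
    # Drop any tokens past max_len, then right-pad to exactly max_len.
    seq = seq[:max_len]
    return seq + [pad_value] * (max_len - len(seq))
```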
Currently, we are using `keras.preprocessing.sequence.pad_sequences` to pad sequences. This function is easy to use and convenient, but given that we have dropped Keras support (#157) we will need to find...
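One Keras-free option, sketched here with PyTorch (the function name and signature are assumptions that loosely mirror the Keras helper):

```python
import torch

def pad_sequences(sequences, max_len, pad_value=0):
    # Right-pad (and truncate) a list of token-id lists into a LongTensor.
    batch = torch.full((len(sequences), max_len), pad_value, dtype=torch.long)
    for i, seq in enumerate(sequences):
        seq = seq[:max_len]
        batch[i, : len(seq)] = torch.as_tensor(seq, dtype=torch.long)
    return batch
```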
There is currently no easy way to evaluate a trained model. There should be some kind of interface for this, e.g.

```python
from saber import Saber

sb = Saber()
sb.load('path/to/some/model')
...
```
Use a decorator to time functions in the `Saber` class. See https://realpython.com/primer-on-python-decorators/
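A minimal sketch of such a decorator (the name `timed` and the use of `print` are assumptions; Saber's logger could be substituted):

```python
import functools
import time

def timed(func):
    """Report how long the decorated function takes to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.2f}s")
        return result
    return wrapper
```

Methods like `Saber.load()` could then simply be decorated with `@timed`.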
In the docs, models for each major entity type are listed, but not all of them are implemented. The user should get an error when they try to load these...
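A minimal sketch of the guard (the function name and the set of implemented models are assumptions, for illustration only):

```python
AVAILABLE_PRETRAINED = {"PRGE", "LIVB", "DISO"}  # assumed contents

def load_pretrained(name):
    # Fail fast with an informative error rather than an obscure
    # download or deserialization failure further down the stack.
    if name not in AVAILABLE_PRETRAINED:
        raise ValueError(
            f"'{name}' is listed in the docs but is not yet implemented. "
            f"Available models: {sorted(AVAILABLE_PRETRAINED)}"
        )
```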
Currently, the `config.ini` file, which contains settings for using Saber, is highly coupled to the Keras BiLSTM-CRF model. This needs to be fixed. One solution would be to maintain a...
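For illustration only, one possible direction is to namespace model-specific settings so that the top-level section stays model-agnostic (every section and key name below is hypothetical):

```ini
; Hypothetical config.ini layout; all names here are assumptions.
[saber]
dataset_folder = path/to/dataset
output_folder = path/to/output

[bilstm-crf]
word_embed_dim = 200
dropout_rate = 0.3
```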
Currently, the splitting of training data (into a validation or cross-validation split(s)) happens in `prepare_data_for_training()`, which is defined by each model. It should be moved from the models themselves to...
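A minimal sketch of a model-agnostic helper, assuming scikit-learn is available (the function name and defaults are made up):

```python
from sklearn.model_selection import train_test_split

def make_validation_split(X, y, valid_size=0.1, seed=42):
    # Split once, outside of any model class, so that every model
    # trains and validates on identical partitions.
    return train_test_split(X, y, test_size=valid_size, random_state=seed)
```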
Currently, to align BERT tokens to original tokens (before BERT tokenization) we use some code I grabbed from the official BERT repo. spaCy has introduced [functions specifically for aligning two...
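For reference, a sketch using spaCy v2's `spacy.gold.align` (the API moved to `spacy.training` in v3, so the import path is version-dependent):

```python
from spacy.gold import align  # spaCy v2.x

bert_tokens = ["obama", "'", "s", "podcasts", "."]
orig_tokens = ["obama", "'s", "podcasts", "."]

# a2b and b2a map token indices between the two tokenizations; the
# *_multi dicts cover one-to-many alignments.
cost, a2b, b2a, a2b_multi, b2a_multi = align(bert_tokens, orig_tokens)
```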