Integrate UDify into AllenNLP
It would be useful to integrate the UDify model directly into AllenNLP as a PR, as the code merely extends the library to handle a few extra features. Since the release of the UDify code, AllenNLP has also added a multilingual UD dataset reader and a multilingual dependency parser with a corresponding model, which should make the integration easier.
Here is a list of things that need to be done:
- [ ] Add scripts to download and concatenate the UD data for training/evaluation. Also, add the CoNLL 2018 evaluation script.
- [ ] Create a UDify conllu -> conllu predictor that can handle unseen tokens and multiword ids.
- [ ] Add the sqrt (inverse-square-root) learning rate decay scheduler.
- [ ] Add optional dropout to ScalarMix.
- [ ] Modify the multilingual UD dataset reader to handle multiword ids.
- [ ] Add lemmatizer edit script code.
- [ ] Modify the BERT token embedder to be able to return multiple scalar mixes, one per task (or alternatively all the embeddings). Add optional args for internal BERT dropout.
- [ ] Add generic dynamic masking functions.
- [ ] Add the custom sequence tagger and biaffine dependency parser that handles a multi-task setup.
- [ ] Add the UDify main model, wrapping the BERT, dynamic masking, scalar mix, sequence tagger, and dependency parser code. Provide custom metrics for TensorBoard.
- [ ] Add utility code to optionally cache the vocab and grab UD treebank names from files.
- [ ] Add a helper script to evaluate conllu predictions and output the results as JSON.
- [ ] Add tests to verify the new UDify model and modules.
- [ ] Add UDify config jsonnet file.
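For the scheduler item, a minimal sketch of what an inverse-square-root decay with linear warmup could look like as a standalone function (the function name and defaults are illustrative, not AllenNLP's actual scheduler API):

```python
def sqrt_decay_lr(step: int, base_lr: float = 1e-3, warmup_steps: int = 1000) -> float:
    """Linearly warm up to base_lr over warmup_steps, then decay as 1/sqrt(step).

    At step == warmup_steps the two branches meet at exactly base_lr.
    """
    step = max(step, 1)
    if step < warmup_steps:
        # Linear warmup phase.
        return base_lr * step / warmup_steps
    # Inverse-square-root decay phase, scaled to be continuous at warmup_steps.
    return base_lr * (warmup_steps ** 0.5) / (step ** 0.5)
```

In AllenNLP this would be registered as a `LearningRateScheduler` subclass so it can be selected from the jsonnet config; the arithmetic above is the core of it.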
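For ScalarMix dropout, one way to realize "layer dropout" is to set a dropped layer's mixing logit to negative infinity before the softmax, so the surviving layers renormalize among themselves. A dependency-free sketch of just the weight computation (the real `ScalarMix` operates on tensors; names here are illustrative):

```python
import math
import random

def mixed_weights(scalars, dropout=0.1, training=True, rng=None):
    """Softmax the per-layer scalar parameters into mixing weights,
    optionally dropping whole layers at training time."""
    rng = rng or random.Random()
    logits = [
        float("-inf") if (training and rng.random() < dropout) else s
        for s in scalars
    ]
    m = max(logits)
    if m == float("-inf"):
        # Degenerate case: every layer was dropped; fall back to uniform.
        return [1.0 / len(scalars)] * len(scalars)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The same idea ports directly to a `torch.nn.Parameter` vector of logits with `masked_fill` before `softmax`.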
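For the multiword-id handling in the dataset reader, the core complication is that CoNLL-U files contain multiword token ranges (`1-2`) and empty nodes (`3.1`) whose ids are not plain integers. A toy sketch of the filtering step, assuming rows have already been parsed into dicts (the real reader would work with the `conllu` library's parsed output):

```python
def strip_multiword_tokens(rows):
    """Keep only syntactic-word rows; multiword ranges ('1-2') and
    empty nodes ('3.1') have non-integer ids and are dropped here.
    The dropped rows still need to be remembered by the predictor so
    they can be re-emitted in the conllu output."""
    return [row for row in rows if isinstance(row["id"], int)]
```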
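For the lemmatizer edit scripts, the idea is to predict a transformation from form to lemma rather than the lemma string itself, so rules generalize across words. A deliberately simplistic suffix-rule sketch (UDify's actual implementation computes richer character-level edit scripts; this only illustrates the encode/apply round trip):

```python
def edit_script(form: str, lemma: str):
    """Encode the lemma as (chars to cut from the form, suffix to append),
    based on the longest common prefix. A hypothetical simplification."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (len(form) - i, lemma[i:])

def apply_script(form: str, script):
    """Apply an edit script produced by edit_script to a (possibly new) form."""
    cut, add = script
    return (form[:-cut] if cut else form) + add
```

The tagger then predicts the script as a class label, and the same script learned from one word applies to unseen words with the same inflection pattern.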
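For the dynamic masking item, the training-time trick is word dropout on the BERT input: each wordpiece id is replaced with the `[MASK]` id with some probability, so the model cannot rely on always seeing every token. A minimal list-based sketch (the real code would operate on id tensors inside the embedder):

```python
import random

def dynamically_mask(token_ids, mask_id, p=0.15, rng=None):
    """Return a copy of token_ids with each id replaced by mask_id with
    probability p. Applied only during training; p and mask_id would come
    from the config / vocabulary in practice."""
    rng = rng or random.Random()
    return [mask_id if rng.random() < p else t for t in token_ids]
```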