self-attentive-parser How to train on gold tags dataset

I have a copy of the revised Penn Treebank that looks like it is in the same format as the files in data/. However, the code breaks when I try to use these files. On further inspection, I'm guessing I need to insert a "TOP" tag at the start of every sentence? I did that and the model starts training, but then the EVALB script doesn't work. My copy of the treebank is also somehow missing a sentence. Is this what's causing the problem for the EVALB script? Can I just copy and paste the missing sentence from the silver trees you provided?

Apr 21 '19 09:04 chiehminwei

Apr 22 '19 02:04 chiehminwei

I posted treebank conversion scripts at https://github.com/nikitakit/parser-data-gen

These scripts are able to recover the gold-tag data format I have directly from the LDC release.

When it comes to EVALB errors, it's actually normal to see some for the first 1-2 epochs of training (and especially the first time a model is evaluated on the dev set). In fact, for a randomly-initialized parser, EVALB might just crash instead of returning a very low accuracy. Errors should go away after a parser has been trained for a few epochs and starts producing reasonable/non-random outputs.

The "length unmatch" message can occur when predicted punctuation tags differ from the gold tags, because punctuation is excluded from length calculation in the standard evaluation. The "words unmatch" error, on the other hand, looks like a potential data processing issue.
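To illustrate the "length unmatch" case, here is a minimal Python sketch of how a single mis-tagged punctuation token can change the length the scorer sees. It is not EVALB's actual code: the `PUNCT_TAGS` set, the `scored_length` helper, and the example sentence are all assumptions made for illustration, and the real list of ignored tags comes from whatever EVALB parameter file you evaluate with.

```python
# Assumed set of punctuation POS tags that the scorer ignores (illustrative only;
# the real list is defined by the EVALB parameter file in use).
PUNCT_TAGS = {",", ":", "``", "''", "."}


def scored_length(tagged_words):
    """Count the tokens left after dropping those tagged as punctuation."""
    return sum(1 for _, tag in tagged_words if tag not in PUNCT_TAGS)


# Gold annotation: the dash and the question mark carry punctuation tags.
gold = [("Wait", "VB"), ("-", ":"), ("really", "RB"), ("?", ".")]
# Hypothetical output of an undertrained parser that tags the dash as a noun.
pred = [("Wait", "VB"), ("-", "NN"), ("really", "RB"), ("?", ".")]

gold_len = scored_length(gold)
pred_len = scored_length(pred)
# The words are identical, but the lengths after punctuation removal differ.
length_mismatch = gold_len != pred_len
print(gold_len, pred_len, length_mismatch)
```

Both sequences contain the same words, so a mismatch like this points to tagging quality rather than to the data files. A difference in the words themselves is the case worth checking against your preprocessing.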

Apr 24 '19 21:04 nikitakit

Thank you so much for the scripts! They're really helpful. I've successfully converted PTB 3.0 and EVALB is looking good. Did you use the same scripts to convert the Chinese Treebank? Where should I place the files for Chinese?

Apr 25 '19 00:04 chiehminwei

I added a CTB processing script as well: https://github.com/nikitakit/parser-data-gen/blob/master/corpora/ctb_5.1/build_corpus.sh

You'll have to change the reference to ${HOME}/data/ctb_5.1/ to instead point to the right location on your machine.

Apr 28 '19 04:04 nikitakit
