self-attentive-parser

How to train on gold tags dataset

Open chiehminwei opened this issue 5 years ago • 4 comments

I have a copy of the revised Penn Treebank in what looks like the same format as the files in data/. However, the code breaks when I try to use these files. On further inspection, I'm guessing I need to insert a "TOP" tag at the start of every sentence? I did that and the model starts training, but then the EVAL script doesn't work. My copy of the treebank is also somehow missing a sentence. Is that what's causing the problem for the EVAL script? Can I just copy and paste the missing sentence from the silver trees you provided?

chiehminwei avatar Apr 21 '19 09:04 chiehminwei

This is the error I got from the EVAL script. I'm guessing my data is corrupted. Where do you obtain the data? Do you use the revised PennTreebank, or do you use some script to convert from PennTreebank 3.0?

2 : Length unmatch (43|42)
4 : Words unmatch (self|`)
8 : Length unmatch (24|25)
17 : Words unmatch (-LRB-|D.)
53 : Length unmatch (35|34)
58 : Length unmatch (36|34)
76 : Length unmatch (30|31)
80 : Length unmatch (52|49)
82 : Length unmatch (54|53)
86 : Length unmatch (27|26)
97 : Length unmatch (31|29)
99 : Length unmatch (46|44)
104 : Length unmatch (28|27)
107 : Length unmatch (51|49)
110 : Length unmatch (31|29)
132 : Length unmatch (15|16)
171 : Length unmatch (19|20)
172 : Length unmatch (14|16)
177 : Length unmatch (24|22)
204 : Length unmatch (12|13)
208 : Length unmatch (51|49)
216 : Length unmatch (37|36)
219 : Length unmatch (27|28)
244 : Length unmatch (17|15)
287 : Length unmatch (26|24)
317 : Length unmatch (14|15)
326 : Length unmatch (32|31)
339 : Length unmatch (38|36)
361 : Length unmatch (55|56)
370 : Length unmatch (38|39)
423 : Length unmatch (30|31)
424 : Length unmatch (18|20)
431 : Length unmatch (40|39)
435 : Length unmatch (31|29)
462 : Length unmatch (30|29)
466 : Length unmatch (39|37)
470 : Length unmatch (16|15)
471 : Length unmatch (23|22)
475 : Length unmatch (37|36)
476 : Length unmatch (29|30)
479 : Length unmatch (21|22)
484 : Length unmatch (20|18)
488 : Length unmatch (9|8)
489 : Length unmatch (18|17)
491 : Length unmatch (30|31)
511 : Length unmatch (33|34)
515 : Length unmatch (21|22)
525 : Length unmatch (28|29)
534 : Length unmatch (24|25)
546 : Length unmatch (11|12)
585 : Length unmatch (24|22)
586 : Length unmatch (42|43)
595 : Length unmatch (21|20)

chiehminwei avatar Apr 22 '19 02:04 chiehminwei

I posted treebank conversion scripts at https://github.com/nikitakit/parser-data-gen

These scripts are able to recover the gold-tag data format I have directly from the LDC release.

When it comes to EVALB errors, it's actually normal to see some during the first 1-2 epochs of training (and especially the first time a model is evaluated on the dev set). In fact, for a randomly-initialized parser, EVALB might just crash instead of returning a very low accuracy. The errors should go away once the parser has been trained for a few epochs and starts producing reasonable, non-random outputs.

The "length unmatch" message can occur when predicted punctuation tags differ from the gold tags, because punctuation is excluded from length calculation in the standard evaluation. The "words unmatch" error, on the other hand, looks like a potential data processing issue.
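To tell the two cases apart on your own files, something along these lines might help. It is only a rough sketch, not what EVALB itself does: the `leaves` and `diagnose` helpers are hypothetical names, it assumes one bracketed tree per line, and the `DELETED_TAGS` set is an assumption that should be checked against the DELETE_LABEL entries in whatever parameter file you evaluate with.

```python
import re

# Assumption: tags that the evaluation drops before comparing lengths.
# Check the DELETE_LABEL lines in the .prm file you actually use.
DELETED_TAGS = {"-NONE-", ",", ":", "``", "''", "."}

# Matches a preterminal such as "(NN dog)" and captures (tag, word).
LEAF_RE = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(tree_line):
    """Return the (tag, word) pairs of a one-line bracketed tree."""
    return LEAF_RE.findall(tree_line)


def diagnose(gold_line, pred_line, deleted_tags=DELETED_TAGS):
    """Roughly classify a gold/predicted pair as ok, words differ, or length differs."""
    gold, pred = leaves(gold_line), leaves(pred_line)
    # Different word sequences point at a data processing problem.
    if [word for _, word in gold] != [word for _, word in pred]:
        return "words differ"
    # Same words, but a different number of tokens survive tag-based deletion.
    gold_len = sum(tag not in deleted_tags for tag, _ in gold)
    pred_len = sum(tag not in deleted_tags for tag, _ in pred)
    if gold_len != pred_len:
        return f"length differs ({gold_len}|{pred_len})"
    return "ok"


gold = "(TOP (S (NP (DT The) (NN dog)) (VP (VBZ barks)) (. .)))"
# Same words, but the final period is tagged NN instead of "."
pred = "(TOP (S (NP (DT The) (NN dog)) (VP (VBZ barks)) (NN .)))"
print(diagnose(gold, pred))
```

In this sketch the counts are printed as gold first, then predicted, which is an arbitrary choice. If a pair comes back as "words differ", the two files disagree on tokenization or sentence alignment, and more training won't fix that.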

nikitakit avatar Apr 24 '19 21:04 nikitakit

Thank you so much for the scripts! They're really helpful. I've successfully converted PTB 3.0 and EVALB is looking good. Did you use the same scripts to convert the Chinese treebank? Where should I place the files for Chinese?

chiehminwei avatar Apr 25 '19 00:04 chiehminwei

I added a CTB processing script as well: https://github.com/nikitakit/parser-data-gen/blob/master/corpora/ctb_5.1/build_corpus.sh

You'll have to change the reference to ${HOME}/data/ctb_5.1/ to instead point to the right location on your machine.

nikitakit avatar Apr 28 '19 04:04 nikitakit