tree_transformer
BPE codes
Hi,
Nice project, thanks! I'm trying to replicate your setup for IWSLT'14. Did you change the variable BPE_TOKENS in fairseq's prepare-iwslt14.sh to 32k, as mentioned in your paper?
Would you be willing to share your BPE codes with me?
Thanks
Hi, I'm struggling to create the data the way you described. I followed the instructions closely, and after preprocessing with fairseq the data looks like this.
A sample line from test.en:
to these non-@@ engineers , li@@ tt@@ leb@@ its became another material , electr@@ on@@ ics became just another material .
When I then preprocess the data with parse_nmt.py, I get the following tree. Note that most BPE segmentations (for example, in non-engineers) are not applied, while others are (electr@@). The resulting vocabulary has 40k entries, which is nowhere near the 10k from my BPE codes.
(ROOT (S (PP (TO to) (NP (DT these) (NNS non-engineers))) (PRN (, ,) (S (NP (NNS littlebits)) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) (NP (NNS (NNS_bpe electr@@) (NNS_bpe on@@) (NNS_bpe ics))) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))
The same tree before BPE looks like this:
(ROOT (S (PP (TO to) (NP (DT these) (NNS non-engineers))) (PRN (, ,) (S (NP (NNS littlebits)) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) (NP (NNS electronics)) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))
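To make the mismatch concrete, here is a minimal diagnostic sketch that lists which BPE tokens from the text line never show up as leaves in the tree. The helper name `tree_leaves` and the leaf regex are my own assumptions for illustration; they are not part of parse_nmt.py or fairseq.

```python
import re

# Assumed leaf pattern: "(TAG token)" where neither part contains
# whitespace or parentheses. This is a simplification for bracketed trees.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def tree_leaves(tree: str) -> list:
    """Return the leaf tokens of a bracketed tree, left to right."""
    return [m.group(2) for m in LEAF.finditer(tree)]


# The sample line from test.en and the tree produced by parse_nmt.py.
bpe_line = (
    "to these non-@@ engineers , li@@ tt@@ leb@@ its became another material , "
    "electr@@ on@@ ics became just another material ."
)
tree = (
    "(ROOT (S (PP (TO to) (NP (DT these) (NNS non-engineers))) "
    "(PRN (, ,) (S (NP (NNS littlebits)) (VP (VBD became) "
    "(NP (DT another) (NN material)))) (, ,)) "
    "(NP (NNS (NNS_bpe electr@@) (NNS_bpe on@@) (NNS_bpe ics))) "
    "(VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))"
)

leaves = tree_leaves(tree)
# BPE tokens present in the text but absent from the tree leaves.
missing = [tok for tok in bpe_line.split() if tok not in leaves]
print(missing)
# ['non-@@', 'engineers', 'li@@', 'tt@@', 'leb@@', 'its']
```

Running the same check over a whole file should show how many segmentations are being dropped, which would explain the inflated vocabulary.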
What I tried
I thought about reapplying the BPE, so I ran parse_nmt.py with the --convert_bpe option. This applies BPE to all the missing tokens, but it also re-applies BPE to tokens that were already segmented:
(ROOT (S (PP (TO to) (NP (DT these) (NNS (NNS_bpe non-@@) (NNS_bpe engineers)))) (PRN (, ,) (S (NP (NNS (NNS_bpe li@@) (NNS_bpe tt@@) (NNS_bpe leb@@) (NNS_bpe its))) (VP (VBD became) (NP (DT another) (NN material)))) (, ,)) (NP (NNS (NNS_bpe (NNS_bpe_bpe electr@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe (NNS_bpe_bpe on@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe ics))) (VP (VBD became) (NP (RB just) (DT another) (NN material))) (. .)))
This produces junk for the tokens where BPE had already been applied in the previous step. See, for example: (NNS_bpe (NNS_bpe_bpe electr@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe (NNS_bpe_bpe on@@) (NNS_bpe_bpe @@@) (NNS_bpe_bpe @)) (NNS_bpe ics)
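For reference, this is roughly the behaviour I would expect instead: segment only the leaves that are still whole words and leave nodes already tagged `_bpe` alone. The sketch below is an assumption on my side, not how parse_nmt.py works. `bpe_leaves` is a hypothetical helper, and `toy_segment` is a stand-in lookup table for a real BPE model, filled with the segmentations from the sample line above.

```python
import re
from typing import Callable, List

# Assumed leaf pattern: "(TAG token)" with no whitespace or parentheses inside.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def bpe_leaves(tree: str, segment: Callable[[str], List[str]]) -> str:
    """Expand whole-word leaves into *_bpe children; skip existing *_bpe leaves."""

    def repl(m):
        tag, token = m.group(1), m.group(2)
        if tag.endswith("_bpe"):
            return m.group(0)  # already segmented, do not touch
        pieces = segment(token)
        if len(pieces) < 2:
            return m.group(0)  # token is a single BPE unit
        children = " ".join(f"({tag}_bpe {p})" for p in pieces)
        return f"({tag} {children})"

    return LEAF.sub(repl, tree)


# Hypothetical stand-in for a real BPE model: a fixed lookup table.
TOY_TABLE = {
    "non-engineers": ["non-@@", "engineers"],
    "littlebits": ["li@@", "tt@@", "leb@@", "its"],
}


def toy_segment(word: str) -> List[str]:
    return TOY_TABLE.get(word, [word])


tree = (
    "(NP (DT these) (NNS non-engineers)) "
    "(NP (NNS (NNS_bpe electr@@) (NNS_bpe on@@) (NNS_bpe ics)))"
)
out = bpe_leaves(tree, toy_segment)
print(out)
```

With this, non-engineers is split into `(NNS (NNS_bpe non-@@) (NNS_bpe engineers))` while the electr@@ on@@ ics subtree stays unchanged, so no `_bpe_bpe` nodes appear.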
Question
How should I preprocess the IWSLT data to get the correct BPE'd tree?
Thanks!