Document-Transformer
About the training corpus format
Hello! When I use this code to train a model, what format should the source corpus, the target corpus, and the context corpus be in? Are they tokenized and BPE-encoded? Could you send me a demo? Thank you very much.
Using the same processing steps as standard NMT systems is fine (e.g., tokenization and BPE for English).
You may refer to the user manual at https://github.com/THUNLP-MT/THUMT for details.
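As a minimal sketch of that preprocessing (assuming the subword-nmt package; the file names and the 32k merge count are placeholders, not necessarily the exact pipeline used in the paper), learning and applying BPE could look like this:

```python
# Minimal BPE sketch using subword-nmt (pip install subword-nmt).
# Assumes the corpus is already tokenized (e.g., with the Moses tokenizer).
# File names and the 32k merge count are placeholders.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1. Learn 32k BPE merge operations from the tokenized training text.
with codecs.open("corpus.tok.en", encoding="utf-8") as infile, \
     codecs.open("bpe.codes.en", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# 2. Apply the learned merges to every file that feeds the model
#    (source, target, and context corpora alike).
with codecs.open("bpe.codes.en", encoding="utf-8") as codes:
    bpe = BPE(codes)

with codecs.open("corpus.tok.en", encoding="utf-8") as inp, \
     codecs.open("corpus.bpe.en", "w", encoding="utf-8") as out:
    for line in inp:
        out.write(bpe.process_line(line))
```

The same codes file would be reused on the validation and test data so they are segmented consistently with the training data.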
Thank you very much!
I have run into another issue. When I use the THUMT code to train a standard sentence-level Transformer model, it reaches only 29 BLEU on the validation set. My training set is the LDC zh-en corpus (2M sentence pairs) and my validation set is MT06, the same as in your paper, but I have not reached the same BLEU score (48.09 in your paper).
So I suspect my parameter settings differ from yours; mine are below. If possible, could you send me your parameter settings?
My e-mail: [email protected] Thank you very much.
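For reference, a sketch of the standard Transformer-base hyperparameters from Vaswani et al. (2017); the key names below are illustrative placeholders, not THUMT's exact option names, and these are not the authors' settings:

```python
# Standard Transformer-base hyperparameters (Vaswani et al., 2017).
# Key names are illustrative placeholders, not THUMT's exact options.
transformer_base = {
    "num_encoder_layers": 6,
    "num_decoder_layers": 6,
    "hidden_size": 512,       # model dimension d_model
    "filter_size": 2048,      # feed-forward inner dimension
    "num_heads": 8,
    "dropout": 0.1,
    "label_smoothing": 0.1,
    "warmup_steps": 4000,     # Noam learning-rate schedule
    "batch_size": 4096,       # approximate tokens per batch
}
```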
It seems that you only used one reference for validation, whereas the NIST test sets have 4 references. Scoring against more references yields higher BLEU scores, which likely explains the gap.
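As a sketch of multi-reference scoring (assuming the sacrebleu package; file names are placeholders), evaluating against all four NIST references could look like this:

```python
# Multi-reference BLEU sketch using sacrebleu (pip install sacrebleu).
# NIST test sets ship with 4 references per source sentence; passing
# all of them to corpus_bleu yields a multi-reference score.
# File names are placeholders.
import sacrebleu

with open("mt06.hyp.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

# One reference stream per reference file, each aligned
# line-by-line with the hypotheses.
references = []
for i in range(4):
    with open(f"mt06.ref{i}.en", encoding="utf-8") as f:
        references.append([line.strip() for line in f])

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```

Scoring the same output against a single reference will typically come out substantially lower, which is consistent with the gap you are seeing.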