About training corpus format

Rooders opened this issue 4 years ago • 4 comments

Hello! When I use this code to train a model, what format should the source corpus, target corpus, and context corpus be in? Should they be tokenized and BPE-segmented? Could you send me a demo? Thank you very much.

Rooders, Aug 26 '20 09:08

Using the same processing steps as standard NMT systems is fine (e.g., tokenization and BPE for English).

You may refer to the user manual at https://github.com/THUNLP-MT/THUMT for details.
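For instance, here is a minimal Python preprocessing sketch; the tools (sacremoses, subword-nmt) and file names are illustrative assumptions, not necessarily the exact pipeline used for this repo:

```python
# A minimal sketch for the English side. Tool and file names here
# (sacremoses, subword-nmt, train.en) are assumptions for illustration;
# the Chinese side would use a word segmenter (e.g. jieba) instead of
# the Moses tokenizer.
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

tok = MosesTokenizer(lang="en")

# 1. Tokenize the raw corpus.
with open("train.en") as fin, open("train.tok.en", "w") as fout:
    for line in fin:
        fout.write(tok.tokenize(line.strip(), return_str=True) + "\n")

# 2. Learn BPE merge operations on the tokenized text.
with open("train.tok.en") as fin, open("bpe.codes.en", "w") as fout:
    learn_bpe(fin, fout, num_symbols=32000)

# 3. Apply BPE to produce the final training files.
with open("bpe.codes.en") as codes:
    bpe = BPE(codes)
with open("train.tok.en") as fin, open("train.bpe.en", "w") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```

The source, target, and context corpora each go through the same tokenize-then-BPE pipeline for their respective language, so the context corpus can be prepared exactly like the source corpus.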

Glaceon31, Aug 28 '20 02:08

Thank you very much! I've run into another issue now. When I use the THUMT code to train a standard sentence-level Transformer model, it achieves only 29 BLEU on the validation set. My training set is the LDC zh-en corpus (2M sentence pairs) and my validation set is MT06, the same as in your paper, but I haven't reached the same BLEU score (48.09 in your paper). So I guess my parameter settings may not match yours; mine are shown below. If it's not too much trouble, could you send me your parameter settings?

[screenshot of parameter settings]

Rooders, Aug 30 '20 03:08

My e-mail: [email protected]. Thank you very much.

Rooders, Aug 30 '20 04:08

It seems that you used only one reference for validation, while the NIST test sets have four references. Using more references results in higher BLEU scores.
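For example, with sacrebleu (just an assumed scorer for illustration; the paper's numbers may come from a different evaluation script such as multi-bleu.perl):

```python
# A minimal sketch of 4-reference vs. 1-reference BLEU with sacrebleu.
import sacrebleu

# One system output per test segment (a single toy segment here).
hyps = ["a cat was sitting on the mat"]

# NIST sets such as MT06 provide 4 references per segment.
# sacrebleu expects one list per reference *stream*, not per segment.
refs = [
    ["the cat sat on the mat"],        # reference stream 0
    ["a cat was sitting on the mat"],  # reference stream 1
    ["there is a cat on the mat"],     # reference stream 2
    ["a cat sits on the mat"],         # reference stream 3
]

multi_ref = sacrebleu.corpus_bleu(hyps, refs)       # all 4 references
single_ref = sacrebleu.corpus_bleu(hyps, refs[:1])  # only reference 0
print(f"4-ref BLEU: {multi_ref.score:.2f}")
print(f"1-ref BLEU: {single_ref.score:.2f}")
```

Here the hypothesis misses reference 0 but matches reference 1 exactly, so the 4-reference score is much higher; evaluating MT06 against a single reference file has the same effect in reverse.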

Glaceon31, Sep 02 '20 09:09