NMT2017-ZH-EN
Reproducibility issue when training on a smaller dataset and fewer GPUs
Hi:
Just want to know how to replicate the result you mentioned in the README: the model reaches 20 BLEU on the testing dataset after training for only 2 epochs.
I simply used your setup to train my model, but after 3 epochs I got
2020-06-03 17:49:03 | INFO | fairseq_cli.generate | Generate test with beam = 5: BLEU4 = 0.09, 7.5/0.7/0.0/0.0 (BP=1.000, ratio=1.996, syslen=289332, reflen=144951)
My generate script is:
fairseq-generate data-bin/wmt17_zh_en \
--path checkpoints/checkpoint_best.pt \
--batch-size 128 --beam 5 --remove-bpe
and the training data I used are:
- training-parallel-nc-v12
- United Nations Parallel-enzh
Thank you!
Your evaluation script looks legit to me, so this is weird. Could you provide more details, like the training loss and ppl curves? They can be drawn with the script provided in the repo.
Hi @STayinloves:
Here is the result after I executed the script you provided. Since I am not using Jupyter, I added plt.show() at the very end of the file.
I also uploaded train.log.
Thank you again!
You might want to see if checkpoint_last.pt gives you different results.
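That is, the same generate command with --path pointed at the last checkpoint instead of the best one:
fairseq-generate data-bin/wmt17_zh_en \
--path checkpoints/checkpoint_last.pt \
--batch-size 128 --beam 5 --remove-bpe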
I got a zero; here is the result:
2020-06-04 00:07:57 | INFO | fairseq_cli.generate | Generate test with beam=5: BLEU4 = 0.00, 5.4/0.0/0.0/0.0 (BP=0.448, ratio=0.554, syslen=80370, reflen=144951)
Your train.log says that you only have 15 examples in the validation set, which is probably wrong. I'm wondering whether the same mistake happened to the testing set.
That's weird, since I downloaded them from WMT and made sure the files aren't wrong. Here is how I did the pre-processing:
- download them into ./dataset
- put those files in test/valid/train just like you did (we use the same test/valid dataset)
- run prepare.sh (roughly as in the sketch below)
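In shell terms, roughly (the paths are just how I laid things out; the actual corpus files are as described in the repo's instructions):
mkdir -p dataset
# place training-parallel-nc-v12 and the UN parallel corpus files under ./dataset,
# together with the same test/valid files as in the repo
bash prepare.sh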
2020-06-04 00:07:57 | INFO | fairseq_cli.generate | Translated 8037 sentences (88407 tokens) in 14.6s (551.45 sentences/s, 6065.99 tokens/s)
I think test examples are fine...
Thank you for your response
Update: I re-ran the preprocessing and now get 1996 validation sentences instead of the 15 examples you mentioned above.
my preprocess.log
Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/wmt17_zh_en', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, quantization_config_path=None, seed=1, source_lang='zh', srcdict=None, target_lang='en', task='translation', tensorboard_logdir='', testpref='dataset//test.32000.bpe', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='dataset//train.32000.bpe', user_dir=None, validpref='dataset//valid.32000.bpe', workers=12)
[zh] Dictionary: 36495 types
[zh] dataset//train.32000.bpe.zh: 222476 sents, 5624865 tokens, 0.0% replaced by <unk>
[zh] Dictionary: 36495 types
[zh] dataset//valid.32000.bpe.zh: 1996 sents, 58897 tokens, 0.278% replaced by <unk>
[zh] Dictionary: 36495 types
[zh] dataset//test.32000.bpe.zh: 2001 sents, 56962 tokens, 0.365% replaced by <unk>
[en] Dictionary: 31183 types
[en] dataset//train.32000.bpe.en: 222476 sents, 6106080 tokens, 0.0% replaced by <unk>
[en] Dictionary: 31183 types
[en] dataset//valid.32000.bpe.en: 1996 sents, 68078 tokens, 0.00881% replaced by <unk>
[en] Dictionary: 31183 types
[en] dataset//test.32000.bpe.en: 2001 sents, 63675 tokens, 0.00471% replaced by <unk>
Wrote preprocessed data to data-bin/wmt17_zh_en
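For reference, the Namespace above corresponds roughly to a fairseq-preprocess call like this (reconstructed from the log, non-default options only):
fairseq-preprocess --source-lang zh --target-lang en \
--trainpref dataset/train.32000.bpe --validpref dataset/valid.32000.bpe --testpref dataset/test.32000.bpe \
--destdir data-bin/wmt17_zh_en --workers 12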
It seems great; however, after 1 epoch of training I still got 0.15. Since there is a huge difference between 20 and 0.15, I just want to know whether I did something wrong or whether I should be patient and wait for the result.
I uploaded the train.log here; sorry for my lack of experience.
I would say just wait another epoch or two; the model changes dramatically during the first few updates, especially under the warmup scheduler. You can check the loss as an indicator.
I worked on this repo a year ago, so I don't quite remember whether the result differs across runs or seeds. But I did notice it reaches nearly its performance upper bound within the first few epochs.
There's nothing wrong with a lack of experience :)
After 200,000 updates it is still 0.12, so I guess something went wrong. Maybe I'll use a smaller dataset and model for the experiment.
Still, thank you for your response.
You can try the interactive command to check some model outputs manually; a smaller dataset is also a good starting point.
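For example, something like this (assuming the same data-bin directory and checkpoint as your generate command; note that the input you type should be tokenized/BPE-ed the same way as the training data):
fairseq-interactive data-bin/wmt17_zh_en \
--path checkpoints/checkpoint_best.pt \
--beam 5 --remove-bpe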
After changing to a smaller dataset (training-parallel-nc-v12.tgz), I still get the same result, so I guess something went wrong in the pre-processing step, and I still cannot replicate the result. Is there anything I need to do before executing those scripts?
I just noticed a few facts that I was unaware of in our previous discussion.
- The training script can be affected by the number of GPUs available, since it only limits --max-tokens per GPU, so more GPUs lead to a larger batch size during training. I used 6 GPUs previously, while you seem to be using 1 GPU (the --update-freq setting can be helpful in this case; see the sketch below). It's my fault that I didn't note this in the repo, sorry about that.
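Just to illustrate the idea (these are not the repo's exact flags, and the numbers are only assumptions): in fairseq the effective batch size is roughly max-tokens × num_GPUs × update-freq, so on a single GPU something like this keeps the batch comparable to my 6-GPU run:
# On 1 GPU, --update-freq 6 accumulates gradients over 6 steps,
# roughly matching 6 GPUs at the same --max-tokens.
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_zh_en \
--arch transformer --share-decoder-input-output-embed \
--optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 4096 --update-freq 6 \
--save-dir checkpoints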
Unfortunately, I don't currently have the resources to train a model on the full dataset, but based on my little experiment on training-parallel-nc-v12.tgz today (I downloaded and ran everything from scratch and will update the result later), I didn't find any other steps to add to the pre-processing. I found my old training log and will attach it here.
train_wmt17_zh_en.log
I hope this helps!
Update on my experiment from yesterday: I tried to train the model on training-parallel-nc-v12.tgz only (~200k examples), using --update-freq to ensure a similar batch size, and it doesn't work. I observed the validation loss going up while the model could only output random fluent sentences. Then I switched to the full dataset (~20m examples), and after one epoch (2.5 hours on 4 RTX 2080 Ti) I got BLEU4 = 18.89 on the testing set. So I suspect this model configuration cannot be trained on a small dataset easily.
It helps a lot!!
I've tried transformer_iwslt_de_en and other models, and it turns out they don't work either.
So I guess the dataset is quite important for training a Transformer. Anyway, you really saved my day!
Adding to the discussion about different batch sizes: according to the results in Popel and Bojar, "Training Tips for the Transformer Model," Figures 5 and 6, a small batch size can lead to training failure when training the big model.
@STayinloves It helps a lot!! I'll try an even bigger batch size. Thanks again for your help!
@sanxing-chen Hi, can you please guide me on the full dataset (~20m examples) and where I can get it? Thanks
Hi @afaq-ahmad:
After half a year of research and trial and error, I think if you have ~20m examples, then training a regular Transformer is totally fine; you can follow this example. If you want to train a low-resource MT model, flores is another cool project you can start with.
Thanks a lot. I have 24 million sentences, but when I train with the example here, it takes 12 hours for 1 epoch and the BLEU score increases by only 0.2 points. It looks like it will take 30 days of training to reach around 20 BLEU. Do you have any idea how I can speed up the procedure? I am using these parameters:
!CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_zh \
--arch transformer --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.2 --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 8192 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --save-dir checkpoints/transformer
You can leverage --fp16. For --max-tokens, we normally set it to 4k or 3k.
I also noticed that you didn't use --update-freq; since you are using one GPU for training, you need to set it to 4.
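Concretely, your command with those changes would look something like this (other flags kept as in your run):
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_zh \
--arch transformer --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.2 --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--fp16 \
--max-tokens 4096 --update-freq 4 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --save-dir checkpoints/transformer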
I only have 1.05m sentences. How should I adjust the batch size or other parameters to achieve good results? The following are my training parameters and BLEU values:
CUDA_VISIBLE_DEVICES=0 nohup fairseq-train ${data_dir}/data-bin \
-a transformer --optimizer adam --source-lang ${src} --target-lang ${tgt} \
--label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
--lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 200000 \
--warmup-updates 10000 --warmup-init-lr '1e-7' --lr '0.001' \
--adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --clip-norm 25.0 \
--update-freq 4 --max-epoch 25 \
--tensorboard-logdir ~/nmt/log/tensorboardlog_tc4 \
--keep-last-epochs 2 --save-dir ${model_dir}/checkpoints_tc4 > ~/nmt/log/train_tc4.log 2>&1 &
BLEU = 21.13, 55.6/27.2/15.2/9.0 (BP=0.992, ratio=0.992, hyp_len=549536, ref_len=553932)
Hi @sunyi1123,
You can play around with warmup-updates, label-smoothing, and dropout. You can also apply a technique called "back-translation": you first train a reverse-direction MT model and use it to translate target-side monolingual sentences back into the source language, then pair the outputs with the original sentences as synthetic parallel data. This way, you end up with roughly 2x the data.
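A very rough sketch of that loop in fairseq terms (the data-bin-rev / checkpoints_rev / mono.* names are placeholders I made up, not from this repo):
# 1) Train a reverse-direction model (tgt -> src) with the same recipe as above,
#    just swapping --source-lang ${tgt} --target-lang ${src} and using a separate save dir.
# 2) Translate target-side monolingual text (already BPE-ed the same way) back to the source side:
fairseq-interactive ${data_dir}/data-bin-rev \
--path ${model_dir}/checkpoints_rev/checkpoint_best.pt \
--beam 5 < mono.${tgt}.bpe > backtrans.out
# 3) The H- lines hold the hypotheses (still BPE-segmented, since --remove-bpe is omitted):
#    grep '^H-' backtrans.out | cut -f3 > synthetic.train.bpe.${src}
#    Pair them with the original monolingual ${tgt} sentences, concatenate with the real
#    parallel data, re-run fairseq-preprocess, and retrain the forward model.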