
Setups for reproducing IWSLT14 De-En

Open BaohaoLiao opened this issue 5 years ago • 4 comments

Hi,

I want to reproduce your result on IWSLT14 De-En, but I can't reach 35.78; my best result is 34.25. I would like to ask about a few details of your setup:

  1. Do you use shared embeddings? I don't. If you do, what is your vocabulary size?
  2. For the language model, I run
     python ~/fairseq/train.py ~/de2en/lmofde \
       --task language_modeling \
       --arch transformer_lm_iwslt \
       --optimizer adam \
       --adam-betas '(0.9, 0.98)' \
       --clip-norm 0.0 \
       --lr-scheduler inverse_sqrt \
       --warmup-init-lr 1e-07 \
       --warmup-updates 4000 \
       --lr 0.0005 \
       --min-lr 1e-09 \
       --dropout 0.1 \
       --weight-decay 0.0 \
       --criterion label_smoothed_cross_entropy \
       --label-smoothing 0.1 \
       --max-tokens 4096 \
       --tokens-per-sample 4096 \
       --save-dir $dir \
       --update-freq 16 \
       --no-epoch-checkpoints \
       --log-format simple \
       --log-interval 1000
     for both the De and En language models (a sketch of how the monolingual LM data might be prepared appears after this list). I train each language model until convergence and use the best checkpoint for NMT. Do you have any suggestions for these settings?
  3. For NMT, I run
     python ~/SCA/train.py $DATA_PATH \
       --task lm_translation \
       --arch transformer_iwslt_de_en \
       --optimizer adam \
       --adam-betas '(0.9, 0.98)' \
       --clip-norm 0.0 \
       --lr-scheduler inverse_sqrt \
       --warmup-init-lr 1e-07 \
       --warmup-updates 4000 \
       --lr 0.0009 \
       --min-lr 1e-09 \
       --dropout 0.3 \
       --weight-decay 0.0 \
       --criterion label_smoothed_cross_entropy \
       --label-smoothing 0.1 \
       --max-tokens 2048 \
       --update-freq 2 \
       --save-dir $SAVE_DIR \
       --tradeoff $i \
       --load-lm \
       --seed 200 \
       --no-epoch-checkpoints \
       --log-format simple \
       --log-interval 1000
     For i (the tradeoff), I tried 0.1, 0.15, and 0.2; the best result comes from 0.15. When you compute the BLEU score, do you use the best checkpoint or an averaged checkpoint (and if averaged, over how many epochs' checkpoints)? Do you also have any other suggestions?
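
For context on point 2: the LM data directory (~/de2en/lmofde) needs to hold binarized monolingual data. Below is a minimal sketch of how that preparation might look with fairseq's preprocessing script; the raw-text paths under ~/de2en/tmp/ are placeholders, and reusing the NMT dictionary via --srcdict is an assumption about the setup, included because the language model usually has to share the translation model's vocabulary to be plugged into it.

```bash
# Hypothetical preparation of the German monolingual LM data (the English side is analogous).
# Paths under ~/de2en/tmp/ are placeholders; --srcdict reuses the NMT dictionary so that
# the language model and the translation model share the same vocabulary.
python ~/fairseq/preprocess.py \
    --only-source \
    --srcdict $DATA_PATH/dict.de.txt \
    --trainpref ~/de2en/tmp/train.de \
    --validpref ~/de2en/tmp/valid.de \
    --testpref ~/de2en/tmp/test.de \
    --destdir ~/de2en/lmofde \
    --workers 8
```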

BaohaoLiao · Dec 24 '19, 19:12

Hi,

I want to reproduce your result on IWSLT14 De-En, but I can't reach 35.78; my best result is 34.25. I would like to ask about a few details of your setup:

1. Do you use shared embeddings? I don't. If you do, what is your vocabulary size?

2. For the language model, I run
   python ~/fairseq/train.py ~/de2en/lmofde \
     --task language_modeling \
     --arch transformer_lm_iwslt \
     --optimizer adam \
     --adam-betas '(0.9, 0.98)' \
     --clip-norm 0.0 \
     --lr-scheduler inverse_sqrt \
     --warmup-init-lr 1e-07 \
     --warmup-updates 4000 \
     --lr 0.0005 \
     --min-lr 1e-09 \
     --dropout 0.1 \
     --weight-decay 0.0 \
     --criterion label_smoothed_cross_entropy \
     --label-smoothing 0.1 \
     --max-tokens 4096 \
     --tokens-per-sample 4096 \
     --save-dir $dir \
     --update-freq 16 \
     --no-epoch-checkpoints \
     --log-format simple \
     --log-interval 1000
   for both the De and En language models. I train each language model until convergence and use the best checkpoint for NMT. Do you have any suggestions for these settings?

3. For NMT, I run
   python ~/SCA/train.py $DATA_PATH \
     --task lm_translation \
     --arch transformer_iwslt_de_en \
     --optimizer adam \
     --adam-betas '(0.9, 0.98)' \
     --clip-norm 0.0 \
     --lr-scheduler inverse_sqrt \
     --warmup-init-lr 1e-07 \
     --warmup-updates 4000 \
     --lr 0.0009 \
     --min-lr 1e-09 \
     --dropout 0.3 \
     --weight-decay 0.0 \
     --criterion label_smoothed_cross_entropy \
     --label-smoothing 0.1 \
     --max-tokens 2048 \
     --update-freq 2 \
     --save-dir $SAVE_DIR \
     --tradeoff $i \
     --load-lm \
     --seed 200 \
     --no-epoch-checkpoints \
     --log-format simple \
     --log-interval 1000
   For i (the tradeoff), I tried 0.1, 0.15, and 0.2; the best result comes from 0.15. When you compute the BLEU score, do you use the best checkpoint or an averaged checkpoint (and if averaged, over how many epochs' checkpoints)? Do you also have any other suggestions?
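
On the best-versus-averaged-checkpoint question: with --no-epoch-checkpoints only checkpoint_best.pt and checkpoint_last.pt are kept, so checkpoint averaging would require training without that flag. A rough sketch of how averaging and BLEU evaluation could look with the standard fairseq tooling is below; the number of averaged checkpoints and the generation flags are illustrative, it assumes the SCA fork keeps fairseq's generate.py and scripts/average_checkpoints.py, and any extra flags the lm_translation task might need at inference time are omitted.

```bash
# Optional: average the last 5 epoch checkpoints (5 is an arbitrary illustrative choice;
# this needs epoch checkpoints, i.e. training without --no-epoch-checkpoints).
python ~/SCA/scripts/average_checkpoints.py \
    --inputs $SAVE_DIR \
    --num-epoch-checkpoints 5 \
    --output $SAVE_DIR/checkpoint.avg5.pt

# Decode the test set and report BLEU with the usual fairseq generation pipeline.
python ~/SCA/generate.py $DATA_PATH \
    --task lm_translation \
    --path $SAVE_DIR/checkpoint_best.pt \
    --batch-size 128 \
    --beam 5 \
    --remove-bpe
```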

By the way, I use 1 GPU. How many GPUs do you use for IWSLT14 De-En and WMT14 En-De, respectively? I need to make sure we use the same effective batch size by setting --update-freq.
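
For reference, fairseq's effective batch size scales as max_tokens × number of GPUs × update_freq, so matching a multi-GPU configuration on a single GPU just means scaling --update-freq up by the GPU count. A quick arithmetic sketch (the 4-GPU figure is only an example, not a setting confirmed in this thread):

```bash
# Effective tokens per update = max_tokens * num_gpus * update_freq.
MAX_TOKENS=2048; NUM_GPUS=1; UPDATE_FREQ=2
echo $(( MAX_TOKENS * NUM_GPUS * UPDATE_FREQ ))   # 4096 tokens/update for the 1-GPU setting above

# A hypothetical 4-GPU run with the same per-GPU --max-tokens and --update-freq:
echo $(( 2048 * 4 * 2 ))                          # 16384, i.e. 4x larger, so a 1-GPU job
                                                  # would need --update-freq 8 to match it
```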

BaohaoLiao · Dec 24 '19, 20:12

Yes, I used --share-all-embeddings, and the vocabulary size is 10,000 (see my paper for details).

I also noticed that you changed the learning rate; I used the default arguments from the examples/translation README.

I used just one GPU for IWSLT and 4 GPUs for WMT.

I did not average checkpoints.
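
To make the shared-embedding detail concrete: fairseq's --share-all-embeddings requires a joint source/target dictionary, which is typically obtained by learning a joint BPE (around 10k merges here, to match the vocabulary size mentioned above) and binarizing with --joined-dictionary. The following is a hedged sketch with placeholder paths, not the exact preprocessing used for the paper:

```bash
# Hypothetical binarization for a joint De-En vocabulary of roughly 10k BPE types.
# Assumes a joint BPE code has already been learned and applied to train/valid/test.
python ~/fairseq/preprocess.py \
    --source-lang de --target-lang en \
    --trainpref ~/de2en/tmp/train \
    --validpref ~/de2en/tmp/valid \
    --testpref ~/de2en/tmp/test \
    --joined-dictionary \
    --destdir $DATA_PATH \
    --workers 8

# Then pass --share-all-embeddings to the NMT training command above.
```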

teslacool · Dec 25 '19, 01:12

> Yes, I used --share-all-embeddings, and the vocabulary size is 10,000 (see my paper for details).
>
> I also noticed that you changed the learning rate; I used the default arguments from the examples/translation README.
>
> I used just one GPU for IWSLT and 4 GPUs for WMT.
>
> I did not average checkpoints.

I can reproduce the result now. Thank you very much.

BaohaoLiao · Dec 29 '19, 20:12

> Yes, I used --share-all-embeddings, and the vocabulary size is 10,000 (see my paper for details). I also noticed that you changed the learning rate; I used the default arguments from the examples/translation README. I used just one GPU for IWSLT and 4 GPUs for WMT. I did not average checkpoints.

> I can reproduce the result now. Thank you very much.

May I ask how long it takes to train the LM on 1 GPU, and to train the NMT model on 4 GPUs? Thank you!

1024er · Mar 01 '20, 06:03