
about post normalization

bityigoss opened this issue 1 year ago • 4 comments

I have added a post layer-normalization option to OpenNMT, but the translation BLEU score is much lower than with pre layer-normalization (in OpenNMT the default is pre-norm, and my model is a 6-layer encoder / 6-layer decoder Transformer-base). So far I don't see a problem with my implementation of post-norm. So I want to ask: why is there no post-norm option in OpenNMT-py's implementation? Does post-norm have convergence issues in OpenNMT-py?

bityigoss · Aug 30 '22 09:08
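For context, the question is about where LayerNorm sits relative to the residual connection in each Transformer sub-layer. Below is a minimal PyTorch sketch of the two orderings; the module and flag names are illustrative, not OpenNMT-py's actual code.

```python
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """One self-attention sub-layer, with LayerNorm placed either
    before the sub-layer (pre-norm) or after the residual add (post-norm)."""

    def __init__(self, d_model=512, heads=8, dropout=0.1, normalize_after=False):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads, dropout=dropout,
                                          batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.normalize_after = normalize_after

    def forward(self, x):
        if self.normalize_after:
            # Post-norm (original "Attention Is All You Need" formulation):
            # the norm is applied to (residual + sub-layer output).
            out, _ = self.attn(x, x, x)
            return self.norm(x + self.dropout(out))
        else:
            # Pre-norm (OpenNMT-py default): the norm is applied to the
            # sub-layer input and the residual path stays un-normalized.
            normed = self.norm(x)
            out, _ = self.attn(normed, normed, normed)
            return x + self.dropout(out)
```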

There should not be a convergence issue. The best approach is probably to submit a PR so we can have a look at your code. PS: shortly after the Transformer paper, tensor2tensor released their code with pre-norm instead of the post-norm described in the paper.

vince62s · Sep 12 '22 14:09

@vince62s Thanks for your reply, here is my code for post-normalization: #2199

bityigoss · Sep 14 '22 09:09

At first sight it looks good. Can you give more info on your training / results?

vince62s · Sep 15 '22 13:09

@vince62s Thank you. I tested on the IWSLT14 DE-EN dataset; my training config is below (the corpus and vocab opts are skipped). If I set the normalize_after option to true, the BLEU score is 11.7; if I don't set the post-norm option, the BLEU score is 37.0.

```yaml
# General opts
save_model: checkpoints/iwslt14.de-en_postnorm/ckpt
keep_checkpoint: 20
save_checkpoint_steps: 1000
average_decay: 0.0005
seed: 42
report_every: 100
train_steps: 15000
valid_steps: 1000
model_task: seq2seq

# Batching
queue_size: 1024
bucket_size: 180000
pool_factor: 8192
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 256
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [8]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
warmup_steps: 6000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
normalize_after: true
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.3]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
```

bityigoss · Sep 16 '22 06:09
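For reference, `decay_method: "noam"` in the config above is the inverse-square-root schedule from the Transformer paper scaled by `learning_rate`. A minimal sketch, assuming that standard formula (OpenNMT-py's exact implementation may differ in details); post-norm models are generally reported to be more sensitive to warmup and peak learning rate than pre-norm ones, which may be relevant to the 11.7 vs 37.0 BLEU gap.

```python
def noam_lr(step, model_dim=512, warmup_steps=6000, learning_rate=2.0):
    """Inverse-square-root ("noam") schedule from the Transformer paper,
    scaled by the configured learning_rate. Sketch only."""
    step = max(step, 1)
    scale = model_dim ** -0.5
    return learning_rate * scale * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps,
# then decays proportionally to 1/sqrt(step).
print(noam_lr(6000))   # peak, with the warmup_steps from the config above
print(noam_lr(15000))  # end of training (train_steps: 15000)
```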