
Reworked version of CTC+attention model

Open · danpovey opened this issue 2 years ago · 2 comments

It would be nice to have a "reworked" [as in https://github.com/k2-fsa/icefall/pull/288] version of the CTC+attention setup (in a separate directory). That would be a good reference for cases where we need a standard transformer with the changes I made in that dir. The advantages should be mostly speed of training and, hopefully, slightly better results, especially when the data is very large, but also the ability to train in half precision without NaNs. This would involve:

- re-adding transformer.py;
- replacing nn.Linear modules with ScaledLinear and nn.Embedding with ScaledEmbedding;
- removing most LayerNorms and replacing the remaining ones with BasicNorm;
- using the same optimizer I am using (i.e. taking most changes from train.py).

It would probably also involve:

- replacing the feedforward modules with the feedforward module used in conformer.py;
- using the same subsampling-embedding module as conformer.py;
- simplifying the transformer forward function in transformer.py;
- removing the pre_norm and vgg_subsampling options;
- adding a model-warmup option.

In fact it may end up being easier to just copy conformer.py to transformer.py and modify it a bit.
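For illustration, here is a minimal sketch (not the actual PR) of what one layer of such a reworked transformer might look like, assuming the `ScaledLinear`, `BasicNorm`, `ActivationBalancer` and `DoubleSwish` classes from the reworked recipe's scaling.py are importable; the class name, default hyperparameters, and the simplified warmup blending are illustrative assumptions, not the recipe's code.

```python
import torch
import torch.nn as nn

# Provided by the reworked recipe directory (scaling.py); assumed importable here.
from scaling import ScaledLinear, BasicNorm, ActivationBalancer, DoubleSwish


class ReworkedTransformerEncoderLayer(nn.Module):
    """Hypothetical transformer layer with the substitutions described above."""

    def __init__(self, d_model: int = 512, nhead: int = 8,
                 dim_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)

        # Conformer-style feedforward: ScaledLinear instead of nn.Linear,
        # DoubleSwish activation, ActivationBalancer to keep activation
        # statistics in a reasonable range.
        self.feed_forward = nn.Sequential(
            ScaledLinear(d_model, dim_ff),
            ActivationBalancer(channel_dim=-1),
            DoubleSwish(),
            nn.Dropout(dropout),
            ScaledLinear(dim_ff, d_model, initial_scale=0.25),
        )

        # Most LayerNorms are removed; a single BasicNorm per layer remains.
        self.norm_final = BasicNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src: torch.Tensor, warmup: float = 1.0) -> torch.Tensor:
        # src: (seq_len, batch, d_model)
        src_orig = src
        attn_out = self.self_attn(src, src, src, need_weights=False)[0]
        src = src + self.dropout(attn_out)
        src = src + self.dropout(self.feed_forward(src))
        src = self.norm_final(src)

        # Model-warmup option (simplified): early in training, blend the
        # layer output with its input so the network starts out close to
        # the identity; at warmup >= 1.0 this is a no-op.
        if warmup < 1.0:
            src = warmup * src + (1.0 - warmup) * src_orig
        return src
```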

This is low priority though. Just creating the issue so we don't forget.

danpovey · Apr 11 '22 07:04

@csukuangfj @danpovey Do you think it's reasonable to add warmup for the decoder module in conformer? In the LibriSpeech conformer_ctc2 recipe I see that this option was added in the new version of the decoder, but it is left at its default value of 1.0 during training.
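For reference, a minimal sketch of how such a decoder warmup value could be scheduled during training, assuming the decoder's forward path accepts a `warmup` argument defaulting to 1.0 as described above; `compute_warmup`, `model_warm_step`, and the commented call site are hypothetical names, not the recipe's actual code.

```python
def compute_warmup(batch_idx_train: int, model_warm_step: int = 3000) -> float:
    """Ramp linearly from 0 to 1 over the first `model_warm_step` batches,
    then stay at 1.0 (the normal, fully warmed-up behaviour)."""
    return min(batch_idx_train / model_warm_step, 1.0)


# Inside the training loop one would then pass it along instead of leaving
# the decoder's warmup at its default of 1.0, e.g. (hypothetical call site):
#
#   warmup = compute_warmup(params.batch_idx_train)
#   decoder_out = model.decoder_forward(..., warmup=warmup)
```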

videodanchik · Sep 18 '22 20:09

Sure, I think it's worth a try to see whether it helps convergence at the start!

danpovey · Sep 19 '22 03:09