Nonlinearity before output (+reminder to test conformer)
Guys, I think the fact that the model output dim (87) is smaller than the feedforward dim (256) leads to some wastage of parameters, especially in the layers close to the end, since there are directions in that space which cannot affect the output; this also affects how those last layers train. I think it would be better to put the encoder output through a nonlinearity, e.g. relu + linear, before the final log-softmax. Would be nice if someone could try this out. (Could also try normalization before the relu + linear + log-softmax, but I don't think it's really necessary because the output has already been normalized inside the attention layer.) So either relu + linear + log-softmax, or linear + relu + normalize + linear + log-softmax at the end. Dan
... I am thinking of the attention system here.
I could try this. Do you mean TransformerEncoder + relu + linear + logsoftmax, or TransformerEncoder + relu + normalize + linear + logsoftmax?
I think the first one; the normalize probably is not necessary.
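For reference, a minimal PyTorch sketch of that first variant (TransformerEncoder + relu + linear + log-softmax). The class name, wrapper structure, and dimensions (d_model=256, 87 output classes) are assumptions taken from this thread for illustration, not the actual snowfall code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderWithNonlinearOutput(nn.Module):
    """Hypothetical wrapper: apply relu -> linear -> log-softmax to the
    encoder output, as suggested above. Dimensions (256, 87) follow the
    discussion; the real snowfall model may differ."""

    def __init__(self, encoder: nn.Module, d_model: int = 256, num_classes: int = 87):
        super().__init__()
        self.encoder = encoder                   # e.g. an nn.TransformerEncoder
        self.output_linear = nn.Linear(d_model, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, d_model), in whatever layout the wrapped encoder expects
        x = self.encoder(x)                      # (T, N, d_model)
        x = F.relu(x)                            # nonlinearity before the output layer
        x = self.output_linear(x)                # (T, N, num_classes)
        return F.log_softmax(x, dim=-1)          # log-probs consumed by the MMI/CTC loss
```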
With a nonlinearity (relu) before the output layer, the MMI model performs better, but the CTC model performs worse.
| Model | MMI (WER) | CTC (WER) |
|---|---|---|
| Original | 8.05% [4230 / 52576, 610 ins, 367 del, 3253 sub] | 8.15% [4287 / 52576, 482 ins, 549 del, 3256 sub] |
| Original + Nonlinearity | 7.79% [4098 / 52576, 599 ins, 341 del, 3158 sub] | 8.68% [4561 / 52576, 489 ins, 613 del, 3459 sub] |
mm. Can you try linear -> relu -> batchnorm -> linear -> logsoftmax, with, say, 2048 as the intermediate dim?
OK, will try it!
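A hedged sketch of that deeper head (linear -> relu -> batchnorm -> linear -> log-softmax, with a 2048-dim intermediate layer), again with illustrative names and shapes rather than the actual snowfall implementation. Note that nn.BatchNorm1d normalizes over dim 1, so the (T, N, C) encoder output has to be permuted around it, whereas the LayerNorm variant mentioned below can be applied to the last dimension directly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeperOutputHead(nn.Module):
    """Hypothetical head: linear -> relu -> batchnorm -> linear -> log-softmax,
    with a 2048-dim intermediate layer as asked for above."""

    def __init__(self, d_model: int = 256, d_inter: int = 2048, num_classes: int = 87):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_inter)
        self.norm = nn.BatchNorm1d(d_inter)   # the LayerNorm variant below would use nn.LayerNorm(d_inter)
        self.linear2 = nn.Linear(d_inter, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, d_model) encoder output
        x = F.relu(self.linear1(x))                         # (T, N, d_inter)
        # BatchNorm1d wants (N, C, L), so move the feature dim to position 1 and back
        x = self.norm(x.permute(1, 2, 0)).permute(2, 0, 1)  # (T, N, d_inter)
        x = self.linear2(x)                                 # (T, N, num_classes)
        return F.log_softmax(x, dim=-1)
```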
Tried TransformerEncoder + linear + relu + normalize + linear + logsoftmax, using either BatchNorm or LayerNorm for the normalization. These models perform worse than the one with only the final nonlinearity, i.e., TransformerEncoder + relu + linear + logsoftmax.
| Model | MMI (WER) | CTC (WER) |
|---|---|---|
| Original + Nonlinearity + BatchNorm | 7.91% [4160 / 52576, 586 ins, 384 del, 3190 sub] | 9.60% [5045 / 52576, 510 ins, 773 del, 3762 sub] |
| Original + Nonlinearity + LayerNorm | 7.83% [4117 / 52576, 582 ins, 373 del, 3162 sub] | 9.11% [4790 / 52576, 470 ins, 772 del, 3548 sub] |
OK, too bad. I suppose we should leave it as-is then. @zhu-han how much better is conformer vs. transformer? I wonder if it might be helpful to implement that model.
Conformer outperforms the transformer on many datasets according to https://arxiv.org/pdf/2010.13956.pdf. I think it's worth doing. Will implement it soon.