Nonlinearity before output (+reminder to test conformer)
Guys, I think the fact that the model output dim (87) is smaller than the feedforward dim (256) leads to some wastage of parameters, especially in the layers close to the end, since there are directions in that space which cannot affect the output; this also affects how those last layers train. I think it would be better to put the encoder output through a nonlinearity, e.g. relu + linear, before the final log-softmax. Would be nice if someone could try this out. (Could also try normalization before the relu + linear + log-softmax, but I don't think it's really necessary because the output has already been normalized inside the attention layer.) So either relu + linear + log-softmax, or linear + relu + normalize + linear + log-softmax at the end. Dan
... I am thinking of the attention system here.
I could try this. Do you mean TransformerEncoder + relu + linear + logsoftmax, or TransformerEncoder + relu + normalize + linear + logsoftmax?
I think the first one; the normalize probably is not necessary.
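For reference, a minimal PyTorch sketch of that first variant (TransformerEncoder + relu + linear + log-softmax). The class name, wrapper structure, and dimensions (d_model=256, 87 output classes) are assumptions taken from this thread for illustration, not the actual snowfall code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderWithNonlinearOutput(nn.Module):
    """Hypothetical wrapper: apply relu -> linear -> log-softmax to the
    encoder output, as suggested above. Dimensions (256, 87) follow the
    discussion; the real snowfall model may differ."""

    def __init__(self, encoder: nn.Module, d_model: int = 256, num_classes: int = 87):
        super().__init__()
        self.encoder = encoder                   # e.g. an nn.TransformerEncoder
        self.output_linear = nn.Linear(d_model, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, d_model), in whatever layout the wrapped encoder expects
        x = self.encoder(x)                      # (T, N, d_model)
        x = F.relu(x)                            # nonlinearity before the output layer
        x = self.output_linear(x)                # (T, N, num_classes)
        return F.log_softmax(x, dim=-1)          # log-probs consumed by the MMI/CTC loss
```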
With a nonlinearity (relu) before the output layer, the MMI model performs better, but the CTC model performs worse.
| Model | MMI (WER) | CTC (WER) |
|---|---|---|
| Original | 8.05% [4230 / 52576, 610 ins, 367 del, 3253 sub] | 8.15% [4287 / 52576, 482 ins, 549 del, 3256 sub] |
| Original + Nonlinearity | 7.79% [4098 / 52576, 599 ins, 341 del, 3158 sub] | 8.68% [4561 / 52576, 489 ins, 613 del, 3459 sub] |
mm. Can you try linear -> relu -> batchnorm -> linear -> logsoftmax, with, say, 2048 as the intermediate dim?
OK, will try it!
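A hedged sketch of that deeper head (linear -> relu -> batchnorm -> linear -> log-softmax, with a 2048-dim intermediate layer), again with illustrative names and shapes rather than the actual snowfall implementation. Note that nn.BatchNorm1d normalizes over dim 1, so the (T, N, C) encoder output has to be permuted around it, whereas the LayerNorm variant mentioned below can be applied to the last dimension directly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeperOutputHead(nn.Module):
    """Hypothetical head: linear -> relu -> batchnorm -> linear -> log-softmax,
    with a 2048-dim intermediate layer as asked for above."""

    def __init__(self, d_model: int = 256, d_inter: int = 2048, num_classes: int = 87):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_inter)
        self.norm = nn.BatchNorm1d(d_inter)   # the LayerNorm variant below would use nn.LayerNorm(d_inter)
        self.linear2 = nn.Linear(d_inter, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, d_model) encoder output
        x = F.relu(self.linear1(x))                         # (T, N, d_inter)
        # BatchNorm1d wants (N, C, L), so move the feature dim to position 1 and back
        x = self.norm(x.permute(1, 2, 0)).permute(2, 0, 1)  # (T, N, d_inter)
        x = self.linear2(x)                                 # (T, N, num_classes)
        return F.log_softmax(x, dim=-1)
```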
Tried TransformerEncoder + linear + relu + normalize + linear + logsoftmax, using either BatchNorm or LayerNorm for the normalization. These models perform worse than the one with only the final nonlinearity, i.e., TransformerEncoder + relu + linear + logsoftmax.
| Model | MMI (WER) | CTC (WER) |
|---|---|---|
| Original + Nonlinearity + BatchNorm | 7.91% [4160 / 52576, 586 ins, 384 del, 3190 sub] | 9.60% [5045 / 52576, 510 ins, 773 del, 3762 sub] |
| Original + Nonlinearity + LayerNorm | 7.83% [4117 / 52576, 582 ins, 373 del, 3162 sub] | 9.11% [4790 / 52576, 470 ins, 772 del, 3548 sub] |
OK, too bad. I suppose we should leave it as-is then. @zhu-han how much better is conformer vs. transformer? I wonder if it might be helpful to implement that model.
Conformer outperforms the transformer on many datasets according to https://arxiv.org/pdf/2010.13956.pdf. I think it's worth doing. Will implement it soon.