What's the intention of `mocha_first_layer`?
https://github.com/hirofumi0810/neural_sp/blob/78fa843e7f9b27b93a57099104db49d481ff95bb/neural_sp/models/seq2seq/decoders/transformer.py#L190-L194
Hey, I noticed there will be `mocha_first_layer - 1` transformer blocks without encoder-decoder attention. What's the intention of that? Also, I copied the neural_sp transformer decoder into ESPnet: training does not converge if `mocha_first_layer` is set to 4 (the LibriSpeech config), but it is much better if I set `mocha_first_layer` to 0.
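To make sure I'm reading the linked lines right, here is a minimal sketch of my understanding: layers below `mocha_first_layer` are built without a source (encoder-decoder) attention module, so only the upper layers attend to the encoder output. The class and parameter names below are illustrative, not neural_sp's actual API.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Illustrative decoder block; source attention is optional per layer."""
    def __init__(self, d_model, n_heads, use_src_attention):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        # Cross attention only exists from mocha_first_layer upward
        self.src_attn = (nn.MultiheadAttention(d_model, n_heads)
                         if use_src_attention else None)

    def forward(self, tgt, memory):
        tgt = tgt + self.self_attn(tgt, tgt, tgt)[0]
        if self.src_attn is not None:
            tgt = tgt + self.src_attn(tgt, memory, memory)[0]
        return tgt

n_layers, mocha_first_layer = 6, 4
layers = nn.ModuleList([
    # With 1-indexed layers, layers 1..mocha_first_layer-1 skip source
    # attention; layers mocha_first_layer..n_layers attend to the encoder.
    DecoderLayer(256, 4, use_src_attention=(lth + 1 >= mocha_first_layer))
    for lth in range(n_layers)
])
```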
@Cescfangs Did you train the streaming decoder or the normal Transformer decoder?
streaming decoder
Can you show me the encoder-decoder attention plot?
OK, but this will take some time.
Also, I'd like to know the performance when `mocha_first_layer=1`.
So I trained the MMA decoder on a very small dataset (1000 utts). The performance may not be good, but I think it's effective for debugging. Here are the acc plot and the first-layer encoder-decoder attention plot:
ESPnet default non-streaming transformer decoder:
MMA decoder, `mocha_first_layer=0`:
MMA decoder, `mocha_first_layer=1`:
MMA decoder, `mocha_first_layer=4`: training failed after epoch 1 due to an inf gradient
I think `mocha_first_layer=1` is identical to `mocha_first_layer=0`. Do you mean `mocha_first_layer - 1` = 1?
I think there are some mistakes in your implementation, or you misunderstand something. Since hard monotonic attention is not globally normalized over time indices, such vertical lines should not appear. If the model does not learn anything, the attention weights should be all zero.
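For intuition, here is a rough sketch of the expected-alignment recurrence (Raffel et al., 2017) that MoChA/MMA hard monotonic attention builds on; the shapes and names are illustrative, not the neural_sp code. Because each selection probability comes from a sigmoid rather than a softmax over encoder frames, unlearnt (near-zero) probabilities make the whole alignment row collapse toward zero instead of being renormalized into vertical stripes:

```python
import torch

def expected_alignment(p_choose):
    """p_choose: (n_dec, n_enc) sigmoid selection probabilities."""
    n_dec, n_enc = p_choose.shape
    alpha = torch.zeros(n_dec, n_enc)
    alpha_prev = torch.zeros(n_enc)
    alpha_prev[0] = 1.0  # attention starts at the first encoder frame
    for i in range(n_dec):
        for j in range(n_enc):
            # q: probability of reaching frame j without stopping earlier
            if j == 0:
                q = alpha_prev[0]
            else:
                q = (1.0 - p_choose[i, j - 1]) * q + alpha_prev[j]
            alpha[i, j] = p_choose[i, j] * q
        alpha_prev = alpha[i]
    return alpha

# Untrained sigmoids (~0) give rows that sum to ~0, i.e., an all-zero plot:
print(expected_alignment(torch.full((4, 10), 0.01)).sum(dim=1))
```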
I see. So what is the actual intention behind `mocha_first_layer`?
In my implementation, the cross attention in the lower layers is not learnt well, so I removed it to encourage the remaining layers to learn alignments correctly. You can find more details in https://arxiv.org/abs/2005.09394.
Thank you, I'll re-check my implementation
I re-checked the plot_attention part and found it was not plotting the encoder-decoder attention weights: the x-axis was actually the attention dim (256). I'll update the attention plots later.
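For reference, here is a hypothetical version of the corrected plotting, where the x-axis is encoder frames rather than the 256-dim attention feature axis; `aws` and its shape are my assumptions for illustration, not ESPnet's actual plot_attention code:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_src_attention(aws, path):
    """aws: (n_heads, n_dec_steps, n_enc_frames) attention weights."""
    n_heads = aws.shape[0]
    fig, axes = plt.subplots(1, n_heads, figsize=(4 * n_heads, 4))
    for h, ax in enumerate(np.atleast_1d(axes)):
        ax.imshow(aws[h], aspect='auto', origin='lower')
        ax.set_xlabel('encoder frames')  # time axis, not the attention dim
        ax.set_ylabel('decoder steps')
        ax.set_title(f'head {h}')
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```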
Here I give the last-layer encoder-decoder attention plots, since the first `mocha_first_layer - 1` layers have no `src_att`:
ESPnet default non-streaming transformer decoder:
MMA decoder, `mocha_first_layer=0` (using MoChA for all decoder layers):
MMA decoder, `mocha_first_layer=2` (first layer has no encoder-decoder attention):
MMA decoder, `mocha_first_layer=4` (first 3 layers have no encoder-decoder attention):
`mocha_first_layer=4` broke after 1 epoch.
My questions are:
- How many epochs did you run?
- What dataset did you use?
- Did you use the regularization method I proposed?
- What is the WER of the baseline offline Transformer?
Cross attention in the offline Transformer also seems weird.
- The plots above are from a very small dataset (1000 utts) trained for 1 epoch (for fast debugging), so the performance may not be good
- I also trained the MMA decoder (`mocha_first_layer=0`) on about 1000 hours of data, and the final acc is competitive with the offline version
- The regularization method has not been used yet
Please report your results once the model converges.
Note that accuracy does not necessarily transfer to the final WER in my experience.
Also, changing `mocha_first_layer` alone does not work.
I tested my 1000h-data streaming model; the performance is bad because the lower decoder layers produce very confused alignments. Now I get the idea of the so-called "attention head pruning in lower layers".
1st decoder layer:
last decoder layer:
I'll try the regularization tricks next.
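For reference, here is my reading of the HeadDrop regularization from the paper linked above: during training, each head's attention weights are entirely masked out with some probability, pushing every head (including those in the lower layers) to learn a usable alignment. This is a sketch under my assumptions, not neural_sp's exact implementation:

```python
import torch

def head_drop(aws, p_drop=0.5, training=True):
    """aws: (batch, n_heads, n_dec, n_enc) per-head attention weights."""
    if not training or p_drop == 0:
        return aws
    # Bernoulli keep-mask over whole heads, rescaled like dropout so the
    # expected contribution of the surviving heads stays the same
    keep = (torch.rand(aws.size(0), aws.size(1), 1, 1,
                       device=aws.device) >= p_drop).float()
    scale = aws.size(1) / keep.sum(dim=1, keepdim=True).clamp(min=1.0)
    return aws * keep * scale
```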
@Cescfangs Amazing! If you can reproduce the results, could you consider sending a PR to ESPnet?
Definitely!