
What's the intention of mocha_first_layer?

Open Cescfangs opened this issue 4 years ago • 18 comments

https://github.com/hirofumi0810/neural_sp/blob/78fa843e7f9b27b93a57099104db49d481ff95bb/neural_sp/models/seq2seq/decoders/transformer.py#L190-L194

Hey, I notice there are mocha_first_layer - 1 Transformer blocks without encoder-decoder attention. What's the intention of that? Also, I copied the neural_sp Transformer decoder into ESPnet, and training does not converge when mocha_first_layer is set to 4 (the Librispeech config), but it is much better when I set mocha_first_layer to 0.
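For concreteness, here is a toy sketch of the structure I am asking about (plain nn.MultiheadAttention stands in for the MMA/MoChA attention, causal masks are omitted, and all names are made up for the illustration): with mocha_first_layer=4, the first 3 blocks would have no source-attention module at all.

```python
import torch.nn as nn

class ToyBlock(nn.Module):
    """Toy decoder block: self-attention + optional source attention + FFN."""
    def __init__(self, d_model, n_heads, src_attention):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Lower blocks get no encoder-decoder attention at all.
        self.src_attn = (nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                         if src_attention else None)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, ys, eouts):
        ys = ys + self.self_attn(ys, ys, ys, need_weights=False)[0]
        if self.src_attn is not None:
            ys = ys + self.src_attn(ys, eouts, eouts, need_weights=False)[0]
        return ys + self.ffn(ys)

class ToyStreamingDecoder(nn.Module):
    """Source attention only starts from the mocha_first_layer-th block (1-based)."""
    def __init__(self, n_layers=6, mocha_first_layer=4, d_model=256, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            ToyBlock(d_model, n_heads, src_attention=(lth + 1 >= mocha_first_layer))
            for lth in range(n_layers))

    def forward(self, ys, eouts):
        for layer in self.layers:
            ys = layer(ys, eouts)  # blocks without src_attn simply ignore eouts
        return ys
```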

Cescfangs avatar Jun 11 '20 08:06 Cescfangs

@Cescfangs Did you train the streaming decoder or the normal Transformer decoder?

hirofumi0810 avatar Jun 11 '20 08:06 hirofumi0810

streaming decoder

Cescfangs avatar Jun 11 '20 08:06 Cescfangs

Can you show me the encoder-decoder attention plot?

hirofumi0810 avatar Jun 11 '20 08:06 hirofumi0810

OK, but this will take some time.

Cescfangs avatar Jun 11 '20 08:06 Cescfangs

Also, I'd like to know the performance when mocha_first_layer=1.

hirofumi0810 avatar Jun 11 '20 08:06 hirofumi0810

I trained the MMA decoder on a very small dataset (1000 utterances), so the performance may not be good, but I think it's effective for debugging. Here are the accuracy plots and the first-layer encoder-decoder attention plots:

ESPnet default non-streaming Transformer decoder: [accuracy plot and decoder layer 0 src_attn plot]

MMA decoder, mocha_first_layer=0: [accuracy plot and decoder layer 0 src_attn plot]

MMA decoder, mocha_first_layer=1: [accuracy plot and decoder layer 0 src_attn plot]

MMA decoder, mocha_first_layer=4: training failed after epoch 1 with an inf gradient.

I think mocha_first_layer=1 is identical to mocha_first_layer=0. Do you mean mocha_first_layer - 1 = 1?

Cescfangs avatar Jun 12 '20 02:06 Cescfangs

I think there are some mistakes in your implementation or you misunderstand something. As hard monotonic attention is not globally normalized over time indices, such vertical lines should not appear. If the model does not learn anything, attention weights should be all zero.
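For reference, the expected alignment of hard monotonic attention follows the recursion of Raffel et al. (2017). Here is a rough sketch (illustrative names, not the neural_sp implementation); each row sums to at most 1 and is not softmax-normalized over the encoder frames, which is why such vertical lines should not appear.

```python
import torch

def expected_monotonic_alignment(p_choose):
    """Expected hard monotonic alignment (Raffel et al., 2017), sketch only.

    p_choose: (B, L_dec, L_enc) selection probabilities in (0, 1),
              e.g. sigmoid of the monotonic energies.
    Returns alpha of the same shape. Each row sums to at most 1
    (probability mass can run off the end of the encoder sequence),
    so it is NOT a softmax over encoder frames.
    """
    B, L_dec, L_enc = p_choose.size()
    alpha = p_choose.new_zeros(B, L_dec, L_enc)
    alpha_prev = p_choose.new_zeros(B, L_enc)
    alpha_prev[:, 0] = 1.0  # before the first token, attention sits on frame 0
    for i in range(L_dec):
        q = p_choose.new_zeros(B)
        for j in range(L_enc):
            # q_{i,j} = (1 - p_{i,j-1}) * q_{i,j-1} + alpha_{i-1,j}
            p_prev = p_choose[:, i, j - 1] if j > 0 else 0.0
            q = (1.0 - p_prev) * q + alpha_prev[:, j]
            alpha[:, i, j] = p_choose[:, i, j] * q
        alpha_prev = alpha[:, i]
    return alpha
```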

hirofumi0810 avatar Jun 12 '20 02:06 hirofumi0810

I see. So what is the actual intention behind mocha_first_layer?

Cescfangs avatar Jun 12 '20 02:06 Cescfangs

In my implementation, the cross attention in the lower layers is not learnt well, so I removed that attention to encourage the remaining layers to learn alignments correctly. You can find more details in https://arxiv.org/abs/2005.09394.

hirofumi0810 avatar Jun 12 '20 02:06 hirofumi0810

Thank you, I'll re-check my implementation

Cescfangs avatar Jun 12 '20 02:06 Cescfangs

I re-checked the plot_attention part and found it was not plotting the encoder-decoder attention weights; the x-axis was actually the attention dimension (256). I'll update the attention plots later.
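For reference, this is the kind of plot I intend to produce (an illustrative matplotlib sketch, not the ESPnet plotting code): the x-axis should be encoder frames, not the attention dimension.

```python
import matplotlib.pyplot as plt

def plot_src_attn(attn, out_path):
    """attn: array of shape (n_decoder_tokens, n_encoder_frames),
    i.e. the weights after the attention normalization, not the
    (n_tokens, 256) projection I was plotting by mistake."""
    fig, ax = plt.subplots()
    im = ax.imshow(attn, aspect="auto", origin="lower", interpolation="none")
    ax.set_xlabel("encoder frames")
    ax.set_ylabel("decoder tokens")
    fig.colorbar(im, ax=ax)
    fig.savefig(out_path)
    plt.close(fig)
```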

Cescfangs avatar Jun 12 '20 02:06 Cescfangs

Here are the last-layer encoder-decoder attention plots, since the first mocha_first_layer - 1 layers have no source attention:

ESPnet default non-streaming Transformer decoder: [decoder layer 5 src_attn plot, 1 epoch]

MMA decoder, mocha_first_layer=0 (MoChA in all decoder layers): [decoder layer 5 src_attn plot, 1 epoch]

MMA decoder, mocha_first_layer=2 (first layer has no encoder-decoder attention): [decoder layer 5 src_attn plot, 1 epoch]

MMA decoder, mocha_first_layer=4 (first 3 layers have no encoder-decoder attention): [decoder layer 5 src_attn plot, 1 epoch]

mocha_first_layer=4 broke after 1 epoch.

Cescfangs avatar Jun 15 '20 08:06 Cescfangs

My questions are:

  • How many epochs did you run?
  • What dataset did you use?
  • Did you use the regularization method I proposed?
  • What is the WER of the baseline offline Transformer?

Cross attention in the offline Transformer also seems weird.

hirofumi0810 avatar Jun 15 '20 08:06 hirofumi0810

  • The plots above are from a very small dataset (1000 utterances) trained for 1 epoch (for fast debugging), so the performance may not be good, but I think it's effective for debugging.
  • I also trained the MMA decoder (mocha_first_layer=0) on about 1000 hours of data, and the final accuracy is competitive with the offline version.
  • The regularization methods have not been used yet.

Cescfangs avatar Jun 15 '20 08:06 Cescfangs

Please report your results once the model converges. Note that accuracy does not necessarily transfer to the final WER in my experience. Also, changing mocha_first_layer alone does not work.

hirofumi0810 avatar Jun 15 '20 08:06 hirofumi0810

I tested my 1000-hour streaming model, and the performance is bad because the lower decoder layers produce very confused alignments. Now I get the idea behind the so-called "attention head pruning in lower layers".

1st decoder layer: [decoder layer 0 src_attn plot, epoch 41]
Last decoder layer: [decoder layer 5 src_attn plot, epoch 41]
I'll try the regularization tricks next.
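For my own notes, here is a rough sketch of how I understand the HeadDrop regularization from the paper (whole attention heads masked out at random during training, so that every head is forced to learn a usable alignment on its own); this is only my reading, not the neural_sp code.

```python
import torch

def head_drop(alpha, p=0.5, training=True):
    """alpha: (B, n_heads, L_dec, L_enc) monotonic attention weights.
    With probability p, an entire head is zeroed out during training.
    (Whether the kept heads should be rescaled, as in standard dropout,
    is a detail I still need to check against the paper.)"""
    if not training or p == 0.0:
        return alpha
    keep = (torch.rand(alpha.size(0), alpha.size(1), 1, 1,
                       device=alpha.device) >= p).to(alpha.dtype)
    return alpha * keep
```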

Cescfangs avatar Jun 15 '20 12:06 Cescfangs

@Cescfangs Amazing! If you can reproduce the results, could you consider sending a PR to ESPnet?

hirofumi0810 avatar Jun 15 '20 12:06 hirofumi0810

Definitely!

Cescfangs avatar Jun 15 '20 12:06 Cescfangs