Results: 32 comments by Cesc

> @Cescfangs yes

Thanks for the reply. I'm curious about the improvement from this mWER tuning, say around a 5% relative WER reduction?

https://github.com/hirofumi0810/neural_sp/blob/2b10b9cc4bdecb5180ecc45575c0ef410fb09aa3/neural_sp/models/criterion.py#L12-L39 Also, I'm a little confused about the "mbr" loss: the inputs are not used in the backward function, so how does the gradient flow to the model parameters?
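Thinking it through, I guess the standard trick looks something like this (a minimal sketch of the usual MBR pattern, not the actual neural_sp code; the names and shapes here are assumptions):

```python
import torch


class MBRLoss(torch.autograd.Function):
    """Hypothetical minimum-risk loss sketch.

    The analytic gradient w.r.t. the hypothesis scores is computed in
    forward() and cached; backward() simply replays it. Autograd then
    routes that returned tensor back through the graph that produced
    `log_probs`, which is how the model parameters still get gradients
    even though backward() never differentiates its inputs directly.
    """

    @staticmethod
    def forward(ctx, log_probs, risks):
        # log_probs: (batch, n_hyps) hypothesis scores from the model
        # risks:     (batch, n_hyps) per-hypothesis errors, no grad needed
        probs = torch.softmax(log_probs, dim=-1)
        expected_risk = (probs * risks).sum(dim=-1)           # (batch,)
        # d E[risk] / d log_probs = p * (risk - E[risk])
        grad = probs * (risks - expected_risk.unsqueeze(-1))
        ctx.save_for_backward(grad)
        return expected_risk.mean()

    @staticmethod
    def backward(ctx, grad_output):
        (grad,) = ctx.saved_tensors
        # One return value per forward() input; `risks` gets None.
        # Dividing by batch size matches the .mean() in forward().
        return grad_output * grad / grad.size(0), None


# usage: `log_probs` must come out of the model so the graph exists
# loss = MBRLoss.apply(log_probs, risks); loss.backward()
```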

OK, but this will take some time

So I trained the MMA decoder with a very small dataset (1000 utts); the performance may not be good, but I think it's effective for debugging. Here is the acc plot and the...

I see, so what's the actual intention behind `mocha_first_layer`?

Thank you, I'll re-check my implementation

I re-checked the plot_attention part and found it was not plotting the encoder-decoder attention weights; the x-axis is actually the attention dim (256). I'll update the attention plots later
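To be concrete about the fix, here is a minimal sketch of what the plot should look like, assuming the weights come out as a `(n_heads, tgt_len, src_len)` numpy array (a hypothetical helper, not the repo's plotting code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so plots can be saved during training
import matplotlib.pyplot as plt


def plot_src_attention(attn, out_path):
    """Plot encoder-decoder attention weights, one panel per head.

    attn: numpy array of shape (n_heads, tgt_len, src_len), i.e. the
    softmax weights, NOT the (tgt_len, d_model) context tensor, whose
    last axis would be the attention dim (e.g. 256).
    """
    n_heads = attn.shape[0]
    fig, axes = plt.subplots(1, n_heads, figsize=(4 * n_heads, 4))
    if n_heads == 1:
        axes = [axes]
    for h, ax in enumerate(axes):
        ax.imshow(attn[h], aspect="auto", origin="lower")
        ax.set_title(f"head {h}")
        ax.set_xlabel("encoder frame (src_len)")  # x-axis is source time
        ax.set_ylabel("decoder step (tgt_len)")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```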

Here I give the last-layer encoder-decoder attention plots, since the first `mocha_first_layer - 1` layers have no `src_att`: **ESPnet default non-streaming transformer decoder:** ![decoder decoders 5 src_attn 1ep](https://user-images.githubusercontent.com/11382612/84633240-ad550300-af22-11ea-88e3-be234d70f211.png) **MMA decoder,...
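To make the layer layout concrete, a hypothetical sketch (parameter names assumed; not the neural_sp implementation) of a decoder where only layers from `mocha_first_layer` upward carry a `src_att` module, which is why the earlier layers have nothing to plot:

```python
import torch.nn as nn


class DecoderLayer(nn.Module):
    """Sketch of a decoder layer; forward() omitted for brevity."""

    def __init__(self, d_model, n_heads, src_attention=True):
        super().__init__()
        self.self_att = nn.MultiheadAttention(d_model, n_heads)
        # Only layers at or above `mocha_first_layer` attend to the encoder
        self.src_att = (nn.MultiheadAttention(d_model, n_heads)
                        if src_attention else None)


def build_decoder(n_layers=6, mocha_first_layer=4, d_model=256, n_heads=4):
    # Layers are 1-indexed here (assumption): the first
    # `mocha_first_layer - 1` layers get src_attention=False.
    return nn.ModuleList(
        DecoderLayer(d_model, n_heads, src_attention=(lyr >= mocha_first_layer))
        for lyr in range(1, n_layers + 1)
    )
```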

* The plots above are from a very small dataset (1000 utts) trained for 1 epoch; the performance may not be good, but I think it's effective for fast debugging
* I also trained...