neural_sp
Question on masking in transformer encoder
Hello Mr. Hirofumi, thanks for open-sourcing this excellent repository. I have some questions about the masking mechanism in the transformer encoder:
In the code here
https://github.com/hirofumi0810/neural_sp/blob/3be0bac8a1b009ee36f10ca901f4c64160a5ce45/neural_sp/models/seq2seq/encoders/transformer.py#L394
there is a comment that says: # NOTE: no mask to avoid masking all frames in a chunk
If you would be so kind, could you explain briefly why masking all frames in a chunk should be avoided? Isn't it OK if all frames in a chunk are masked? In that case, I think both the masks and the encoder output would still be correct after reshaping from the chunkwise shapes back to their normal shapes.
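Roughly what I have in mind (a toy sketch of my own, not code from neural_sp): chunking a padding mask and reshaping it back recovers the original mask, even when the tail chunk is entirely padding, so I would expect masking whole chunks to be safe in that respect.

import torch

B, T, N_l = 1, 8, 4                                 # batch, frames, chunk size (assumed values)
xx_mask = torch.tensor([[1, 1, 1, 0, 0, 0, 0, 0]],  # last 5 frames are padding,
                       dtype=torch.bool)            # so the second chunk is fully masked

mask_chunk = xx_mask.reshape(B * (T // N_l), N_l)   # chunkwise shape: (n_chunks, N_l)
mask_back = mask_chunk.reshape(B, T)                # back to the normal shape
assert torch.equal(mask_back, xx_mask)              # the round trip is lossless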
Another question: how do you deal with overlapping chunks when streaming_type=='mask'?
All frames in the tail chunk of some utterances might be completely masked out. That's why I set xx_mask to None. But I'm trying to implement strict masking in the tail chunk now.
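To illustrate one concrete problem with a fully masked chunk (a toy sketch in plain PyTorch, not the actual encoder code): when every key position of a chunk is masked, the attention softmax normalizes a row of -inf and the whole row becomes NaN, which then propagates NaN through the outputs and gradients.

import torch
import torch.nn.functional as F

scores = torch.randn(1, 4, 4)                       # (batch, query, key) attention logits
pad_mask = torch.ones(1, 1, 4, dtype=torch.bool)    # pretend every frame in the chunk is padding
scores = scores.masked_fill(pad_mask, float('-inf'))
attn = F.softmax(scores, dim=-1)                    # softmax over a row of -inf
print(attn)                                         # every entry is NaN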
Regarding the 'mask' option, the context size accumulates with depth. The 'reshape' option avoids this at the cost of extra memory consumption.
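A back-of-the-envelope illustration of that accumulation (my own numbers, not from any config in the repo): if each self-attention layer is allowed to look N_l frames to the left under the 'mask' option, stacking layers widens the effective receptive field, whereas 'reshape' physically splits the input so the context never grows beyond the chunk.

# Assumed values for illustration only.
N_l = 64        # left context per layer (frames)
n_layers = 12   # encoder depth

print('mask   :', n_layers * N_l, 'frames of effective left context')   # grows with depth
print('reshape:', N_l, 'frames, independent of depth')                  # fixed per chunk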
Thank you. Will it make any difference to the parameter updates or the final performance if all frames of some padded tail chunks are masked out?
@t13m I checked it and found that the explicit tail masking did not change the performance.
Hi @hirofumi0810, I have found that transformer_enc_pe_type 'add' is used with lc_type 'reshape' in the MMA example 'lc_transformer_mma_hie_subsample8_ma4H_ca4H_w16_from4L_64_64_32.yaml'. In this config, different chunks will have the same positional encoding. Don't you think this will degrade the performance?
@SoonSYJ I remember trying different indices for each chunk, but it was not helpful, so I simply reuse the same indices in every chunk. Note that such positional encoding is still helpful on AISHELL-1.
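For readers following along, a toy sketch of why the 'reshape' option naturally reuses the same positional indices per chunk (my own code and shapes, not the neural_sp implementation):

import torch

B, T, d_model, N_l = 2, 128, 256, 32      # batch, frames, model dim, chunk size (assumed)
xs = torch.randn(B, T, d_model)

# (B, T, d_model) -> (B * T // N_l, N_l, d_model): each chunk is treated as its
# own short "utterance", so a positional table indexed 0..N_l-1 is added to all
# chunks identically.
xs_chunk = xs.reshape(B * (T // N_l), N_l, d_model)
pos = torch.arange(N_l)                   # identical position indices in every chunk
print(xs_chunk.shape, pos[:5])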