SATRN
Confused by self_attention_bias
Hello there! During inference, the decoder self-attention bias should be 1, but instead it is self_attention_bias[:, :, i:i + 1, :i + 1], which has shape (1, 1, 1, i + 1). Could you explain why? Thanks!
During inference, the attention map for decoder_inputs is [batch_size, num_heads, 1, 1]. Why do we need a self-attention bias anyway?
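For reference, here is a minimal sketch of the slicing I mean (plain NumPy, not the actual SATRN code), just to show the shape:

```python
import numpy as np

# Stand-in bias with the same layout as in the code: [1, 1, max_len, max_len].
max_len = 5
self_attention_bias = np.zeros((1, 1, max_len, max_len))

i = 2  # current decoding step
step_bias = self_attention_bias[:, :, i:i + 1, :i + 1]
print(step_bias.shape)  # (1, 1, 1, 3), i.e. (1, 1, 1, i + 1)
```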
Hi @jjjjohnson, Thanks for the comment.
I think your comment is correct: there is no need to add a self-attention bias during inference, so self_attention_bias should be 1, and I think self_attention_bias[:, :, i:i + 1, :i + 1] is a matrix of all 1 values. It was probably implemented this way to keep the code similar to the training phase (transformer_decoder).
Thank you.
I think self_attention_bias is a triangular matrix with shape [1, 1, max_len, max_len]: the upper-right half is -1e9 and the lower-left half, including the diagonal, is 0. So self_attention_bias[:, :, i:i + 1, :i + 1] is a matrix of all 0 values, not all 1 values.
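A quick way to check (a minimal NumPy sketch with a made-up helper name, not the actual SATRN code):

```python
import numpy as np

def lower_triangle_bias(max_len):
    # 0 where attention is allowed (on and below the diagonal),
    # -1e9 to block future positions (strictly above the diagonal).
    mask = np.triu(np.ones((max_len, max_len)), k=1)    # 1 strictly above the diagonal
    return (mask * -1e9)[np.newaxis, np.newaxis, :, :]  # shape [1, 1, max_len, max_len]

max_len = 5
self_attention_bias = lower_triangle_bias(max_len)

for i in range(max_len):
    step_bias = self_attention_bias[:, :, i:i + 1, :i + 1]
    # Row i, columns 0..i are all on or below the diagonal, so nothing is masked:
    assert step_bias.shape == (1, 1, 1, i + 1)
    assert np.all(step_bias == 0)
```

So during inference the sliced bias never masks anything, which suggests it is kept only to keep the decode step consistent with the training-time transformer_decoder path, as mentioned above.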