SATRN
Confused by self_attention_bias
Hello there! During inference, the decoder self-attention bias should be 1, but instead it is self_attention_bias[:, :, i:i + 1, :i + 1], which has shape (1, 1, 1, i + 1). Could you explain why? Thanks!
During inference, the attention map for decoder_inputs is [batch_size, num_heads, 1, 1]. Why do we need a self-attention bias anyway?
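For reference, here is a minimal sketch of the slicing I mean (plain NumPy, not the actual SATRN code), just to show the shape:

```python
import numpy as np

# Stand-in bias with the same layout as in the code: [1, 1, max_len, max_len].
max_len = 5
self_attention_bias = np.zeros((1, 1, max_len, max_len))

i = 2  # current decoding step
step_bias = self_attention_bias[:, :, i:i + 1, :i + 1]
print(step_bias.shape)  # (1, 1, 1, 3), i.e. (1, 1, 1, i + 1)
```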
Hi @jjjjohnson, Thanks for the comment.
I think your comment is correct: there is no need to add a self-attention bias during inference, so self_attention_bias should be 1, and I think self_attention_bias[:, :, i:i + 1, :i + 1] is a matrix of all 1 values. It was probably implemented this way to keep the code similar to the training phase (transformer_decoder).
Thank you.
I think self_attention_bias is a triangular matrix with shape [1, 1, max_len, max_len]: the upper-right half is -1e9 and the lower-left half, including the diagonal, is 0. So self_attention_bias[:, :, i:i + 1, :i + 1] is a matrix of all 0 values, not all 1 values.
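A quick way to check (a minimal NumPy sketch with a made-up helper name, not the actual SATRN code):

```python
import numpy as np

def lower_triangle_bias(max_len):
    # 0 where attention is allowed (on and below the diagonal),
    # -1e9 to block future positions (strictly above the diagonal).
    mask = np.triu(np.ones((max_len, max_len)), k=1)    # 1 strictly above the diagonal
    return (mask * -1e9)[np.newaxis, np.newaxis, :, :]  # shape [1, 1, max_len, max_len]

max_len = 5
self_attention_bias = lower_triangle_bias(max_len)

for i in range(max_len):
    step_bias = self_attention_bias[:, :, i:i + 1, :i + 1]
    # Row i, columns 0..i are all on or below the diagonal, so nothing is masked:
    assert step_bias.shape == (1, 1, 1, i + 1)
    assert np.all(step_bias == 0)
```

So during inference the sliced bias never masks anything, which suggests it is kept only to keep the decode step consistent with the training-time transformer_decoder path, as mentioned above.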