transformer
Linear transform with bias in multi-head attention
In the paper, Attention Is All You Need, the query, key, and value are linearly transformed without a bias in the multi-head attention. However, the variables in your code are transformed with a bias. Is there a reason for using a bias, or is there something I'm missing?
Thanks.
https://github.com/Kyubyong/transformer/blob/6672f93c57ee97412a92989ed075d65244400227/modules.py#L201-L203
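For reference, the paper defines each head as head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), i.e. pure matrix multiplications with no bias and no activation. A minimal sketch of the contrast in the repo's TF 1.x style (the shapes are placeholders, and the "repo" line is only my approximation of the linked code, which reportedly also applied an activation):

import tensorflow as tf  # TF 1.x API, matching the repo

d_model = 512
queries = tf.placeholder(tf.float32, [None, None, d_model])  # (N, T_q, d_model)

# What the paper specifies: a pure linear map, Q' = Q x W^Q.
Q_paper = tf.layers.dense(queries, d_model, use_bias=False)

# Roughly what the linked lines do: use_bias defaults to True, and an
# activation is applied on top; neither appears in the paper.
Q_repo = tf.layers.dense(queries, d_model, activation=tf.nn.relu)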
Similar question here: it seems that in the original paper only a weight matrix W is used, with no bias and no activation function. I'm wondering about the design behind these three lines.
@roomylee I guess you are right. Look at this Keras implementation: https://github.com/Lsdefine/attention-is-all-you-need-keras/blob/master/transformer.py. You can find the definitions of Q, K, and V in lines 54-64; no bias or activation is used.
I think the bias and activation are optional; they do no harm.
You are right. In the latest code, there is:
Q = tf.layers.dense(queries, d_model, use_bias=False) # (N, T_q, d_model)
K = tf.layers.dense(keys, d_model, use_bias=False) # (N, T_k, d_model)
V = tf.layers.dense(values, d_model, use_bias=False) # (N, T_k, d_model)
Why is the projection done only once here? Shouldn't it be projected 8 times with different weights, as in the paper?
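For what it's worth, projecting once to d_model and then splitting the result into h heads is mathematically the same as applying h separate (d_model, d_k) projections, because the split just partitions the columns of the one big weight matrix. A minimal NumPy check (names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_model // h                       # 64 per head, as in the paper
x = rng.standard_normal((10, d_model))   # one sequence of 10 token vectors

# One big (d_model, d_model) projection, then split into h heads...
W = rng.standard_normal((d_model, d_model))
heads_from_one = np.split(x @ W, h, axis=-1)   # h arrays of shape (10, d_k)

# ...equals h separate (d_model, d_k) projections, where each per-head
# weight matrix is just a column slice of W.
heads_separate = [x @ W_i for W_i in np.split(W, h, axis=-1)]

assert all(np.allclose(a, b) for a, b in zip(heads_from_one, heads_separate))

So the single dense layer followed by a split implements the paper's 8 per-head projections with different weights; they are simply stored in one matrix.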
Hi, for self-attention, do you know why the value needs to be linearly transformed? For self-attention the query and key must be different, so they need to be linearly transformed, but why the value?
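One practical reason (my reading, not confirmed by the authors): the value projection maps each head down to d_k = d_model / h dimensions, so that concatenating the h head outputs gives back exactly d_model dimensions, and it lets each head pick out a different subspace of the input. A rough shape check in NumPy:

import numpy as np

d_model, h, T = 512, 8, 10
d_k = d_model // h
V = np.zeros((T, d_model))

# Without a value projection, each head would emit full d_model-dim values,
# and concatenating h heads would blow up to h * d_model dimensions.
assert np.concatenate([V] * h, axis=-1).shape == (T, h * d_model)  # (10, 4096)

# With a per-head W_V of shape (d_model, d_k), the concat is d_model again.
W_V = [np.zeros((d_model, d_k)) for _ in range(h)]
assert np.concatenate([V @ W for W in W_V], axis=-1).shape == (T, d_model)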
In the recent TensorFlow implementation (tensor2tensor), bias=False is used for multihead_attention:
https://github.com/tensorflow/tensor2tensor/blob/53a1be68727b5d5c3a0d0bf18721013843a49041/tensor2tensor/layers/common_attention.py#L4415-L4417
So bias=False seems to be right.
Then, what difference does it make whether we use a bias or not?
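A sketch of the difference, as far as I understand it: biases on Q and K mostly add content-independent terms to the attention logits. Expanding (x W_Q + b_q)(x W_K + b_k)^T, the b_k term is constant across keys for each query, so the softmax cancels it entirely; the b_q term adds a per-key offset that does shift the attention weights, and a bias on V just adds a constant vector to every output. A quick NumPy check of the cancellation:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.standard_normal((T, d))
W_Q, W_K = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b_q, b_k = rng.standard_normal(d), rng.standard_normal(d)

q, k = x @ W_Q, x @ W_K

# A key bias is constant across keys for each query: softmax cancels it.
assert np.allclose(softmax(q @ (k + b_k).T), softmax(q @ k.T))

# A query bias adds a per-key offset b_q . k_j: the weights do change.
assert not np.allclose(softmax((q + b_q) @ k.T), softmax(q @ k.T))

So the biases buy very little on top of the layer normalization and output projection around them, which is presumably why the paper and tensor2tensor leave them out.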
Thanks, I'll reply to you soon.