
Linear transform with bias at multi-head attention

Open roomylee opened this issue 6 years ago • 9 comments

In the paper, Attention Is All You Need, query, key, and value are linearly transformed without bias in the multi-head attention. However, the variables in your code are transformed with a bias. Is there a reason for using a bias, or is there something I'm missing?

Thanks.

https://github.com/Kyubyong/transformer/blob/6672f93c57ee97412a92989ed075d65244400227/modules.py#L201-L203
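
For reference, a minimal sketch of what the paper's bias-free projections would look like (illustrative TF2-style code, not the repo's actual implementation):

```python
import tensorflow as tf

d_model = 512
N, T_q, T_k = 2, 5, 7  # toy batch and sequence sizes

queries = tf.random.normal((N, T_q, d_model))
keys    = tf.random.normal((N, T_k, d_model))
values  = tf.random.normal((N, T_k, d_model))

# In the paper, Q/K/V come from plain linear maps: no bias, no activation.
Q = tf.keras.layers.Dense(d_model, use_bias=False)(queries)  # (N, T_q, d_model)
K = tf.keras.layers.Dense(d_model, use_bias=False)(keys)     # (N, T_k, d_model)
V = tf.keras.layers.Dense(d_model, use_bias=False)(values)   # (N, T_k, d_model)
```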

roomylee avatar Sep 25 '18 07:09 roomylee

Similar question here: in the original paper, it seems that only W was used, with no bias and no activation function. I'm wondering about the design behind these three lines.

crystina-z avatar Oct 10 '18 05:10 crystina-z

@roomylee I think you're right. Look at this Keras implementation: https://github.com/Lsdefine/attention-is-all-you-need-keras/blob/master/transformer.py. You can find the definitions of Q, K, and V in lines 54-64; no bias or activation is used.

DaoD avatar Oct 18 '18 02:10 DaoD

I think the bias and activation are optional; they do no harm.

sunnnnnnnny avatar Oct 24 '18 07:10 sunnnnnnnny

You're right. In the latest code, it is:

```python
Q = tf.layers.dense(queries, d_model, use_bias=False)  # (N, T_q, d_model)
K = tf.layers.dense(keys, d_model, use_bias=False)     # (N, T_k, d_model)
V = tf.layers.dense(values, d_model, use_bias=False)   # (N, T_k, d_model)
```

ty5491003 avatar Mar 16 '19 07:03 ty5491003

Why is there only one projection here? Shouldn't there be 8 projections with different weights, as in the paper?

Vipning avatar Jun 03 '19 03:06 Vipning
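
A single projection to d_model followed by a reshape into heads is equivalent to 8 separate per-head projections, because the combined weight matrix is just the 8 per-head matrices concatenated along the output axis. A minimal sketch of that split (illustrative, not the repo's exact code):

```python
import tensorflow as tf

d_model, num_heads = 512, 8
d_head = d_model // num_heads   # 64, as in the paper
N, T = 2, 5

x = tf.random.normal((N, T, d_model))

# One projection to d_model...
Q = tf.keras.layers.Dense(d_model, use_bias=False)(x)   # (N, T, d_model)

# ...then split the last axis into heads. Columns i*d_head:(i+1)*d_head of the
# single weight matrix play the role of head i's own (d_model x d_head) projection.
Q = tf.reshape(Q, (N, T, num_heads, d_head))             # (N, T, h, d_head)
Q = tf.transpose(Q, (0, 2, 1, 3))                        # (N, h, T, d_head)
```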

> In the paper, Attention Is All You Need, query, key, and value are linearly transformed without bias in the multi-head attention. However, the variables in your code are transformed with a bias. Is there a reason for using a bias, or is there something I'm missing?
>
> https://github.com/Kyubyong/transformer/blob/6672f93c57ee97412a92989ed075d65244400227/modules.py#L201-L203

Hi, for self-attention, do you know why the value needs to be linearly transformed? For self-attention, the query and key must be different, so it makes sense that they are linearly transformed.

GuoYL36 avatar Feb 20 '21 02:02 GuoYL36

In the recent TensorFlow (tensor2tensor) implementation, bias=False is used for the multi-head attention: https://github.com/tensorflow/tensor2tensor/blob/53a1be68727b5d5c3a0d0bf18721013843a49041/tensor2tensor/layers/common_attention.py#L4415-L4417

So bias=False seems to be right.

sadahry avatar Aug 16 '22 13:08 sadahry

Then what difference does it make whether we use a bias or not?

hellojialee avatar Nov 03 '22 02:11 hellojialee
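
One difference that can be checked concretely: a bias on the key projection adds the same constant q·b to every logit for a given query, so it cancels inside the softmax and only adds unused parameters, while biases on the query and value projections can change the output. A small NumPy sketch of the key-bias cancellation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, T = 64, 5
q = rng.normal(size=(d,))      # one query vector
K = rng.normal(size=(T, d))    # T key vectors
b = rng.normal(size=(d,))      # a bias added to every key

w_plain = softmax(q @ K.T / np.sqrt(d))
w_bias  = softmax(q @ (K + b).T / np.sqrt(d))  # adds the constant q @ b to every logit

print(np.allclose(w_plain, w_bias))  # True: a key bias cannot change the attention weights
```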

Thanks, I'll reply to you soon.

bobobe avatar Nov 03 '22 02:11 bobobe