
Linear transform with bias at multi-head attention

Open roomylee opened this issue 6 years ago • 9 comments

In the paper, Attention Is All You Need, query, key, and value are linearly transformed without bias in the multi-head attention. However, the variables in your code are transformed with a bias. Is there a reason for using a bias, or is there something I'm missing?

Thanks.

https://github.com/Kyubyong/transformer/blob/6672f93c57ee97412a92989ed075d65244400227/modules.py#L201-L203
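
For reference, a minimal sketch of what the paper's bias-free projections would look like (illustrative TF2-style code, not the repo's actual implementation):

```python
import tensorflow as tf

d_model = 512
N, T_q, T_k = 2, 5, 7  # toy batch and sequence sizes

queries = tf.random.normal((N, T_q, d_model))
keys    = tf.random.normal((N, T_k, d_model))
values  = tf.random.normal((N, T_k, d_model))

# In the paper, Q/K/V come from plain linear maps: no bias, no activation.
Q = tf.keras.layers.Dense(d_model, use_bias=False)(queries)  # (N, T_q, d_model)
K = tf.keras.layers.Dense(d_model, use_bias=False)(keys)     # (N, T_k, d_model)
V = tf.keras.layers.Dense(d_model, use_bias=False)(values)   # (N, T_k, d_model)
```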

roomylee avatar Sep 25 '18 07:09 roomylee

Similar question here: in the original paper, it seems that only W was used, with no bias and no activation function. I'm wondering about the design behind these three lines.

crystina-z avatar Oct 10 '18 05:10 crystina-z

@roomylee I think you're right. Look at this Keras implementation: https://github.com/Lsdefine/attention-is-all-you-need-keras/blob/master/transformer.py. You can find the definitions of Q, K, and V in lines 54-64; no bias or activation is used.

DaoD avatar Oct 18 '18 02:10 DaoD

I think the bias and activation are optional; they do no harm.

sunnnnnnnny avatar Oct 24 '18 07:10 sunnnnnnnny

You're right. In the latest code, it is:

```python
Q = tf.layers.dense(queries, d_model, use_bias=False)  # (N, T_q, d_model)
K = tf.layers.dense(keys, d_model, use_bias=False)     # (N, T_k, d_model)
V = tf.layers.dense(values, d_model, use_bias=False)   # (N, T_k, d_model)
```

ty5491003 avatar Mar 16 '19 07:03 ty5491003

Why is there only one projection here? Shouldn't there be 8 projections with different weights, as in the paper?

Vipning avatar Jun 03 '19 03:06 Vipning
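
A single projection to d_model followed by a reshape into heads is equivalent to 8 separate per-head projections, because the combined weight matrix is just the 8 per-head matrices concatenated along the output axis. A minimal sketch of that split (illustrative, not the repo's exact code):

```python
import tensorflow as tf

d_model, num_heads = 512, 8
d_head = d_model // num_heads   # 64, as in the paper
N, T = 2, 5

x = tf.random.normal((N, T, d_model))

# One projection to d_model...
Q = tf.keras.layers.Dense(d_model, use_bias=False)(x)   # (N, T, d_model)

# ...then split the last axis into heads. Columns i*d_head:(i+1)*d_head of the
# single weight matrix play the role of head i's own (d_model x d_head) projection.
Q = tf.reshape(Q, (N, T, num_heads, d_head))             # (N, T, h, d_head)
Q = tf.transpose(Q, (0, 2, 1, 3))                        # (N, h, T, d_head)
```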

> In the paper, Attention Is All You Need, query, key, and value are linearly transformed without bias in the multi-head attention. However, the variables in your code are transformed with a bias. Is there a reason for using a bias, or is there something I'm missing?
>
> https://github.com/Kyubyong/transformer/blob/6672f93c57ee97412a92989ed075d65244400227/modules.py#L201-L203

Hi, for self-attention, do you know why the value needs to be linearly transformed? For self-attention, the query and key must be different, so it makes sense that they are linearly transformed.

GuoYL36 avatar Feb 20 '21 02:02 GuoYL36

In the recent TensorFlow (tensor2tensor) implementation, bias=False is used for the multi-head attention: https://github.com/tensorflow/tensor2tensor/blob/53a1be68727b5d5c3a0d0bf18721013843a49041/tensor2tensor/layers/common_attention.py#L4415-L4417

So bias=False seems to be right.

sadahry avatar Aug 16 '22 13:08 sadahry

Then what difference does it make whether we use a bias or not?

hellojialee avatar Nov 03 '22 02:11 hellojialee
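
One difference that can be checked concretely: a bias on the key projection adds the same constant q·b to every logit for a given query, so it cancels inside the softmax and only adds unused parameters, while biases on the query and value projections can change the output. A small NumPy sketch of the key-bias cancellation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, T = 64, 5
q = rng.normal(size=(d,))      # one query vector
K = rng.normal(size=(T, d))    # T key vectors
b = rng.normal(size=(d,))      # a bias added to every key

w_plain = softmax(q @ K.T / np.sqrt(d))
w_bias  = softmax(q @ (K + b).T / np.sqrt(d))  # adds the constant q @ b to every logit

print(np.allclose(w_plain, w_bias))  # True: a key bias cannot change the attention weights
```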

Thanks, I'll reply to you soon.

bobobe avatar Nov 03 '22 02:11 bobobe