tf-nlp-blocks
Unneeded ReLU in Transformer-match
Hi, thank you for sharing this great repository!
I think the ReLU in the k/q/v projections is unneeded, or at least inconsistent with the Transformer paper: https://github.com/hanxiao/tf-nlp-blocks/blob/master/nlp/match_blocks.py#L113
For comparison, the corresponding computation in Google's tensor2tensor uses plain linear projections: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py#L3732
paper: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
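For concreteness, here is a minimal sketch of the two variants (written with tf.keras rather than the repository's exact code; `d_model` and the tensor shapes are just illustrative assumptions):

```python
import tensorflow as tf

d_model = 512

# Variant as currently in match_blocks.py (based on the linked line):
# a Dense projection followed by a ReLU, i.e. a *non-linear* projection.
q_proj_relu = tf.keras.layers.Dense(d_model, activation="relu")

# Variant from "Attention Is All You Need": Q = X W^Q is a purely
# *linear* map, with no activation on the q/k/v projections.
q_proj_linear = tf.keras.layers.Dense(d_model, activation=None)

x = tf.random.normal([2, 10, d_model])  # (batch, seq_len, d_model)
print(q_proj_relu(x).shape, q_proj_linear(x).shape)
```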
What do you think? Thanks!
Dear @urialon, I don't think it is inconsistent with the "attention_layer" implemented in "bert/modeling.py". There you can set query_act/key_act/value_act to ReLU, GELU, or None. I don't think it matters whether the projection is non-linear or linear.
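A rough sketch of that configurable-activation pattern (illustrative only, not the actual bert/modeling.py code; `project` is a made-up helper):

```python
import tensorflow as tf

def project(x, units, act=None):
    # Q/K/V projection with a configurable activation, in the spirit of the
    # query_act/key_act/value_act arguments in bert/modeling.py (None = linear).
    return tf.keras.layers.Dense(units, activation=act)(x)

x = tf.random.normal([2, 10, 512])    # (batch, seq_len, hidden)
q = project(x, 512, act=None)         # linear, as in the Transformer paper
k = project(x, 512, act=tf.nn.relu)   # non-linear, as currently in match_blocks.py
v = project(x, 512, act=None)
```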
@libertatis I actually agree with @urialon: the default settings (notice we can't even choose the activation function) should be true to the Transformer described in the paper, which uses linear projections.
Also, I believe the authors of BERT did test settings with both linear and non-linear projections and settled on linear for a reason.