tf-nlp-blocks
Unneeded ReLU in Transformer-match
Hi, thank you for sharing this great repository!
I think the ReLU in the k/q/v projections is unneeded, or at least inconsistent with the Transformer paper: https://github.com/hanxiao/tf-nlp-blocks/blob/master/nlp/match_blocks.py#L113
For comparison, the corresponding computation in Google's tensor2tensor uses plain linear projections: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py#L3732
paper: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
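For concreteness, here is a minimal sketch of the two variants (written with tf.keras rather than the repository's exact code; `d_model` and the tensor shapes are just illustrative assumptions):

```python
import tensorflow as tf

d_model = 512

# Variant as currently in match_blocks.py (based on the linked line):
# a Dense projection followed by a ReLU, i.e. a *non-linear* projection.
q_proj_relu = tf.keras.layers.Dense(d_model, activation="relu")

# Variant from "Attention Is All You Need": Q = X W^Q is a purely
# *linear* map, with no activation on the q/k/v projections.
q_proj_linear = tf.keras.layers.Dense(d_model, activation=None)

x = tf.random.normal([2, 10, d_model])  # (batch, seq_len, d_model)
print(q_proj_relu(x).shape, q_proj_linear(x).shape)
```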
What do you think? Thanks!
Dear @urialon, I don't think it is inconsistent with the "attention_layer" implemented in "bert/modeling.py". There you can set query_act/key_act/value_act to ReLU, GELU, or None. I don't think it matters whether the projection is non-linear or linear.
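A rough sketch of that configurable-activation pattern (illustrative only, not the actual bert/modeling.py code; `project` is a made-up helper):

```python
import tensorflow as tf

def project(x, units, act=None):
    # Q/K/V projection with a configurable activation, in the spirit of the
    # query_act/key_act/value_act arguments in bert/modeling.py (None = linear).
    return tf.keras.layers.Dense(units, activation=act)(x)

x = tf.random.normal([2, 10, 512])    # (batch, seq_len, hidden)
q = project(x, 512, act=None)         # linear, as in the Transformer paper
k = project(x, 512, act=tf.nn.relu)   # non-linear, as currently in match_blocks.py
v = project(x, 512, act=None)
```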
@libertatis I actually agree with @urialon: the default settings (notice we can't even choose the activation function) should be true to the Transformer described in the paper, which uses linear projections.
Also, I believe the authors of BERT did test settings with both linear and non-linear projections and settled on linear for a reason.