FastBERT
I'm curious about the reason for using a self-attention layer in each classifier layer.
First of all, thank you for kindly sharing this work.
Why do you think each classifier layer needs its own self-attention?
The paper also says that this self-attention is computed in a reduced 128-dimensional space.
What would be the difference compared to producing the prediction directly from the 768-dimensional hidden states, without that self-attention?
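To make the question concrete, here is a rough, hypothetical sketch of how I understand the per-layer student classifier (not the repo's actual code): project the 768-d hidden states down to 128-d, run one self-attention layer over the tokens, pool, then classify. All class and parameter names here are just illustrative.

```python
import torch
import torch.nn as nn

class StudentClassifierSketch(nn.Module):
    """Hypothetical per-layer student classifier sketch (names illustrative):
    768-d -> 128-d projection, one self-attention layer, pooling, classifier."""

    def __init__(self, hidden_size=768, attn_size=128, num_labels=2, num_heads=1):
        super().__init__()
        # Down-projection 768 -> 128 keeps the extra per-layer cost small.
        self.down_proj = nn.Linear(hidden_size, attn_size)
        # One self-attention layer lets the classifier re-mix token information
        # before pooling, rather than reading each token vector in isolation.
        self.self_attn = nn.MultiheadAttention(attn_size, num_heads, batch_first=True)
        self.classifier = nn.Linear(attn_size, num_labels)

    def forward(self, hidden_states):           # (batch, seq_len, 768)
        x = self.down_proj(hidden_states)        # (batch, seq_len, 128)
        x, _ = self.self_attn(x, x, x)           # token mixing in the 128-d space
        pooled = x[:, 0]                         # pool, e.g. at the [CLS] position
        return torch.softmax(self.classifier(pooled), dim=-1)

# Usage sketch: one such classifier is attached after each transformer layer,
# and inference exits early once the prediction entropy is low enough.
probs = StudentClassifierSketch()(torch.randn(2, 16, 768))
```

So my question is essentially whether the self-attention step in this sketch adds something that a plain linear classifier on the 768-d hidden states would miss.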