FastBERT
I'm curious about the reason for using a self-attention layer in each classifier layer.
First of all, thank you for kindly sharing this work.
Why do you think each classifier layer needs its own self-attention?
The paper also says that this self-attention is computed in a reduced 128-dimensional space.
What would be the difference compared to producing the prediction directly from the 768-dimensional hidden states, without that self-attention?
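To make the question concrete, here is a rough, hypothetical sketch of how I understand the per-layer student classifier (not the repo's actual code): project the 768-d hidden states down to 128-d, run one self-attention layer over the tokens, pool, then classify. All class and parameter names here are just illustrative.

```python
import torch
import torch.nn as nn

class StudentClassifierSketch(nn.Module):
    """Hypothetical per-layer student classifier sketch (names illustrative):
    768-d -> 128-d projection, one self-attention layer, pooling, classifier."""

    def __init__(self, hidden_size=768, attn_size=128, num_labels=2, num_heads=1):
        super().__init__()
        # Down-projection 768 -> 128 keeps the extra per-layer cost small.
        self.down_proj = nn.Linear(hidden_size, attn_size)
        # One self-attention layer lets the classifier re-mix token information
        # before pooling, rather than reading each token vector in isolation.
        self.self_attn = nn.MultiheadAttention(attn_size, num_heads, batch_first=True)
        self.classifier = nn.Linear(attn_size, num_labels)

    def forward(self, hidden_states):           # (batch, seq_len, 768)
        x = self.down_proj(hidden_states)        # (batch, seq_len, 128)
        x, _ = self.self_attn(x, x, x)           # token mixing in the 128-d space
        pooled = x[:, 0]                         # pool, e.g. at the [CLS] position
        return torch.softmax(self.classifier(pooled), dim=-1)

# Usage sketch: one such classifier is attached after each transformer layer,
# and inference exits early once the prediction entropy is low enough.
probs = StudentClassifierSketch()(torch.randn(2, 16, 768))
```

So my question is essentially whether the self-attention step in this sketch adds something that a plain linear classifier on the 768-d hidden states would miss.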