cosFormer
Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention
We are examining non-NLP applications of cosFormer self-attention and would need to use attention masking for the padded tokens in a batch. Is there a way to incorporate this...
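A minimal sketch of one common way to handle this in linear attention (not taken from this repo): zero out the padded key positions before the key-value summation so they contribute nothing to the output. The `key_padding_mask` argument name (True marks padding) and the ReLU feature map are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def linear_attention_with_padding(q, k, v, key_padding_mask=None, eps=1e-6):
    # q, k, v: (batch, seq_len, dim); key_padding_mask: (batch, seq_len), True = pad
    q, k = F.relu(q), F.relu(k)
    if key_padding_mask is not None:
        # Zero padded keys so they add nothing to sum_j phi(k_j) v_j^T
        k = k.masked_fill(key_padding_mask.unsqueeze(-1), 0.0)
    kv = torch.einsum("bnd,bne->bde", k, v)                      # global key-value summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # normalizer per query
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```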
When implementing cosFormer's MultiHeadAttention in Transformer-XL and running without extra long-range memory, ReLU performs worse than ELU. I think it is because the attention and FF net...
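For reference, a hedged comparison of the two feature maps in question: the ReLU map used in cosFormer versus the elu(x)+1 map from linear transformers (Katharopoulos et al.). Neither snippet is taken from this repository; both keep similarity scores non-negative, but elu(x)+1 is strictly positive and smooth, while ReLU zeroes out negative activations entirely.

```python
import torch
import torch.nn.functional as F

def relu_feature_map(x):
    return F.relu(x)          # >= 0, many exact zeros

def elu_feature_map(x):
    return F.elu(x) + 1.0     # > 0 everywhere, no dead activations

x = torch.randn(2, 8, 64)
print(relu_feature_map(x).min())
print(elu_feature_map(x).min())
```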
The original code mixes an int type with a function type.
In the paper, it is mentioned that bidirectional language modeling pre-training has been done. Are you planning to release pre-trained weights for the model?
Compared with the `left_product` function, the attention mask is not used in the `forward()` function. How can the attention mask be applied in the forward method?
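A possible explanation, sketched below under assumptions (function and tensor names are illustrative, not this repo's API): the quadratic `left_product` form materializes the full n×n score matrix, so an `attn_mask` can be applied to it directly, whereas a linear `forward` never forms that matrix. Causal masking is then typically enforced with cumulative sums over the sequence instead of an explicit mask.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, dim)
    q, k = F.relu(q), F.relu(k)
    # Running sum of phi(k_j) v_j^T up to each position i replaces the causal mask
    kv = torch.cumsum(torch.einsum("bnd,bne->bnde", k, v), dim=1)
    # Running sum of phi(k_j) for the normalizer
    k_cum = torch.cumsum(k, dim=1)
    z = 1.0 / (torch.einsum("bnd,bnd->bn", q, k_cum) + eps)
    return torch.einsum("bnd,bnde,bn->bne", q, kv, z)
```

Note that this version keeps a (batch, seq_len, dim, dim) tensor in memory for clarity; chunked or recurrent variants trade that memory for a loop over the sequence.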