cosFormer
Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention
We are examining non-NLP applications of cosFormer self-attention and would need to use attention masking for the padded tokens in a batch. Is there a way to incorporate this...
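A minimal sketch of one common way to handle this in linear attention (not taken from this repo): zero out the padded key positions before the key-value summation so they contribute nothing to the output. The `key_padding_mask` argument name (True marks padding) and the ReLU feature map are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def linear_attention_with_padding(q, k, v, key_padding_mask=None, eps=1e-6):
    # q, k, v: (batch, seq_len, dim); key_padding_mask: (batch, seq_len), True = pad
    q, k = F.relu(q), F.relu(k)
    if key_padding_mask is not None:
        # Zero padded keys so they add nothing to sum_j phi(k_j) v_j^T
        k = k.masked_fill(key_padding_mask.unsqueeze(-1), 0.0)
    kv = torch.einsum("bnd,bne->bde", k, v)                      # global key-value summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # normalizer per query
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```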
When implementing cosFormer's MultiHeadAttention in Transformer-XL and running without extra long-range memory, ReLU performs worse than ELU. I think it is because the attention and FF net...
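For reference, a hedged comparison of the two feature maps in question: the ReLU map used in cosFormer versus the elu(x)+1 map from linear transformers (Katharopoulos et al.). Neither snippet is taken from this repository; both keep similarity scores non-negative, but elu(x)+1 is strictly positive and smooth, while ReLU zeroes out negative activations entirely.

```python
import torch
import torch.nn.functional as F

def relu_feature_map(x):
    return F.relu(x)          # >= 0, many exact zeros

def elu_feature_map(x):
    return F.elu(x) + 1.0     # > 0 everywhere, no dead activations

x = torch.randn(2, 8, 64)
print(relu_feature_map(x).min())
print(elu_feature_map(x).min())
```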
The original code mixes an int type with a function type.
In the paper, it is mentioned that bidirectional language modeling pre-training has been done. Are you planning to release pre-trained weights for the model?
Compared with the `left_product` function, the attention mask is not used in the `forward()` function. How can the attention mask be applied in the forward method?
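A possible explanation, sketched below under assumptions (function and tensor names are illustrative, not this repo's API): the quadratic `left_product` form materializes the full n×n score matrix, so an `attn_mask` can be applied to it directly, whereas a linear `forward` never forms that matrix. Causal masking is then typically enforced with cumulative sums over the sequence instead of an explicit mask.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, seq_len, dim)
    q, k = F.relu(q), F.relu(k)
    # Running sum of phi(k_j) v_j^T up to each position i replaces the causal mask
    kv = torch.cumsum(torch.einsum("bnd,bne->bnde", k, v), dim=1)
    # Running sum of phi(k_j) for the normalizer
    k_cum = torch.cumsum(k, dim=1)
    z = 1.0 / (torch.einsum("bnd,bnd->bn", q, k_cum) + eps)
    return torch.einsum("bnd,bnde,bn->bne", q, kv, z)
```

Note that this version keeps a (batch, seq_len, dim, dim) tensor in memory for clarity; chunked or recurrent variants trade that memory for a loop over the sequence.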