[RFC] Fuse shortconv and output norm/gate into the kernel
Proposal
Fuse the short convolution (shortconv) and the output norm/gate into the attention kernels, similar to how Mamba1 and Mamba2 fuse them into their scan kernels (see the sketch below).
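A minimal PyTorch sketch of the semantics such a fused kernel would implement, not the actual fla API or Triton code. `linear_attn_core`, the per-projection conv weights, and the (batch, seqlen, dim) layout are assumptions for illustration; the point is that the conv'd q/k/v and the pre-norm output never need to be materialized in global memory.

```python
import torch
import torch.nn.functional as F

def short_conv(x, weight):
    # Causal depthwise conv1d over the sequence dim.
    # x: (batch, seqlen, dim), weight: (dim, kernel_size) -- illustrative shapes.
    b, t, d = x.shape
    k = weight.shape[-1]
    y = F.conv1d(x.transpose(1, 2), weight.unsqueeze(1), padding=k - 1, groups=d)
    return y[..., :t].transpose(1, 2)

def fused_forward(q, k, v, w_q, w_k, w_v, g, norm_weight, linear_attn_core):
    # Prologue: shortconv on q/k/v computed on the fly instead of being saved
    # as three extra activations.
    q, k, v = (short_conv(x, w) for x, w in ((q, w_q), (k, w_k), (v, w_v)))
    # Core: placeholder for the existing chunked linear-attention kernel.
    o = linear_attn_core(q, k, v)
    # Epilogue: gated RMSNorm applied before writing the output back,
    # analogous to Mamba2's fused norm + gate.
    o = o * torch.rsqrt(o.pow(2).mean(-1, keepdim=True) + 1e-6) * norm_weight
    return o * F.silu(g)
```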
Rationale
Applying ShortConv to Q, K, and V introduces three additional activations that must be saved for the backward pass, resulting in non-negligible memory overhead.
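For an illustrative (assumed) configuration: with batch size 8, sequence length 4096, hidden size 4096, and bf16 activations, each saved tensor is 8 * 4096 * 4096 * 2 B = 256 MiB, so the three conv outputs add roughly 768 MiB per layer. Fusing the conv into the kernel lets the conv'd q/k/v be recomputed in the backward pass instead of stored.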