[RFC] Fuse shortconv and output norm/gate into the kernel

Open sustcsonglin opened this issue 9 months ago • 0 comments

Proposal

Fuse ShortConv and the output norm/gate into the kernels, as is done in Mamba1 and Mamba2.

Rationale

Applying ShortConv to Q, K, and V introduces three additional activations (one per projection), resulting in non-negligible memory overhead.
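A rough back-of-the-envelope sketch of that overhead, for illustration only. It assumes each of the three ShortConv outputs has shape `(batch, seq_len, dim)` and is stored in a 2-byte dtype such as bf16; the helper name and the example sizes are hypothetical and not taken from the library.

```python
def shortconv_activation_bytes(batch, seq_len, dim, bytes_per_elem=2, num_streams=3):
    """Estimate the bytes held by the extra ShortConv outputs in one layer.

    Hypothetical helper: assumes one (batch, seq_len, dim) tensor per stream
    (Q, K, V) and a fixed element size (2 bytes for bf16/fp16).
    """
    return num_streams * batch * seq_len * dim * bytes_per_elem


# Illustrative sizes only, not a measured configuration.
extra = shortconv_activation_bytes(batch=8, seq_len=4096, dim=2048)
print(f"{extra / 2**20:.0f} MiB per layer")  # prints "384 MiB per layer"
```

Under these assumptions the cost scales linearly with the number of layers, which is why fusing the convolution into the kernel can matter for long sequences.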

sustcsonglin, Jan 24 '25 03:01