flash-linear-attention
[RFC] Autotune should consider batch size and number of heads
Proposal
The autotuned kernel configuration should be re-selected when (batch size × number of heads) changes, rather than reusing a configuration tuned for a different shape.
Rationale
The performance of the autotuned kernel can vary significantly when (batch size × number of heads) changes, since these two dimensions determine how much parallelism is available to the kernel.
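As a concrete illustration, here is a minimal sketch (not one of the library's actual kernels; the kernel and argument names are illustrative) of keying Triton's autotuner on the batch and head arguments, so that a change in either triggers a fresh configuration search instead of reusing the config cached for the first shape seen:

```python
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({'BLOCK_T': 64}, num_warps=2),
        triton.Config({'BLOCK_T': 128}, num_warps=4),
        triton.Config({'BLOCK_T': 256}, num_warps=8),
    ],
    # Re-run the search whenever the batch size or the number of heads changes,
    # since they determine how many programs run in parallel.
    key=['B', 'H'],
)
@triton.jit
def scale_kernel(x_ptr, y_ptr, alpha,
                 B, H, T,
                 D: tl.constexpr, BLOCK_T: tl.constexpr):
    # One program per (batch * head) slice and per tile of the sequence dimension.
    i_bh = tl.program_id(0)
    i_t = tl.program_id(1)
    o_t = i_t * BLOCK_T + tl.arange(0, BLOCK_T)
    o_d = tl.arange(0, D)
    m_t = o_t < T
    p = i_bh * T * D + o_t[:, None] * D + o_d[None, :]
    x = tl.load(x_ptr + p, mask=m_t[:, None], other=0.)
    tl.store(y_ptr + p, x * alpha, mask=m_t[:, None])


def scale(x: torch.Tensor, alpha: float) -> torch.Tensor:
    # x has shape (B, H, T, D), contiguous, with D a power of two (e.g. head dim 64).
    B, H, T, D = x.shape
    y = torch.empty_like(x)
    grid = lambda meta: (B * H, triton.cdiv(T, meta['BLOCK_T']))
    scale_kernel[grid](x, y, alpha, B, H, T, D=D)
    return y
```

With `key=['B', 'H']`, Triton caches the best configuration per distinct (B, H) pair rather than applying the configuration found for whatever shape happened to run first.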
Autotuning should also take the total sequence length into account, as the sequence length dimension provides parallelism in addition to the number of heads and batch size.
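Along the same lines, the sequence length could be added to the key in the sketch above (`key=['B', 'H', 'T']`), with the caveat raised below that T changes frequently during training and decoding, so every new value would trigger another tuning pass. For reference, a short usage note (hypothetical shapes) showing how the key drives re-tuning:

```python
# Hypothetical shapes: the two calls differ only in batch size, so with
# key=['B', 'H'] each triggers its own tuning pass and cached configuration.
x1 = torch.randn(4, 8, 2048, 64, device='cuda')
x2 = torch.randn(32, 8, 2048, 64, device='cuda')
y1 = scale(x1, 2.0)   # tunes and caches a config for (B=4,  H=8)
y2 = scale(x2, 2.0)   # tunes again for (B=32, H=8) instead of reusing the first config
```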
I would consider this issue, but since the token length keeps changing during training and inference, autotuning over the token length still needs further consideration.