flash-linear-attention
Lack of speed advantage in GLA training
I have compared the speed of GLA against standard attention and FlashAttention-2, as shown in the table below (all timings in seconds; a sketch of the timing loop I used follows the table). GLA shows little to no speed advantage. What could be the reasons behind this result?
| seq_len | attention | flash attn 2 | fused_chunk_gla | chunk_gla | fused_recurrent_gla |
|---|---|---|---|---|---|
| 1163 | 0.000822s | 0.000177s | 0.00200s | 0.00138s | 0.000860s |
| 1172 | 0.000769s | 0.000192s | 0.00197s | 0.00138s | 0.000851s |
| 1346 | 0.000782s | 0.000185s | 0.00186s | 0.00143s | 0.000870s |
| 1366 | 0.000827s | 0.000154s | 0.00183s | 0.00144s | 0.000872s |
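Roughly, the measurement loop looks like the sketch below. The shapes, dtypes, and the `fla` import path are placeholders rather than my exact script, so they may need to be adapted; the GLA call is left commented out for that reason.

```python
import torch
import torch.nn.functional as F

# Placeholder import -- adjust to the module path of your fla installation.
# from fla.ops.gla import fused_chunk_gla, chunk_gla, fused_recurrent_gla

def bench(fn, warmup=25, iters=100):
    """Time a CUDA callable with events after a warmup phase to exclude compile/launch overhead."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    # elapsed_time returns milliseconds; convert to seconds per call
    return start.elapsed_time(end) / iters / 1e3

# Example shapes (batch, heads, seq_len, head_dim) -- illustrative only.
B, H, T, D = 8, 4, 2048, 64
q = torch.randn(B, H, T, D, device='cuda', dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Baseline: PyTorch SDPA, which dispatches to FlashAttention when available.
print('sdpa:', bench(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=True)))

# GLA kernels would be timed the same way, e.g. (gate tensor g omitted here):
# print('fused_chunk_gla:', bench(lambda: fused_chunk_gla(q, k, v, g)))
```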
Environment:
- GPU: NVIDIA GeForce RTX 3090
- Driver Version: 525.89.02
- CUDA Version: 11.8
- torch 2.0.1
- accelerate 0.21.0
- transformers 4.31.0
- triton 2.2.0