
Lack of speed advantage in GLA training

Open · Yingyue-L opened this issue on Jul 17, 2024 · 3 comments

I compared the speeds of GLA, standard attention, and FlashAttention-2, as shown in the table below, and found that GLA has little to no speed advantage. What could be the reasons behind this result?

| seq_len | attention (s) | flash attn 2 (s) | fused_chunk_gla (s) | chunk_gla (s) | fused_recurrent_gla (s) |
|--------:|--------------:|-----------------:|--------------------:|--------------:|------------------------:|
| 1163    | 0.000822      | 0.000177         | 0.00200             | 0.00138       | 0.000860                 |
| 1172    | 0.000769      | 0.000192         | 0.00197             | 0.00138       | 0.000851                 |
| 1346    | 0.000782      | 0.000185         | 0.00186             | 0.00143       | 0.000870                 |
| 1366    | 0.000827      | 0.000154         | 0.00183             | 0.00144       | 0.000872                 |
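For reference, a minimal sketch of such a forward-pass timing setup (not my exact script). The `fla.ops.gla` import path and the (batch, heads, seq_len, head_dim) layout follow this repo's README; the exact signature may differ across versions.

```python
import torch
import torch.nn.functional as F
from fla.ops.gla import fused_chunk_gla  # assumed import path

def bench(fn, iters=100):
    # Warm up, then time with CUDA events; kernel launches are
    # asynchronous, so explicit synchronization is required for
    # meaningful numbers.
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters / 1e3  # seconds per call

B, H, T, D = 1, 4, 2048, 128
q, k, v = (torch.randn(B, H, T, D, device='cuda', dtype=torch.bfloat16)
           for _ in range(3))
# Log forget gates in (-inf, 0), as GLA expects.
g = F.logsigmoid(torch.randn(B, H, T, D, device='cuda', dtype=torch.bfloat16))

# is_causal=True so both paths compute a causal result.
print('sdpa:           ',
      bench(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=True)))
print('fused_chunk_gla:',
      bench(lambda: fused_chunk_gla(q, k, v, g)))
```

Note that timings taken without CUDA synchronization (or CUDA events) under-report kernel time, which is one common source of misleading microbenchmarks.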

Environment:

```
NVIDIA GeForce RTX 3090
Driver Version: 525.89.02
CUDA Version: 11.8
torch                    2.0.1
accelerate               0.21.0
transformers             4.31.0
triton                   2.2.0
```
