flash-linear-attention
Lack of speed advantage in GLA training
I have compared the speed of GLA against standard attention and FlashAttention-2, as shown in the table below (all timings in seconds; a sketch of the timing loop I used follows the table). GLA shows little to no speed advantage. What could be the reasons behind this result?
| seq_len | attention | flash attn 2 | fused_chunk_gla | chunk_gla | fused_recurrent_gla |
|---|---|---|---|---|---|
| 1163 | 0.000822s | 0.000177s | 0.00200s | 0.00138s | 0.000860s |
| 1172 | 0.000769s | 0.000192s | 0.00197s | 0.00138s | 0.000851s |
| 1346 | 0.000782s | 0.000185s | 0.00186s | 0.00143s | 0.000870s |
| 1366 | 0.000827s | 0.000154s | 0.00183s | 0.00144s | 0.000872s |
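Roughly, the measurement loop looks like the sketch below. The shapes, dtypes, and the `fla` import path are placeholders rather than my exact script, so they may need to be adapted; the GLA call is left commented out for that reason.

```python
import torch
import torch.nn.functional as F

# Placeholder import -- adjust to the module path of your fla installation.
# from fla.ops.gla import fused_chunk_gla, chunk_gla, fused_recurrent_gla

def bench(fn, warmup=25, iters=100):
    """Time a CUDA callable with events after a warmup phase to exclude compile/launch overhead."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    # elapsed_time returns milliseconds; convert to seconds per call
    return start.elapsed_time(end) / iters / 1e3

# Example shapes (batch, heads, seq_len, head_dim) -- illustrative only.
B, H, T, D = 8, 4, 2048, 64
q = torch.randn(B, H, T, D, device='cuda', dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Baseline: PyTorch SDPA, which dispatches to FlashAttention when available.
print('sdpa:', bench(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=True)))

# GLA kernels would be timed the same way, e.g. (gate tensor g omitted here):
# print('fused_chunk_gla:', bench(lambda: fused_chunk_gla(q, k, v, g)))
```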
Environment:
- GPU: NVIDIA GeForce RTX 3090
- Driver Version: 525.89.02
- CUDA Version: 11.8
- torch 2.0.1
- accelerate 0.21.0
- transformers 4.31.0
- triton 2.2.0