Tri Dao
Can you check that this line is executed in setup.py? It sets the compiler flag to compile for the 5090, etc.: https://github.com/Dao-AILab/flash-attention/blob/2f9ef0879a0935c3ca852f7a6a7b7a9c24f41e96/setup.py#L190
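For context, a minimal sketch of what a version-gated gencode flag looks like (this is illustrative, not the actual setup.py; the function name and the sm_90 baseline are assumptions). Compiling for the RTX 5090 (Blackwell, sm_120) requires nvcc to be new enough to understand `compute_120`:

```python
# Illustrative sketch: only append the sm_120 (RTX 5090 / Blackwell) gencode
# flag when nvcc is at least 12.8, since older CUDA toolkits reject
# compute_120. Function name and baseline arch here are assumptions.
def gencode_flags(nvcc_major: int, nvcc_minor: int) -> list:
    flags = ["-gencode", "arch=compute_90,code=sm_90"]  # Hopper (H100)
    if (nvcc_major, nvcc_minor) >= (12, 8):
        # Blackwell consumer GPUs (e.g. RTX 5090) are sm_120
        flags += ["-gencode", "arch=compute_120,code=sm_120"]
    return flags
```

So if the build was done with an older nvcc, the 5090 flag is silently skipped and the resulting wheel won't run on that GPU.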
Right, you need nvcc version >= 12.8 to compile for 5090.
It's just a heuristic to determine num_splits, and in this case it doesn't work super well. We can't use the info in cache_seqlens since that lives on the GPU, and reading it would force a device-to-host sync.
Are you suggesting a different heuristic based on cache_max_seq_len? When would that be better / worse than the current heuristic?
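To make the trade-off concrete, here's a simplified sketch of what a num_splits heuristic based on CPU-side metadata can look like (this is not FlashAttention's actual code; the function name, signature, and the max_splits cap are assumptions). The point is that it uses only host-visible values like cache_max_seq_len, never cache_seqlens, so no GPU sync is needed:

```python
# Illustrative num_splits heuristic (not FlashAttention's actual code):
# split the KV sequence only when there isn't already enough parallel work
# (one thread block per batch * head) to fill the GPU's SMs. Uses only
# CPU-side metadata, so no device sync is required.
def num_splits_heuristic(batch_nheads: int, num_sms: int, max_splits: int = 128) -> int:
    if batch_nheads >= num_sms:
        return 1  # already enough blocks to occupy every SM
    # Smallest split count that gives at least one block per SM.
    splits = (num_sms + batch_nheads - 1) // batch_nheads
    return min(splits, max_splits)
```

The downside of keying off cache_max_seq_len instead of the true per-request lengths is that a few long outliers make the heuristic over-split for everyone else, which is roughly the failure mode being discussed here.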
There's a PR for that, will be merged soon.
I'm not familiar with XInference
Great, thank you so much! This bug is fixed in 4.1.
Thanks for the great suggestion. We've been pretty busy with a conference deadline, but after this week we'll have more time.
On A100 or H100? If you're on H100, the Triton version uses the new H100 instructions but FA2 doesn't, so you should try FA3.
What speed (TFLOPS) do you get?
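For reference, the usual way attention TFLOPS is computed (a sketch; the exact convention can vary, e.g. whether causal masking is counted as half the FLOPs): each of the two matmuls, Q @ K^T and P @ V, costs 2 * seqlen^2 * headdim FLOPs per head.

```python
# Standard back-of-the-envelope attention FLOP count: two matmuls
# (Q @ K^T and P @ V), each 2 * seqlen^2 * headdim FLOPs per head,
# summed over batch and heads. Causal masking is conventionally
# counted as half the work.
def attention_tflops(batch, nheads, seqlen, headdim, time_s, causal=False):
    flops = 4 * batch * nheads * seqlen**2 * headdim
    if causal:
        flops /= 2
    return flops / time_s / 1e12
```

Reporting the number this way makes it easy to compare against the GPU's peak matmul throughput.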