Tri Dao
Can you check that this line is executed in setup.py? It sets the compiler flag to compile for the 5090, etc.: https://github.com/Dao-AILab/flash-attention/blob/2f9ef0879a0935c3ca852f7a6a7b7a9c24f41e96/setup.py#L190
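For context, a minimal sketch of what a version-gated gencode flag looks like (this is illustrative, not the actual setup.py; the function name and the sm_90 baseline are assumptions). Compiling for the RTX 5090 (Blackwell, sm_120) requires nvcc to be new enough to understand `compute_120`:

```python
# Illustrative sketch: only append the sm_120 (RTX 5090 / Blackwell) gencode
# flag when nvcc is at least 12.8, since older CUDA toolkits reject
# compute_120. Function name and baseline arch here are assumptions.
def gencode_flags(nvcc_major: int, nvcc_minor: int) -> list:
    flags = ["-gencode", "arch=compute_90,code=sm_90"]  # Hopper (H100)
    if (nvcc_major, nvcc_minor) >= (12, 8):
        # Blackwell consumer GPUs (e.g. RTX 5090) are sm_120
        flags += ["-gencode", "arch=compute_120,code=sm_120"]
    return flags
```

So if the build was done with an older nvcc, the 5090 flag is silently skipped and the resulting wheel won't run on that GPU.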
Right, you need nvcc version >= 12.8 to compile for 5090.
It's just a heuristic to determine num_splits, and in this case it doesn't work super well. We can't use the info in cache_seqlens since that lives on the GPU, and reading it would force a device-to-host sync.
Are you suggesting a different heuristic based on cache_max_seq_len? When would that be better / worse than the current heuristic?
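To make the trade-off concrete, here's a simplified sketch of what a num_splits heuristic based on CPU-side metadata can look like (this is not FlashAttention's actual code; the function name, signature, and the max_splits cap are assumptions). The point is that it uses only host-visible values like cache_max_seq_len, never cache_seqlens, so no GPU sync is needed:

```python
# Illustrative num_splits heuristic (not FlashAttention's actual code):
# split the KV sequence only when there isn't already enough parallel work
# (one thread block per batch * head) to fill the GPU's SMs. Uses only
# CPU-side metadata, so no device sync is required.
def num_splits_heuristic(batch_nheads: int, num_sms: int, max_splits: int = 128) -> int:
    if batch_nheads >= num_sms:
        return 1  # already enough blocks to occupy every SM
    # Smallest split count that gives at least one block per SM.
    splits = (num_sms + batch_nheads - 1) // batch_nheads
    return min(splits, max_splits)
```

The downside of keying off cache_max_seq_len instead of the true per-request lengths is that a few long outliers make the heuristic over-split for everyone else, which is roughly the failure mode being discussed here.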
There's a PR for that, will be merged soon.
I'm not familiar with XInference
Great, thank you so much! This bug is fixed in 4.1.
Thanks for the great suggestion. We've been pretty busy with a conference deadline, but after this week we'll have more time.
On A100 or H100? If you're on H100, the Triton version uses the new H100 instructions but FA2 doesn't, so you should try FA3.
What speed (TFLOPS) do you get?
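For reference, the usual way attention TFLOPS is computed (a sketch; the exact convention can vary, e.g. whether causal masking is counted as half the FLOPs): each of the two matmuls, Q @ K^T and P @ V, costs 2 * seqlen^2 * headdim FLOPs per head.

```python
# Standard back-of-the-envelope attention FLOP count: two matmuls
# (Q @ K^T and P @ V), each 2 * seqlen^2 * headdim FLOPs per head,
# summed over batch and heads. Causal masking is conventionally
# counted as half the work.
def attention_tflops(batch, nheads, seqlen, headdim, time_s, causal=False):
    flops = 4 * batch * nheads * seqlen**2 * headdim
    if causal:
        flops /= 2
    return flops / time_s / 1e12
```

Reporting the number this way makes it easy to compare against the GPU's peak matmul throughput.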