Tri Dao
Can you give a short script showing the numerical error?
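A minimal sketch of the kind of script being asked for, comparing `flash_attn_func` against a plain fp32 PyTorch attention reference (the shapes, dtype, and causal setting here are placeholders, not the reporter's configuration):

```python
import torch
from flash_attn import flash_attn_func

# Placeholder configuration -- substitute the failing shapes/dtype.
batch, seqlen, nheads, headdim = 2, 512, 8, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.float16)
           for _ in range(3))

out = flash_attn_func(q, k, v, causal=True)

# Reference attention computed in fp32 for comparison.
qf, kf, vf = (t.transpose(1, 2).float() for t in (q, k, v))
scores = qf @ kf.transpose(-2, -1) / headdim ** 0.5
mask = torch.triu(torch.ones(seqlen, seqlen, device="cuda",
                             dtype=torch.bool), 1)
scores.masked_fill_(mask, float("-inf"))
ref = (scores.softmax(-1) @ vf).transpose(1, 2)

print("max abs error:", (out.float() - ref).abs().max().item())
```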
Sure, we'll just need someone to contribute :D
The RTX 2080 (Turing) is not supported in the latest version.
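A guard along these lines fails early with a clear message instead of crashing inside the kernel (a sketch; the sm_80 floor reflects the Ampere-and-newer support mentioned here):

```python
import torch

# Turing is compute capability 7.5; recent flash-attn releases target
# Ampere (sm_80) and newer.
major, minor = torch.cuda.get_device_capability()
if (major, minor) < (8, 0):
    raise RuntimeError(
        f"GPU is sm_{major}{minor}; this flash-attn build needs sm_80 or newer"
    )
```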
Thanks so much for your work @skrider. Can you rebase and then I'll merge?
Yep, we have new wheels compiled for pytorch 2.3.0
Wheels are built for torch 2.2.2 and torch 2.3.0. It looks like they're not compatible with 2.2.0. You can try a previous version of flash-attn, or build from source.
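As an illustrative compatibility check before importing flash-attn (the `packaging` dependency and the message text are mine, not part of the library):

```python
import torch
from packaging import version

# Wheels for this flash-attn release were built against torch 2.2.2 and
# 2.3.0; 2.2.0 is reported not to work with them.
torch_ver = version.parse(torch.__version__.split("+")[0])
if torch_ver < version.parse("2.2.2"):
    print(f"torch {torch.__version__} predates the prebuilt wheels; "
          "install an older flash-attn or build from source.")
```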
Unfortunately I haven't had much bandwidth.
Turing cards have less shared memory (64 KB, versus 99 KB or 163 KB on Ampere), so supporting them might require adjusting the block sizes currently used.
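For reference, a small sketch that reports the shared-memory budget for the detected GPU, using the figures above (the table is hand-written; sm_89/Ada is assumed to match sm_86):

```python
import torch

# Per-SM shared memory by architecture, from the figures above and the
# CUDA docs (sm_89 assumed to match sm_86 here).
SMEM_KB = {(7, 5): 64, (8, 0): 163, (8, 6): 99, (8, 9): 99}

cc = torch.cuda.get_device_capability()
smem = SMEM_KB.get(cc)
if smem is None:
    print(f"sm_{cc[0]}{cc[1]}: not in table")
else:
    print(f"sm_{cc[0]}{cc[1]}: ~{smem} KB shared memory per SM")
```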
Make sure `nvcc` is a supported version by running `nvcc -V`.
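If it helps, a sketch that checks the `nvcc -V` output programmatically (the 11.7 floor is an illustrative assumption, not flash-attn's exact minimum):

```python
import re
import subprocess

# Parse the release number out of `nvcc -V` (e.g. "release 12.1, V12.1.105").
out = subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout
m = re.search(r"release (\d+)\.(\d+)", out)
if m is None:
    raise RuntimeError("could not parse `nvcc -V`; is the CUDA toolkit on PATH?")
major, minor = map(int, m.groups())
# 11.7 here is an illustrative floor, not the project's exact requirement.
if (major, minor) < (11, 7):
    print(f"nvcc {major}.{minor} is likely too old to build flash-attn")
```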
Can you save the tensors being passed to flash_attn_cuda.varlen_bwd and send them to me? Otherwise it would be very hard to debug. And can you print out the value of...
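A sketch of how one might capture those tensors with `torch.save` (the helper name, argument placeholder, and file path are made up for illustration):

```python
import torch

def dump_args(*args, path="varlen_bwd_inputs.pt"):
    # Hypothetical helper: move every tensor argument to CPU and save the
    # whole argument list so the failing call can be replayed offline.
    torch.save(
        [a.detach().cpu() if torch.is_tensor(a) else a for a in args],
        path,
    )

# At the call site, pass through exactly what goes into
# flash_attn_cuda.varlen_bwd, then attach the .pt file to the issue:
#   dump_args(*varlen_bwd_args)
```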