Tri Dao
The improvement in the backward pass is a combination of factors:
- Not using split-k (sketched below), so we reduce the amount of shared memory needed and the shared memory reads/writes.
- Better work...
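
For intuition only, here is a plain PyTorch sketch of what split-k means, not the kernel code itself: the K dimension is split into chunks whose partial products must be stored and summed afterwards, which is the extra memory traffic a non-split-k kernel avoids. Shapes and the number of splits are made up.

```python
import torch

# Arbitrary toy shapes, for illustration only.
M, N, K, num_splits = 128, 128, 1024, 4
split = K // num_splits
a = torch.randn(M, K)
b = torch.randn(K, N)

# Split-k style: each split computes a partial product over a slice of the
# K dimension; the partials then have to be written out and summed.
partials = [a[:, i * split:(i + 1) * split] @ b[i * split:(i + 1) * split, :]
            for i in range(num_splits)]
out_split_k = torch.stack(partials).sum(dim=0)

# Single pass over the full K dimension: no partial results to combine.
out_single = a @ b

assert torch.allclose(out_split_k, out_single, atol=1e-3)
```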
Thanks for the bug report. I can reproduce the error now.
Yes that's right.
Yes, q@k^T is in fp32, the softmax is done in fp32, and the result is then converted to bf16 for the gemm with V.
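
As a plain PyTorch reference for that dtype flow only (not the fused kernel; the shapes here are made up, and the fp32 accumulation of the bf16 matmul is approximated by upcasting the inputs):

```python
import torch

# Toy (batch, heads, seqlen, head_dim) tensors in bf16.
q = torch.randn(1, 2, 128, 64, dtype=torch.bfloat16)
k = torch.randn(1, 2, 128, 64, dtype=torch.bfloat16)
v = torch.randn(1, 2, 128, 64, dtype=torch.bfloat16)
softmax_scale = q.shape[-1] ** -0.5

# q @ k^T in fp32, softmax in fp32 ...
scores = (q.float() @ k.float().transpose(-2, -1)) * softmax_scale
probs = torch.softmax(scores, dim=-1)

# ... then cast back to bf16 for the gemm with V.
out = probs.to(torch.bfloat16) @ v
```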
Can you try with gcc 10?
What's your `nvcc` version?
Those all look reasonable, I've no idea why it fails. We recommend the [PyTorch](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) container from Nvidia, which has all the required tools to install FlashAttention.
Which CUDA version are you using?
> [@tridao](https://github.com/tridao) torch: 2.4.0+cu124, nvcc: V12.4.131

You should try the latest version. It works fine for me. Btw your sequence lengths aren't right since x has `seqlens*2-1` but you `cu_seqlens_kv`...
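
For what it's worth, a minimal sketch of how consistent cumulative sequence lengths are typically built for the varlen interface; the lengths here are made up, the point is only that the prefix sums and the packed tensors have to agree:

```python
import torch
import torch.nn.functional as F

# Made-up per-sequence lengths for the packed batch (on GPU in practice).
seqlens_q = torch.tensor([17, 33, 64], dtype=torch.int32)
seqlens_k = torch.tensor([17, 33, 64], dtype=torch.int32)

# cu_seqlens are the prefix sums starting at 0; the last entry must equal
# the total number of tokens packed into q (resp. k/v).
cu_seqlens_q = F.pad(torch.cumsum(seqlens_q, dim=0, dtype=torch.int32), (1, 0))
cu_seqlens_k = F.pad(torch.cumsum(seqlens_k, dim=0, dtype=torch.int32), (1, 0))

max_seqlen_q = int(seqlens_q.max())
max_seqlen_k = int(seqlens_k.max())

# q is then packed as (cu_seqlens_q[-1], nheads, head_dim) and k/v as
# (cu_seqlens_k[-1], nheads_k, head_dim); a mismatch between the packed
# lengths and the cu_seqlens leads to incorrect indexing inside the kernel.
```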
> varlen hang happening to me too on `flashattn-hopper==3.0.0b1` (the wheel distributed alongside `flash_attn==2.7.2.post1`).
>
> using CUDA 12.8, H100, pytorch 2.6.0, driver 535.216.01.
>
> if seqused is None,...