Tri Dao


Yup seems to check out with my calculation.

Why not just convert the HF weights to use the Llama implementation in this repo? https://github.com/Dao-AILab/flash-attention/blob/a86442f0f35c135c8ed8d7af760b1bd6a832ec07/tests/models/test_llama.py#L65 You can also see how we use rotary in MHA [here](https://github.com/Dao-AILab/flash-attention/blob/a86442f0f35c135c8ed8d7af760b1bd6a832ec07/flash_attn/modules/mha.py#L670).
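A rough sketch of that conversion path, mirroring the pattern in the linked test — the helpers `llama_config_to_gpt2_config` and `remap_state_dict_hf_llama` are assumed to come from `flash_attn.models.llama`, so check the test for the exact API:

```python
# Sketch only: convert an HF Llama checkpoint to this repo's GPT-style model,
# following tests/models/test_llama.py. Helper names are assumptions taken from
# that test; verify them against the linked file.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

from flash_attn.models.gpt import GPTLMHeadModel
from flash_attn.models.llama import llama_config_to_gpt2_config, remap_state_dict_hf_llama

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical checkpoint name
hf_config = LlamaConfig.from_pretrained(model_name)
config = llama_config_to_gpt2_config(hf_config)
config.use_flash_attn = True

# Load the HF weights and remap parameter names to this repo's layout.
state_dict = LlamaForCausalLM.from_pretrained(model_name).state_dict()
state_dict = remap_state_dict_hf_llama(state_dict, config)

model = GPTLMHeadModel(config, device="cuda", dtype=torch.float16)
model.load_state_dict(state_dict)
```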

Yeah we should probably add those as arguments to the RotaryEmbedding module (scaling factor, scaling_method="ntk" or scaling_method="standard"). Is "standard" the right name or is there another name?
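For context, a minimal standalone sketch of how the two variants typically differ (not the repo's API; `scaling_factor` and `scaling_method` are just the proposed argument names). The non-NTK variant is often called "linear" scaling (position interpolation) elsewhere, e.g. in HF transformers' `rope_scaling` config.

```python
# Illustrative sketch of the two rotary scaling variants being discussed.
import torch

def rotary_angles(dim, seqlen, base=10000.0, scaling_factor=1.0, scaling_method="linear"):
    if scaling_method == "ntk":
        # NTK-aware scaling: stretch the base so low frequencies are interpolated
        # more than high frequencies; positions are left untouched.
        base = base * scaling_factor ** (dim / (dim - 2))
        t = torch.arange(seqlen, dtype=torch.float32)
    else:
        # Linear / position-interpolation scaling: keep the base, squeeze positions.
        t = torch.arange(seqlen, dtype=torch.float32) / scaling_factor
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(t, inv_freq)  # (seqlen, dim // 2) angles for cos/sin
```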

Thanks for this thorough investigation! I didn't realize RPATH was hard-coded. Is there a way to set that to be relative when we compile? Or is patching the only way?
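On the "relative when we compile" option: one generic approach (not this repo's actual setup.py) is to pass an `$ORIGIN`-based rpath to the linker so the extension resolves shared libraries relative to its own location; the post-hoc alternative is `patchelf --set-rpath '$ORIGIN' <extension>.so` on the built wheel.

```python
# Sketch only: setting a relative RPATH at build time via the linker, so the
# compiled extension doesn't bake in an absolute path from the build machine.
# Extension name and source list are placeholders, not the real build config.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

ext = CUDAExtension(
    name="flash_attn_2_cuda",
    sources=["csrc/flash_attn/flash_api.cpp"],  # abbreviated
    extra_link_args=["-Wl,-rpath,$ORIGIN"],     # look next to the .so instead of a hard-coded path
)

setup(name="flash-attn", ext_modules=[ext], cmdclass={"build_ext": BuildExtension})
```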

Yes we're seeing the best performance on CUDA 12.3. There might be some fix for 12.5 (by better tuning) but we're not there yet.

GitHub runners no longer support Ubuntu 20.04, so we had to compile the wheels with Ubuntu 22.04. The right thing would be to have the GitHub runners compile with manylinux but...

Right, we'd appreciate help with configuring the GitHub runner to use manylinux: https://github.com/Dao-AILab/flash-attention/blob/main/.github/workflows/publish.yml

We're not actively working on Turing but there's a version here: https://github.com/Dao-AILab/flash-attention/issues/1533

What's "context length" here? Which variable?