Tri Dao


Yup seems to check out with my calculation.

Why not just convert the HF weights to use the Llama implementation in this repo? https://github.com/Dao-AILab/flash-attention/blob/a86442f0f35c135c8ed8d7af760b1bd6a832ec07/tests/models/test_llama.py#L65 You can also see how we use rotary in MHA [here](https://github.com/Dao-AILab/flash-attention/blob/a86442f0f35c135c8ed8d7af760b1bd6a832ec07/flash_attn/modules/mha.py#L670).
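A rough sketch of that conversion path, mirroring the pattern in the linked test — the helpers `llama_config_to_gpt2_config` and `remap_state_dict_hf_llama` are assumed to come from `flash_attn.models.llama`, so check the test for the exact API:

```python
# Sketch only: convert an HF Llama checkpoint to this repo's GPT-style model,
# following tests/models/test_llama.py. Helper names are assumptions taken from
# that test; verify them against the linked file.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

from flash_attn.models.gpt import GPTLMHeadModel
from flash_attn.models.llama import llama_config_to_gpt2_config, remap_state_dict_hf_llama

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical checkpoint name
hf_config = LlamaConfig.from_pretrained(model_name)
config = llama_config_to_gpt2_config(hf_config)
config.use_flash_attn = True

# Load the HF weights and remap parameter names to this repo's layout.
state_dict = LlamaForCausalLM.from_pretrained(model_name).state_dict()
state_dict = remap_state_dict_hf_llama(state_dict, config)

model = GPTLMHeadModel(config, device="cuda", dtype=torch.float16)
model.load_state_dict(state_dict)
```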

Yeah we should probably add those as arguments to the RotaryEmbedding module (scaling factor, scaling_method="ntk" or scaling_method="standard"). Is "standard" the right name or is there another name?
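For context, a minimal standalone sketch of how the two variants typically differ (not the repo's API; `scaling_factor` and `scaling_method` are just the proposed argument names). The non-NTK variant is often called "linear" scaling (position interpolation) elsewhere, e.g. in HF transformers' `rope_scaling` config.

```python
# Illustrative sketch of the two rotary scaling variants being discussed.
import torch

def rotary_angles(dim, seqlen, base=10000.0, scaling_factor=1.0, scaling_method="linear"):
    if scaling_method == "ntk":
        # NTK-aware scaling: stretch the base so low frequencies are interpolated
        # more than high frequencies; positions are left untouched.
        base = base * scaling_factor ** (dim / (dim - 2))
        t = torch.arange(seqlen, dtype=torch.float32)
    else:
        # Linear / position-interpolation scaling: keep the base, squeeze positions.
        t = torch.arange(seqlen, dtype=torch.float32) / scaling_factor
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(t, inv_freq)  # (seqlen, dim // 2) angles for cos/sin
```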

Thanks for this thorough investigation! I didn't realize RPATH was hard-coded. Is there a way to set that to be relative when we compile? Or is patching the only way?
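On the "relative when we compile" option: one generic approach (not this repo's actual setup.py) is to pass an `$ORIGIN`-based rpath to the linker so the extension resolves shared libraries relative to its own location; the post-hoc alternative is `patchelf --set-rpath '$ORIGIN' <extension>.so` on the built wheel.

```python
# Sketch only: setting a relative RPATH at build time via the linker, so the
# compiled extension doesn't bake in an absolute path from the build machine.
# Extension name and source list are placeholders, not the real build config.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

ext = CUDAExtension(
    name="flash_attn_2_cuda",
    sources=["csrc/flash_attn/flash_api.cpp"],  # abbreviated
    extra_link_args=["-Wl,-rpath,$ORIGIN"],     # look next to the .so instead of a hard-coded path
)

setup(name="flash-attn", ext_modules=[ext], cmdclass={"build_ext": BuildExtension})
```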

Yes we're seeing the best performance on CUDA 12.3. There might be some fix for 12.5 (by better tuning) but we're not there yet.

GitHub runners no longer support Ubuntu 20.04, so we had to compile the wheels with Ubuntu 22.04. The right thing would be to have the GitHub runners compile with manylinux but...

Right, we'd appreciate help with configuring the GitHub runner to use manylinux: https://github.com/Dao-AILab/flash-attention/blob/main/.github/workflows/publish.yml

We're not actively working on Turing but there's a version here: https://github.com/Dao-AILab/flash-attention/issues/1533

What's "context length" here? Which variable?