
Will Flash-attn support Gemma-2 soft-capping anytime soon?

Open thusinh1969 opened this issue 1 year ago • 6 comments

Great product, TriDao (so talented, my friend).

Will your Flash-attn support Gemma-2 soft-capping anytime soon? We are very impressed by Gemma-2's quality and would like to stick with it if its context length can be expanded. Unsloth has some options, but it only supports a single GPU.
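For reference, Gemma-2's soft capping squashes the attention logits with a tanh before the softmax. A minimal sketch of the operation itself (the 50.0 matches Gemma-2's published attn_logit_softcapping value; this is illustrative reference code, not flash-attn's kernel):

```python
import torch

def soft_cap(scores: torch.Tensor, softcap: float = 50.0) -> torch.Tensor:
    # Squash pre-softmax attention scores into (-softcap, softcap).
    return softcap * torch.tanh(scores / softcap)
```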

Thanks much, my friend. Cheers, Nguyên

thusinh1969 avatar Jul 31 '24 03:07 thusinh1969

it's supported
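Soft capping is exposed through the softcap argument of flash_attn_func in the 2.6 releases. A minimal usage sketch (shapes, dtypes, and the 50.0 cap are illustrative):

```python
import torch
from flash_attn import flash_attn_func

# (batch, seqlen, nheads, headdim), fp16/bf16 tensors on a CUDA device
q = torch.randn(2, 1024, 8, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# softcap=0.0 (the default) disables capping; 50.0 matches Gemma-2's attention config
out = flash_attn_func(q, k, v, causal=True, softcap=50.0)
```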

tridao avatar Jul 31 '24 03:07 tridao

it's supported

Algorithm 1: FlashAttention-3 forward pass without intra-consumer overlapping
Algorithm 2: FlashAttention-3 consumer warpgroup forward pass

Is the current implementation Algorithm 2? Is there an implementation of Algorithm 1 available? I would like to make a comparison. Could you please provide the complete code of Algorithm 1? Thanks!

v-lmn avatar Jul 31 '24 06:07 v-lmn

Hi @tridao, I recently installed the latest version of FlashAttention with:

pip install -U https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

I am using AutoModelForCausalLM from the Hugging Face Transformers library, which I've also upgraded to the latest version. However, I'm still seeing some unexpected results during inference. Before I dive into debugging other parts of my code, I'd like to confirm whether the FlashAttention version I installed is fully compatible with the latest Hugging Face Transformers. Could a version mismatch be causing the issues I'm seeing? Thanks for your help!

HuangBugWei avatar Jul 31 '24 08:07 HuangBugWei
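For what it's worth, recent Transformers releases let you request the flash-attn backend explicitly and raise an error at load time if the installed flash_attn package is missing or too old for the model. A sketch, assuming a recent transformers version and a Gemma-2 checkpoint you have access to:

```python
import torch
from transformers import AutoModelForCausalLM

# Explicitly request the flash-attn backend; transformers validates the installed
# flash_attn package (and its minimum version for this model) at load time.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",                      # example checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```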

Idk anything about HF transformers

tridao avatar Jul 31 '24 17:07 tridao

Thank you for your response. May I then assume that the version I installed supports the soft capping operation?

HuangBugWei avatar Jul 31 '24 22:07 HuangBugWei
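One quick way to check is to print the installed flash-attn version and confirm that flash_attn_func exposes the softcap argument (soft capping landed in the 2.6 series):

```python
import inspect
import flash_attn
from flash_attn import flash_attn_func

print(flash_attn.__version__)  # soft capping is available from 2.6.0 onward
# True if the installed wheel supports the softcap argument
print("softcap" in inspect.signature(flash_attn_func).parameters)
```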

Idk anything about HF transformers

@tridao Is the current implementation Algorithm 2? Is there an implementation of Algorithm 1 available? I would like to make a comparison. Could you please provide the complete code of Algorithm 1? Thanks!

v-lmn avatar Aug 01 '24 06:08 v-lmn