
add support for AMD / ROCm / HIP

Open ehartford opened this issue 1 year ago • 20 comments

I want to request AMD support again, since the AMD/ROCm ecosystem is now much more popular and usable than it used to be.

ehartford avatar Dec 06 '23 23:12 ehartford

AMD is working on it: https://github.com/ROCmSoftwarePlatform/flash-attention

I've not tested it yet, but a new branch with WMMA optimizations for Radeon 7000 was added just yesterday it seems.

wsippel avatar Dec 10 '23 18:12 wsippel

I have put together this guide for my AMD AI configuration: https://github.com/nktice/AMD-AI The ROCm project's flash-attention fork appears to work with ROCm 5.7.3. [ https://github.com/nktice/AMD-AI/blob/main/ROCm-5.7.md - I've not tested it much, but the exllamav2 warnings that appear when flash-attention is not in use disappear once it's installed in this case. ]

Alas, it does not work with ROCm 6 at the time of writing. [ https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md - in this case exllamav2 crashes if flash-attention ( same as above ) is installed. ]

An issue with this is that the AMD fork is always behind the main repository and is hard to maintain separately from the main developers.

What would be helpful is for AMD's changes to be merged back into the main source, so that they do not have to start from scratch again every time the main flash-attention code is updated.

nktice avatar Dec 19 '23 03:12 nktice

@tridao is it possible to merge this to support ROCm?

https://github.com/ROCmSoftwarePlatform/flash-attention

ehartford avatar Jan 18 '24 21:01 ehartford

https://github.com/ROCmSoftwarePlatform/flash-attention

I think that's a fork maintained by AMD folks and it's not meant to be merged.

tridao avatar Jan 18 '24 21:01 tridao

I doubt they would disapprove of merging; it seems like just a communication gap. I will reach out.

ehartford avatar Jan 18 '24 21:01 ehartford

https://github.com/ROCmSoftwarePlatform/flash-attention

I think that's a fork maintained by AMD folks and it's not meant to be merged.

As it's been a while and they haven't updated or integrated, I'd like to mention that AMD rarely updates or maintains such things...
It's common for them to abandon such projects with little notice...
For example, their bitsandbytes port is well out of date - https://github.com/ROCm/bitsandbytes - leading others to improvise for themselves to get things working. [ Here's the most recent working bitsandbytes I've found for ROCm; it's also well out of date, but not quite as abandoned as AMD's own: https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6 ] There's been no quarrel about people's forked versions, and there are a few, but without AMD's help it is something of a mess of mixed offerings.

It is more likely they offered an example of what could be done and how to do it, so that the 'community' could take it from there. [ If that were not the case, they would clearly say so, or keep it private. ]

I have contacted exllamav2 about the version issue; here is what they said - AMD's offered version isn't of much use: https://github.com/turboderp/exllamav2/issues/397#issuecomment-2034652594

nktice avatar Apr 03 '24 06:04 nktice

Maybe @howiejayz could be part of this conversation =)

RichardFevrier avatar Apr 10 '24 10:04 RichardFevrier

Maybe @howiejayz could be part of this conversation =)

Unfortunately I am no longer working on this project :( But as far as I know the other team is still working on it, and it will have long-term support.

jayz0123 avatar Apr 11 '24 06:04 jayz0123

I just submitted a PR to support AMD / ROCm on FlashAttention 2: https://github.com/Dao-AILab/flash-attention/pull/1010 This PR uses composable_kernel as the backend.

rocking5566 avatar Jun 26 '24 19:06 rocking5566
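
For anyone wanting to sanity-check the CK-backed build from that PR, a minimal smoke test along these lines should work. This is an illustrative sketch (the shapes and sizes are arbitrary), not code from the PR itself:

```python
# Hedged sketch: quick smoke test for a ROCm build of flash-attn.
# Assumes a PyTorch ROCm install and a supported GPU (e.g. MI200/MI300).
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
# flash_attn_func expects (batch, seqlen, nheads, headdim) tensors in fp16 or bf16 on the GPU.
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)
print(out.shape, out.dtype)  # expect torch.Size([2, 1024, 16, 64]) torch.float16
```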

Looking at the compile targets, this patch only works on CDNA GPUs, I assume? Is RDNA3 support still in the cards?

wsippel avatar Jun 26 '24 20:06 wsippel

@wsippel Yes, the new PR only works for MI200 and MI300 for now.

rocking5566 avatar Jun 27 '24 00:06 rocking5566

I have mi100s, would love to be able to use them

ehartford avatar Jun 27 '24 00:06 ehartford

I have mi100s, would love to be able to use them

We found MI100 may fail in some of the bf16 test cases. Hence, MI100 is not officially supported for now.

rocking5566 avatar Jun 27 '24 13:06 rocking5566

I would like to look into this bf16 issue. Is the cause well understood or in need of research?

iratebadger avatar Sep 16 '24 01:09 iratebadger

I would like to look into this bf16 issue. Is the cause well understood or in need of research?

We have been focusing on MI300 improvements recently, but MI100 is still in our backlog.

rocking5566 avatar Sep 16 '24 18:09 rocking5566

I have 24 MI100s and would very much like to see MI100 support added. Is there anything I can do to help?

ehartford avatar Sep 16 '24 18:09 ehartford

I would like to concur with ehartford. I'm trying to get the AMD folks to provide more info on the cause of a page fault during the tests, which according to the ROCm folks is an FA issue.

iratebadger avatar Sep 17 '24 20:09 iratebadger

I would like to concur with ehartford. I'm trying to get the AMD folks to provide more info on the cause of a page fault during the tests, which according to the ROCm folks is an FA issue.

@iratebadger are you using the main branch here with MI200 or MI300? As far as I know, bf16 on MI100 is only 92.3 TFLOPS; fp16 is better (184.6 TFLOPS).

rocking5566 avatar Sep 18 '24 04:09 rocking5566

I have 24 MI100s and would very much like to see MI100 support added. Is there anything I can do to help?

@ehartford You should ask your AMD sales contact to raise the priority of MI100 on our roadmap. But you could also try fp16 on MI100.

rocking5566 avatar Sep 18 '24 04:09 rocking5566

I have 24 MI100s and would very much like to see MI100 support added. Is there anything I can do to help?

@ehartford You should ask your AMD sales contact to raise the priority of MI100 on our roadmap. But you could also try fp16 on MI100.

Thank you for this advice! 😁

ehartford avatar Sep 19 '24 13:09 ehartford
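
To make the fp16-vs-bf16 suggestion concrete, a rough comparison like the one below can show whether bf16 results diverge from an fp32 reference on a given GPU such as the MI100. This is an assumed sketch using PyTorch's built-in attention as the reference, not the project's official test suite:

```python
# Hedged sketch: compare flash-attn output in fp16 vs bf16 against PyTorch's
# scaled_dot_product_attention computed in fp32. Shapes and the lack of
# tolerances are illustrative only.
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

def max_err(dtype):
    torch.manual_seed(0)
    q = torch.randn(1, 512, 8, 64, device="cuda", dtype=dtype)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = flash_attn_func(q, k, v, causal=True)
    # SDPA expects (batch, nheads, seqlen, headdim); compute the reference in fp32.
    ref = F.scaled_dot_product_attention(
        q.transpose(1, 2).float(), k.transpose(1, 2).float(),
        v.transpose(1, 2).float(), is_causal=True,
    ).transpose(1, 2)
    return (out.float() - ref).abs().max().item()

for dtype in (torch.float16, torch.bfloat16):
    print(dtype, max_err(dtype))
```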

I would like to look into this bf16 issue. Is the cause well understood or in need of research?

We have been focusing on MI300 improvements recently, but MI100 is still in our backlog.

Hi, I'm currently poking at an MI300 system and am getting compile errors with FA. It looks like a HIPifying error with the included CK files. Have you seen this error before?

```
flash-attention/build/temp.linux-x86_64-cpython-311/csrc/flash_attn_ck/mha_fwd.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 --offload-arch=gfx942 -O3 -std=c++17 -DCK_TILE_FMHA_FWD_FAST_EXP2=1 -fgpu-flush-denormals-to-zero -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_USE_XDL -DUSE_PROF_API=1 -D__HIP_PLATFORM_HCC__=1 -DCK_TILE_FLOAT_TO_BFLOAT16_DEFAULT=3 -fno-offload-uniform-block -mllvm -enable-post-misched=0 -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -mllvm -amdgpu-coerce-illegal-types=1 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -fno-gpu-rdc
/home/hotaisle/flash-attention/csrc/flash_attn_ck/mha_fwd.hip:277:33: error: no member named 'getCurrentHIPStream' in namespace 'at::cuda'; did you mean 'getCurrentCUDAStream'?
  277 |         auto stream = at::cuda::getCurrentHIPStream().stream();
      |                       ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
      |                                 getCurrentCUDAStream
/home/hotaisle/miniforge3/envs/llm/lib/python3.11/site-packages/torch/include/c10/hip/HIPStream.h:244:20: note: 'getCurrentCUDAStream' declared here
  244 | C10_API CUDAStream getCurrentCUDAStream(DeviceIndex device_index = -1);
      |                    ^
1 error generated when compiling for gfx942.
```

BTW, I've successfully built, tested (all tests passed), and installed the upstream CK from HEAD (I believe), which should match the "csrc/composable_kernel" submodule that's being pulled in. (I can start a new issue if this is something sufficiently new.)

lhl avatar Oct 08 '24 04:10 lhl
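
When filing that kind of build/runtime bug, an environment snapshot like the following is usually worth attaching. This is just a hedged sketch using standard torch attributes:

```python
# Hedged sketch: environment snapshot to attach to a bug report like the one above.
# flash_attn may not import at all if the build failed, hence the try/except.
import torch

print("torch:", torch.__version__)
print("hip:", torch.version.hip)      # ROCm/HIP version the wheel was built against
print("cuda:", torch.version.cuda)    # None on a ROCm build
print("device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as e:
    print("flash_attn not importable:", e)
```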

BTW, I've successfully built, tested (all tests passed), and installed the upstream CK from HEAD (I believe), which should match the "csrc/composable_kernel" submodule that's being pulled in.

@lhl Could you try to compile upstream FA with the following command? pip install git+https://github.com/Dao-AILab/flash-attention.git@05b657e97e0a7795e14546c06de07a49923a94e8 -v

rocking5566 avatar Oct 08 '24 08:10 rocking5566

BTW, I've successfully built, tested (all tests passed), and installed the upstream CK from HEAD (I believe), which should match the "csrc/composable_kernel" submodule that's being pulled in.

@lhl Could you try to compile upstream FA with the following command? pip install git+https://github.com/Dao-AILab/flash-attention.git@05b657e97e0a7795e14546c06de07a49923a94e8 -v

Hey, sorry for the delay; I've been traveling, so I just got a chance to sit down this morning. I'm still getting a few errors like this. Let me file a proper bug, since I have a lot of logs etc. to post:

```
  299 |         auto stream = at::hip::getCurrentHIPStream().stream();
      |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                       c10::cuda::getCurrentCUDAStream
```

lhl avatar Oct 10 '24 18:10 lhl