flash-attention
Implemented Flash Attention2 for Intel GPU hardware
Hi,
@tridao, we'd like to add an Intel backend for flash-attn. This PR implements the mha_fwd function for Intel hardware such as Intel(R) Arc(TM) B580 Graphics (BMG) and Intel® Core™ Ultra 7 processors (Lunar Lake). Other APIs, e.g. mha_bwd/varlen_mha, are work in progress (WIP). The C++ API follows the same design as the original CUDA/ROCm implementation, and the Python interface is reused without changes.
This implementation works seamlessly with stock PyTorch and has no third-party dependencies. It does not affect the existing support for NVIDIA or ROCm hardware.
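Since the Python interface is reused without changes, usage on Intel hardware should look the same as on CUDA/ROCm. Here is a minimal sketch, assuming this backend is installed and that stock PyTorch exposes the Intel GPU as the "xpu" device (device name and build details are assumptions, not confirmed by this PR text):

```python
# Minimal usage sketch (assumes an XPU-enabled flash-attn build and a
# PyTorch installation that exposes Intel GPUs via the "xpu" device).
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="xpu", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Same Python call as on CUDA/ROCm; only the forward pass (mha_fwd) is
# implemented for Intel hardware in this PR.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```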
cc @pengzhao-intel
@tridao any thoughts or suggestions for this PR?
Thanks for this contribution! What happens if user calls a function that's not currently supported (e.g. paged KV or varlen)?
Currently, all checks are handled on the kernel side, and an error will be raised if a feature is not supported. We are actively working on adding support for these features.
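For illustration only: based on the statement above, calling a not-yet-supported path such as varlen on the Intel backend should be rejected by the kernel-side checks. The exact error type and message are assumptions here, as is the "xpu" device name:

```python
# Hedged illustration: unsupported features (e.g. varlen) are expected to be
# rejected by kernel-side checks; the RuntimeError type/message is an assumption.
import torch
from flash_attn import flash_attn_varlen_func

nheads, headdim = 16, 64
seqlens = [128, 256]
total = sum(seqlens)
q = torch.randn(total, nheads, headdim, device="xpu", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
cu_seqlens = torch.tensor([0, 128, 384], device="xpu", dtype=torch.int32)

try:
    out = flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    )
except RuntimeError as err:
    # The varlen path is not implemented for the Intel backend yet,
    # so the kernel-side check is expected to raise here.
    print(f"varlen not supported on this backend yet: {err}")
```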
Any word on when this will be merged? Very excited for this! I use Intel GPUs.
@Wanzizhu this is great work!
I just went through a couple of files and noted some minor possible improvements, although I did not go through the entire change.
(Tagging others FYI: @jgong5 @rbiessy @mehdi-goli @alcpz)
This is excellent work!! Looking forward to using this feature on Intel client GPUs.
I would guess this needs a rebase now. Or are you waiting on paged attention to merge?