
New binary release needed for PyTorch 2.7.0 (torch2.7.0cu128 / torch2.6.0cu126 + flash_attn-2.7.4.post1 seem broken because PyTorch changed its ABI)

Open vadimkantorov opened this issue 6 months ago • 20 comments

python -c 'import flash_attn_2_cuda as flash_attn_gpu'
#Traceback (most recent call last):
#  File "<string>", line 1, in <module>
#ImportError: libc10.so: cannot open shared object file: No such file or directory

python -c 'import torch; import flash_attn_2_cuda as flash_attn_gpu'
#Traceback (most recent call last):
#  File "<string>", line 1, in <module>
#ImportError: /home/inferencer/.local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs
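For context, the ABI that the installed torch wheel was built with can be checked from Python itself. A quick diagnostic sketch, using the public torch.compiled_with_cxx11_abi() helper and the private flag referenced later in this thread:

```python
# Quick diagnostic: report the torch build and its C++ ABI flag.
# A mismatch between this flag and the one flash_attn_2_cuda was compiled
# with leads to "undefined symbol" errors like the one above.
import torch

print(torch.__version__)                 # e.g. 2.7.0+cu128
print(torch.version.cuda)                # CUDA toolkit the wheel targets
print(torch.compiled_with_cxx11_abi())   # True if built with the new libstdc++ ABI
print(torch._C._GLIBCXX_USE_CXX11_ABI)   # same flag via the private torch._C module
```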

Maybe related to:

  • https://github.com/Dao-AILab/flash-attention/issues/1622#issuecomment-2837482873

vadimkantorov avatar May 04 '25 20:05 vadimkantorov

With torch '2.6.0+cu126' (on a CUDA 12.8 machine), same problem...

vadimkantorov avatar May 04 '25 21:05 vadimkantorov

The only torch that works with the pip version of flash_attn is 2.6.0+cu124.

In the cu124 version:

nm ~/.local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so | grep _ZN3c105ErrorC2ENS_14SourceLocationESs
# U _ZN3c105ErrorC2ENS_14SourceLocationESs

nm ~/.local/lib/python3.10/site-packages/torch/lib/libc10.so | grep _ZN3c105ErrorC2ENS_14SourceLocationESs
# 000000000008e120 T _ZN3c105ErrorC2ENS_14SourceLocationESs
# 00000000000384a0 t _ZN3c105ErrorC2ENS_14SourceLocationESs.cold

In cu126, this symbol disappears:

nm ~/.local/lib/python3.10/site-packages/torch/lib/libc10.so | grep _ZN3c105ErrorC2ENS_14SourceLocationESs

Also, a question: why does flash_attention depend on _ZN3c105ErrorC2ENS_14SourceLocationESs in the first place?
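For what it's worth, that symbol demangles to a c10::Error constructor (the error type that TORCH_CHECK-style macros end up throwing), and the trailing Ss in the mangled name is the pre-CXX11 std::string, which is why the symbol only exists in libc10.so builds using the old ABI. A minimal sketch to check the demangling, assuming c++filt from GNU binutils is on PATH:

```python
# Demangle the missing symbol with c++filt (GNU binutils).
# Expected output: c10::Error::Error(c10::SourceLocation, std::string)
# "Ss" is the mangling of the pre-CXX11 std::string; a CXX11-ABI build of
# libc10.so instead exports a differently mangled constructor taking
# std::__cxx11::basic_string, so the old symbol is simply gone.
import subprocess

symbol = "_ZN3c105ErrorC2ENS_14SourceLocationESs"
print(subprocess.check_output(["c++filt", symbol], text=True).strip())
```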

vadimkantorov avatar May 04 '25 21:05 vadimkantorov

@tridao as advised by @malfet, the issue is that PyTorch updated its C++ ABI in 2.6.0+cu126, and it has stayed this way in 2.7.0+cu128:

  • https://github.com/pytorch/pytorch/issues/152790#issuecomment-2851107741

so probably a new push of flash_attention binaries to pip is needed for 2.6.0+cu126 and >=2.7.0; otherwise users will have to compile from source...

vadimkantorov avatar May 05 '25 14:05 vadimkantorov

@vadimkantorov do you know if this is the only symbol? It's a bit ugly, but it's possible to have a "multi-ABI" library, so adding pre-CXX11 support for just TORCH_CHECK shouldn't be that hard...

malfet avatar May 05 '25 14:05 malfet

I don't know how to try this. I did nm ~/.local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so:

flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so.txt

Maybe you can cross-reference the c10 symbols manually against the recent symbols of libc10_cuda.so in 2.7.0?

Some of the c10 symbols in flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so are (a rough cross-referencing sketch follows the listing):

                 U _ZN3c1010TensorImpl17set_autograd_metaESt10unique_ptrINS_21AutogradMetaInterfaceESt14default_deleteIS2_EE
00000000001aa270 W _ZN3c1010ValueErrorD0Ev
00000000001a9fa0 W _ZN3c1010ValueErrorD1Ev
00000000001a9fa0 W _ZN3c1010ValueErrorD2Ev
00000000001a93c0 W _ZN3c1011SymNodeImplD0Ev
00000000001ae500 W _ZN3c1011SymNodeImplD1Ev
00000000001ae500 W _ZN3c1011SymNodeImplD2Ev
00000000001af070 W _ZN3c1013intrusive_ptrINS_10TensorImplENS_19UndefinedTensorImplEE6reset_Ev
00000000001af7b0 W _ZN3c1013intrusive_ptrINS_10TensorImplENS_6detail34intrusive_target_default_null_typeIS1_EEE6reset_Ev
00000000001ae510 W _ZN3c1013intrusive_ptrINS_11SymNodeImplENS_6detail34intrusive_target_default_null_typeIS1_EEE6reset_Ev
00000000001af000 W _ZN3c1013intrusive_ptrINS_13GeneratorImplENS_6detail34intrusive_target_default_null_typeIS1_EEE6reset_Ev
00000000001aef90 W _ZN3c1013intrusive_ptrINS_15VariableVersion14VersionCounterENS_6detail34intrusive_target_default_null_typeIS2_EEE6reset_Ev
00000000001aff50 W _ZN3c1013intrusive_ptrINS_15VariableVersion14VersionCounterENS_6detail34intrusive_target_default_null_typeIS2_EEEC1EPS2_
00000000001aff50 W _ZN3c1013intrusive_ptrINS_15VariableVersion14VersionCounterENS_6detail34intrusive_target_default_null_typeIS2_EEEC2EPS2_
00000000001ae4a0 W _ZN3c1013intrusive_ptrINS_20intrusive_ptr_targetENS_6detail34intrusive_target_default_null_typeIS1_EEE6reset_Ev
00000000001a93b0 W _ZN3c1015VariableVersion14VersionCounterD0Ev
00000000001a92d0 W _ZN3c1015VariableVersion14VersionCounterD1Ev
00000000001a92d0 W _ZN3c1015VariableVersion14VersionCounterD2Ev
                 U _ZN3c1019UndefinedTensorImpl10_singletonE
00000000001ac270 W _ZN3c1019fromIntArrayRefSlowENS_8ArrayRefIlEE
00000000001a9090 W _ZN3c1020intrusive_ptr_target17release_resourcesEv
                 U _ZN3c1021AutogradMetaInterfaceD2Ev
                 U _ZN3c1021throwNullDataPtrErrorEv
                 U _ZN3c1021warnDeprecatedDataPtrEv
                 U _ZN3c104cuda12device_countEv
                 U _ZN3c104cuda14ExchangeDeviceEa
                 U _ZN3c104cuda14MaybeSetDeviceEa
                 U _ZN3c104cuda17getStreamFromPoolEba
                 U _ZN3c104cuda17getStreamFromPoolEia
                 U _ZN3c104cuda20CUDACachingAllocator9allocatorE
                 U _ZN3c104cuda20getCurrentCUDAStreamEa
                 U _ZN3c104cuda20getDefaultCUDAStreamEa
                 U _ZN3c104cuda20setCurrentCUDAStreamENS0_10CUDAStreamE
                 U _ZN3c104cuda21warn_or_error_on_syncEv
                 U _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib
00000000001a93e0 W _ZN3c104cuda4impl13CUDAGuardImplD0Ev
00000000001a92c0 W _ZN3c104cuda4impl13CUDAGuardImplD1Ev
00000000001a92c0 W _ZN3c104cuda4impl13CUDAGuardImplD2Ev
                 U _ZN3c104cuda9GetDeviceEPa
                 U _ZN3c104cuda9SetDeviceEa
00000000001a93d0 W _ZN3c104impl16VirtualGuardImplD0Ev
00000000001a92e0 W _ZN3c104impl16VirtualGuardImplD1Ev
00000000001a92e0 W _ZN3c104impl16VirtualGuardImplD2Ev
                 U _ZN3c104impl23ExcludeDispatchKeyGuardC1ENS_14DispatchKeySetE
                 U _ZN3c104impl23ExcludeDispatchKeyGuardD1Ev
                 U _ZN3c104impl26device_guard_impl_registryE
                 U _ZN3c104impl3cow15is_cow_data_ptrERKNS_7DataPtrE
                 U _ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE
                 U _ZN3c104impl8GPUTrace13gpuTraceStateE
                 U _ZN3c104impl8GPUTrace9haveStateE
                 U _ZN3c104warnERKNS_7WarningE
                 U _ZN3c105ErrorC2ENS_14SourceLocationESs
                 U _ZN3c106SymInt19promote_to_negativeEv
00000000001aa6c0 W _ZN3c106SymIntC1El
00000000001aa6c0 W _ZN3c106SymIntC2El
00000000001ad4e0 W _ZN3c106detail12_str_wrapperIJPKcRKNS_10DeviceTypeES3_EE4callERKS3_S6_S9_
00000000001b5b60 W _ZN3c106detail12_str_wrapperIJPKcRKNS_10DeviceTypeES3_S6_S3_EE4callERKS3_S6_S9_S6_S9_
00000000001ac6f0 W _ZN3c106detail12_str_wrapperIJPKcRKS3_EE4callES5_S5_
00000000001adda0 W _ZN3c106detail12_str_wrapperIJPKcRKS3_S3_EE4callES5_S5_S5_
00000000001abed0 W _ZN3c106detail12_str_wrapperIJPKcRKlEE4callERKS3_S5_
                 U _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
                 U _ZN3c106detail14torchCheckFailEPKcS2_jS2_
                 U _ZN3c106detail19maybe_wrap_dim_slowIlEET_S2_S2_b
                 U _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_S2_
                 U _ZN3c107WarningC1ESt7variantIJNS0_11UserWarningENS0_18DeprecationWarningEEERKNS_14SourceLocationESsb
00000000004e4ad0 r _ZN3c10L45autograd_dispatch_keyset_with_ADInplaceOrViewE
                 U _ZN3c10lsERSoNS_10DeviceTypeE
                 U _ZN3c10ltERKNS_6SymIntEi
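Picking up the cross-referencing suggestion above, here is a rough sketch that lists every c10 symbol the extension needs but that neither libc10.so nor libc10_cuda.so of the installed torch exports. The extension path and the symbol-type filtering are assumptions to adapt to your environment:

```python
# Rough sketch: compare the undefined (U) c10 symbols of flash_attn_2_cuda
# against the dynamic symbols exported by the installed torch's libc10.so
# and libc10_cuda.so.  Adjust the extension path to your site-packages layout.
import subprocess
from pathlib import Path

import torch

ext = Path.home() / ".local/lib/python3.10/site-packages" / \
    "flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so"
torch_lib = Path(torch.__file__).parent / "lib"

def dynamic_symbols(path, kinds):
    """Return names of dynamic symbols whose nm type letter is in `kinds`."""
    out = subprocess.check_output(["nm", "-D", str(path)], text=True)
    names = set()
    for line in out.splitlines():
        parts = line.split()
        # defined: "<addr> <type> <name>"; undefined: "<type> <name>"
        if len(parts) >= 2 and parts[-2] in kinds:
            names.add(parts[-1])
    return names

needed = dynamic_symbols(ext, {"U"})
provided = set()
for lib in ("libc10.so", "libc10_cuda.so"):
    provided |= dynamic_symbols(torch_lib / lib, {"T", "W", "B", "D", "V", "u", "i"})

missing = sorted(s for s in needed if s.startswith("_ZN3c10") and s not in provided)
print("\n".join(missing) or "all c10 symbols resolved")
```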

vadimkantorov avatar May 05 '25 15:05 vadimkantorov

Adding this line to the compiler flags may help:

-D_GLIBCXX_USE_CXX11_ABI=$(shell python3 -c "import torch; print(int(torch._C._GLIBCXX_USE_CXX11_ABI))")

In setup.py, these can be added as follows:

import subprocess

# 0 or 1, matching the _GLIBCXX_USE_CXX11_ABI flag of the torch build the extension will link against
cxx11_abi = subprocess.check_output(['python', '-c', "import torch; print(int(torch._C._GLIBCXX_USE_CXX11_ABI))"]).decode().strip()
cuda_flags = [
    # ... old cuda flags ...
    f'-D_GLIBCXX_USE_CXX11_ABI={cxx11_abi}',
]
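For readers unsure where these flags end up, here is a minimal sketch of a torch.utils.cpp_extension-based setup.py wired this way. The extension name and source list are placeholders, not the project's real build script:

```python
# Minimal sketch (placeholder names/sources): pass the torch ABI flag to both
# the host (cxx) and device (nvcc) compilers so the extension links against
# the same libstdc++ string ABI as the installed torch.
from setuptools import setup
import torch
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

abi_flag = f"-D_GLIBCXX_USE_CXX11_ABI={int(torch._C._GLIBCXX_USE_CXX11_ABI)}"

setup(
    name="my_flash_attn_ext",                 # placeholder package name
    ext_modules=[
        CUDAExtension(
            name="my_flash_attn_ext",
            sources=["flash_api.cpp"],        # placeholder source list
            extra_compile_args={
                "cxx": ["-O3", abi_flag],
                "nvcc": ["-O3", abi_flag],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

This also explains why the cxx11abiTRUE wheel built against torch 2.6 can still load under torch 2.7, as reported further down in the thread: what has to match is the ABI flag, not only the torch version.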

DanFu09 avatar May 05 '25 18:05 DanFu09

I'm sorry, but my brain is short-circuiting on translating your pseudocode into the setup.py modifications.

LTSarc avatar May 06 '25 23:05 LTSarc

I got it working, at least for my build.

Zarrac avatar May 09 '25 14:05 Zarrac

I got it working, at least for my build.

Thanks, it saved me tons of hours.

TueVNguyen avatar May 20 '25 21:05 TueVNguyen

Does anyone know if/when we can expect a flash-attn release on PyPI (and the associated wheels) that supports torch 2.7? Thanks!

dakinggg avatar May 21 '25 22:05 dakinggg

For me, for now something like this fixed the issue (make sure to adjust cp310); it seems to be working with torch2.7 even though this was built for torch2.6:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
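If hard-coding cp310 is the annoying part, a small sketch can assemble the same release URL from the local interpreter and torch build; the filename pattern below just mirrors the v2.7.4.post1 release assets linked above:

```python
# Sketch: build the workaround wheel URL from local info instead of
# hard-coding cp310.  The pattern mirrors the v2.7.4.post1 release assets
# (cu12 / torch2.6 / cxx11abi{TRUE,FALSE} / linux_x86_64).
import sys
import torch

py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
abi = "TRUE" if torch.compiled_with_cxx11_abi() else "FALSE"
url = (
    "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/"
    f"flash_attn-2.7.4.post1+cu12torch2.6cxx11abi{abi}-{py_tag}-{py_tag}-linux_x86_64.whl"
)
print("pip install", url)
```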

vadimkantorov avatar May 21 '25 22:05 vadimkantorov

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl

I confirm your solution works

denadai2 avatar May 22 '25 10:05 denadai2

For me, for now something like this fixed the issue (make sure to adjust cp310); it seems to be working with torch2.7 even though this was built for torch2.6:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl

In my case, I can install it (specifically the Python 3.11 version), but then ComfyUI throws some undefined symbol: _ZN3c104cuda9SetDeviceEi exceptions. I installed @Zarrac's wheel and it seems to work fine.

Karlmeister avatar May 26 '25 00:05 Karlmeister

@vadimkantorov's solution also seems to work on my end. Thanks!

konan009 avatar May 29 '25 11:05 konan009

This will be great:

  • https://github.com/Dao-AILab/flash-attention/issues/1696#issuecomment-2966762278

vadimkantorov avatar Jun 14 '25 13:06 vadimkantorov

@tridao as advised by @malfet, the issue is that PyTorch updated its C++ ABI in 2.6.0+cu126, and it has stayed this way in 2.7.0+cu128:

  • [CXX11ABI] torch 2.6.0-cu126 and cu124 have different exported symbols, pytorch/pytorch#152790 (comment): https://github.com/pytorch/pytorch/issues/152790#issuecomment-2851107741

so probably a new push of flash_attention binaries to pip is needed for 2.6.0+cu126 and >=2.7.0; otherwise users will have to compile from source...

@vadimkantorov @malfet I just spent a lot of coffee and writing here:

https://github.com/Dao-AILab/flash-attention/issues/1717#issuecomment-2984172823

to find out and explain what you said in like 2 lines a month ago... 😂😂

loscrossos avatar Jun 20 '25 11:06 loscrossos

@vadimkantorov do you know if this is the only symbol? It's a bit ugly, but it's possible to have a "multi-ABI" library, so adding pre-CXX11 support for just TORCH_CHECK shouldn't be that hard...

Is this something that someone could submit a PR for so we can get working binaries again? Compilers are way out of my wheelhouse.

winglian avatar Jul 05 '25 03:07 winglian

For me, for now something like this fixed the issue (make sure to adjust cp310); it seems to be working with torch2.7 even though this was built for torch2.6:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl

It works for me. I switched to the python3.11 wheel. My torch version is 2.7.1+cu126. At first I installed the wrong whl, the one with 'abi' FALSE, and it showed a new error. After installing the abi TRUE whl, it works.

Thomas333333 avatar Jul 29 '25 12:07 Thomas333333

So does anyone have a solution for CUDA 12.8 + torch 2.7 with pip?

Jerry-hyl avatar Aug 03 '25 12:08 Jerry-hyl

So does anyone have a solution for CUDA 12.8 + torch 2.7 with pip?

It works for me: CUDA 12.8, torch 2.7.1, transformers 4.57.0, with an RTX 4090 + 3090.

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl

dablro12 avatar Oct 21 '25 12:10 dablro12