flash-attention
New binaries release needed for PyTorch 2.7.0 (torch2.7.0cu128 / torch2.6.0cu126 + flash_attn-2.7.4.post1 seem broken because PyTorch changed ABI)
python -c 'import flash_attn_2_cuda as flash_attn_gpu'
#Traceback (most recent call last):
# File "<string>", line 1, in <module>
#ImportError: libc10.so: cannot open shared object file: No such file or directory
python -c 'import torch; import flash_attn_2_cuda as flash_attn_gpu'
#Traceback (most recent call last):
# File "<string>", line 1, in <module>
#ImportError: /home/inferencer/.local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs
Maybe related to:
- https://github.com/Dao-AILab/flash-attention/issues/1622#issuecomment-2837482873
With torch '2.6.0+cu126' (on a cuda 12.8 machine), same problem...
The only torch version that works with the pip wheels of flash_attn is 2.6.0+cu124.
In the cu124 version:
nm ~/.local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so | grep _ZN3c105ErrorC2ENS_14SourceLocationESs
U _ZN3c105ErrorC2ENS_14SourceLocationESs
nm ~/.local/lib/python3.10/site-packages/torch/lib/libc10.so | grep _ZN3c105ErrorC2ENS_14SourceLocationESs
# 000000000008e120 T _ZN3c105ErrorC2ENS_14SourceLocationESs
# 00000000000384a0 t _ZN3c105ErrorC2ENS_14SourceLocationESs.cold
In the cu126 version, this symbol disappears (the same grep returns nothing):
nm ~/.local/lib/python3.10/site-packages/torch/lib/libc10.so | grep _ZN3c105ErrorC2ENS_14SourceLocationESs
Also, a question: why does flash_attn depend on _ZN3c105ErrorC2ENS_14SourceLocationESs in the first place?
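For reference, c++filt demangles that symbol to c10::Error::Error(c10::SourceLocation, std::string), and the trailing Ss means the pre-CXX11 std::string, which is presumably what the TORCH_CHECK / error-throwing headers reference when built against the old ABI. A quick sketch (assuming c++filt from binutils is on PATH) to check both the symbol and the ABI of the locally installed torch:

```python
# Demangle the missing symbol and report which C++ ABI the installed torch
# was built with (c++filt ships with binutils; adjust if it is not on PATH).
import subprocess
import torch

sym = "_ZN3c105ErrorC2ENS_14SourceLocationESs"
print(subprocess.check_output(["c++filt", sym]).decode().strip())
# -> c10::Error::Error(c10::SourceLocation, std::string)  (pre-CXX11 std::string)

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("_GLIBCXX_USE_CXX11_ABI:", torch._C._GLIBCXX_USE_CXX11_ABI)
```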
@tridao as advised by @malfet, the issue is PyTorch updated its C++ ABI in 2.6.0cu126, and it stayed this way in 2.7.0cu128:
- https://github.com/pytorch/pytorch/issues/152790#issuecomment-2851107741
so probably a new push of flash_attn binaries to pip is needed for 2.6.0cu126 and >=2.7.0, otherwise users will have to compile from source...
@vadimkantorov do you know if this is the only symbol? It's a bit ugly, but it's possible to have a "multi-ABI" library, so adding pre-CXX11 support for just TORCH_CHECK shouldn't be that hard..
I don't know how to try this. I did nm ~/.local/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so:
flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so.txt
Maybe you can cross-reference the c10 symbols manually against the recent symbols of libc10_cuda.so in 2.7.0?
Some of the c10 symbols in flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so are:
U _ZN3c1010TensorImpl17set_autograd_metaESt10unique_ptrINS_21AutogradMetaInterfaceESt14default_deleteIS2_EE
00000000001aa270 W _ZN3c1010ValueErrorD0Ev
00000000001a9fa0 W _ZN3c1010ValueErrorD1Ev
00000000001a9fa0 W _ZN3c1010ValueErrorD2Ev
00000000001a93c0 W _ZN3c1011SymNodeImplD0Ev
00000000001ae500 W _ZN3c1011SymNodeImplD1Ev
00000000001ae500 W _ZN3c1011SymNodeImplD2Ev
00000000001af070 W _ZN3c1013intrusive_ptrINS_10TensorImplENS_19UndefinedTensorImplEE6reset_Ev
00000000001af7b0 W _ZN3c1013intrusive_ptrINS_10TensorImplENS_6detail34intrusive_target_default_null_typeIS1_EEE6reset_Ev
00000000001ae510 W _ZN3c1013intrusive_ptrINS_11SymNodeImplENS_6detail34intrusive_target_default_null_typeIS1_EEE6reset_Ev
00000000001af000 W _ZN3c1013intrusive_ptrINS_13GeneratorImplENS_6detail34intrusive_target_default_null_typeIS1_EEE6reset_Ev
00000000001aef90 W _ZN3c1013intrusive_ptrINS_15VariableVersion14VersionCounterENS_6detail34intrusive_target_default_null_typeIS2_EEE6reset_Ev
00000000001aff50 W _ZN3c1013intrusive_ptrINS_15VariableVersion14VersionCounterENS_6detail34intrusive_target_default_null_typeIS2_EEEC1EPS2_
00000000001aff50 W _ZN3c1013intrusive_ptrINS_15VariableVersion14VersionCounterENS_6detail34intrusive_target_default_null_typeIS2_EEEC2EPS2_
00000000001ae4a0 W _ZN3c1013intrusive_ptrINS_20intrusive_ptr_targetENS_6detail34intrusive_target_default_null_typeIS1_EEE6reset_Ev
00000000001a93b0 W _ZN3c1015VariableVersion14VersionCounterD0Ev
00000000001a92d0 W _ZN3c1015VariableVersion14VersionCounterD1Ev
00000000001a92d0 W _ZN3c1015VariableVersion14VersionCounterD2Ev
U _ZN3c1019UndefinedTensorImpl10_singletonE
00000000001ac270 W _ZN3c1019fromIntArrayRefSlowENS_8ArrayRefIlEE
00000000001a9090 W _ZN3c1020intrusive_ptr_target17release_resourcesEv
U _ZN3c1021AutogradMetaInterfaceD2Ev
U _ZN3c1021throwNullDataPtrErrorEv
U _ZN3c1021warnDeprecatedDataPtrEv
U _ZN3c104cuda12device_countEv
U _ZN3c104cuda14ExchangeDeviceEa
U _ZN3c104cuda14MaybeSetDeviceEa
U _ZN3c104cuda17getStreamFromPoolEba
U _ZN3c104cuda17getStreamFromPoolEia
U _ZN3c104cuda20CUDACachingAllocator9allocatorE
U _ZN3c104cuda20getCurrentCUDAStreamEa
U _ZN3c104cuda20getDefaultCUDAStreamEa
U _ZN3c104cuda20setCurrentCUDAStreamENS0_10CUDAStreamE
U _ZN3c104cuda21warn_or_error_on_syncEv
U _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib
00000000001a93e0 W _ZN3c104cuda4impl13CUDAGuardImplD0Ev
00000000001a92c0 W _ZN3c104cuda4impl13CUDAGuardImplD1Ev
00000000001a92c0 W _ZN3c104cuda4impl13CUDAGuardImplD2Ev
U _ZN3c104cuda9GetDeviceEPa
U _ZN3c104cuda9SetDeviceEa
00000000001a93d0 W _ZN3c104impl16VirtualGuardImplD0Ev
00000000001a92e0 W _ZN3c104impl16VirtualGuardImplD1Ev
00000000001a92e0 W _ZN3c104impl16VirtualGuardImplD2Ev
U _ZN3c104impl23ExcludeDispatchKeyGuardC1ENS_14DispatchKeySetE
U _ZN3c104impl23ExcludeDispatchKeyGuardD1Ev
U _ZN3c104impl26device_guard_impl_registryE
U _ZN3c104impl3cow15is_cow_data_ptrERKNS_7DataPtrE
U _ZN3c104impl3cow23materialize_cow_storageERNS_11StorageImplE
U _ZN3c104impl8GPUTrace13gpuTraceStateE
U _ZN3c104impl8GPUTrace9haveStateE
U _ZN3c104warnERKNS_7WarningE
U _ZN3c105ErrorC2ENS_14SourceLocationESs
U _ZN3c106SymInt19promote_to_negativeEv
00000000001aa6c0 W _ZN3c106SymIntC1El
00000000001aa6c0 W _ZN3c106SymIntC2El
00000000001ad4e0 W _ZN3c106detail12_str_wrapperIJPKcRKNS_10DeviceTypeES3_EE4callERKS3_S6_S9_
00000000001b5b60 W _ZN3c106detail12_str_wrapperIJPKcRKNS_10DeviceTypeES3_S6_S3_EE4callERKS3_S6_S9_S6_S9_
00000000001ac6f0 W _ZN3c106detail12_str_wrapperIJPKcRKS3_EE4callES5_S5_
00000000001adda0 W _ZN3c106detail12_str_wrapperIJPKcRKS3_S3_EE4callES5_S5_S5_
00000000001abed0 W _ZN3c106detail12_str_wrapperIJPKcRKlEE4callERKS3_S5_
U _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
U _ZN3c106detail14torchCheckFailEPKcS2_jS2_
U _ZN3c106detail19maybe_wrap_dim_slowIlEET_S2_S2_b
U _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_S2_
U _ZN3c107WarningC1ESt7variantIJNS0_11UserWarningENS0_18DeprecationWarningEEERKNS_14SourceLocationESsb
00000000004e4ad0 r _ZN3c10L45autograd_dispatch_keyset_with_ADInplaceOrViewE
U _ZN3c10lsERSoNS_10DeviceTypeE
U _ZN3c10ltERKNS_6SymIntEi
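A rough sketch of how that cross-reference could be automated; the site-packages path and the choice to scan every lib*.so under torch/lib are assumptions, so adjust them to your install:

```python
# Compare the undefined c10::* symbols of the flash-attn extension against the
# dynamic symbols exported by the installed torch libraries (uses GNU nm).
import glob
import os
import subprocess
import torch

# Assumed locations; adjust to your environment.
site_packages = os.path.expanduser("~/.local/lib/python3.10/site-packages")
ext = glob.glob(os.path.join(site_packages, "flash_attn_2_cuda*.so"))[0]
torch_lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")

def dynamic_symbols(path, undefined_only):
    flag = "--undefined-only" if undefined_only else "--defined-only"
    try:
        out = subprocess.check_output(["nm", "-D", flag, path],
                                      stderr=subprocess.DEVNULL).decode()
    except subprocess.CalledProcessError:
        return set()
    return {line.split()[-1] for line in out.splitlines() if line.strip()}

needed = {s for s in dynamic_symbols(ext, True) if s.startswith("_ZN3c10")}
provided = set()
for lib in glob.glob(os.path.join(torch_lib_dir, "lib*.so")):
    provided |= dynamic_symbols(lib, False)

for sym in sorted(needed - provided):
    print("not exported by the installed torch:", sym)
```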
Adding this flag to the compiler flags may help (Makefile syntax; note the int() so the macro is set to 0/1 rather than True/False):
-D_GLIBCXX_USE_CXX11_ABI=$(shell python3 -c "import torch; print(int(torch._C._GLIBCXX_USE_CXX11_ABI))")
In setup.py, the same flag can be added as follows:
import subprocess

cxx11_abi = subprocess.check_output(
    ['python', '-c', "import torch; print(int(torch._C._GLIBCXX_USE_CXX11_ABI))"]
).decode().strip()

cuda_flags = [
    # ... existing nvcc flags ...
    f'-D_GLIBCXX_USE_CXX11_ABI={cxx11_abi}',
]
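If it helps, here is a minimal, self-contained sketch of where such a flag goes when building a CUDA extension against torch; the module and source names are placeholders, not flash-attn's actual setup.py:

```python
# Hypothetical setup.py sketch: pass the ABI define to both the host (cxx) and
# device (nvcc) compilers so the extension matches the installed torch.
import torch
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

abi_flag = f"-D_GLIBCXX_USE_CXX11_ABI={int(torch._C._GLIBCXX_USE_CXX11_ABI)}"

setup(
    name="my_cuda_ext",  # placeholder name
    ext_modules=[
        CUDAExtension(
            name="my_cuda_ext",
            sources=["my_ext.cpp", "my_ext_kernel.cu"],  # placeholder sources
            extra_compile_args={"cxx": [abi_flag], "nvcc": [abi_flag]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```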
I'm sorry, but my brain is short-circuiting on translating your pseudocode for the setup.py modifications.
I got it working, at least for my build.
Anyone know if/when we can expect a flash-attn release on pypi (and associated released wheels) that supports torch 2.7? Thanks!
For me, for now, something like this fixed the issue (make sure to adjust cp310); it seems to work with torch 2.7 even though this wheel was built for torch 2.6:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
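A tiny helper sketch for picking the matching wheel tag automatically; the URL pattern is copied from the link above, only the Python tag and the cxx11abi flag are substituted, and there is no guarantee that every combination exists on the release page:

```python
# Build the wheel URL for the workaround above, substituting the local Python
# tag and C++ ABI flag; the torch2.6 tag is kept because that is the wheel
# reported to work with torch 2.7 (no torch2.7 wheel has been published yet).
import sys
import torch

py = f"cp{sys.version_info.major}{sys.version_info.minor}"
abi = "TRUE" if torch._C._GLIBCXX_USE_CXX11_ABI else "FALSE"
url = (
    "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/"
    f"flash_attn-2.7.4.post1+cu12torch2.6cxx11abi{abi}-{py}-{py}-linux_x86_64.whl"
)
print("pip install", url)
```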
I confirm your solution works
For me, for now, something like this fixed the issue (make sure to adjust cp310); it seems to work with torch 2.7 even though this wheel was built for torch 2.6:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
In my case, I can install them (specifically the python 3.11 version), but then ComfyUI throws undefined symbol: _ZN3c104cuda9SetDeviceEi exceptions.
I installed @Zarrac's wheel and it seems to work fine.
@vadimkantorov's solution also seems to work on my end. Thanks!
This will be great:
- https://github.com/Dao-AILab/flash-attention/issues/1696#issuecomment-2966762278
@tridao as advised by @malfet, the issue is PyTorch updated its C++ ABI in 2.6.0cu126, and it stayed this way in 2.7.0cu128:
* [[CXX11ABI] torch 2.6.0-cu126 and cu124 have different exported symbols pytorch/pytorch#152790 (comment)](https://github.com/pytorch/pytorch/issues/152790#issuecomment-2851107741)

so probably a new push of flash_attn binaries to pip is needed for 2.6.0cu126 and >=2.7.0, otherwise users will have to compile from source...
@vadimkantorov @malfet I just spent a lot of coffee writing this up here:
https://github.com/Dao-AILab/flash-attention/issues/1717#issuecomment-2984172823
to find out and explain what you said in like 2 lines a month ago..
@vadimkantorov do you know if this is the only symbol? It's a bit ugly, but it's possible to have a "multi-ABI" library, so adding pre-CXX11 support for just TORCH_CHECK shouldn't be that hard..
Is this something that someone could submit a PR for so we can get working binaries again? Compilers are way out of my wheelhouse.
For me, for now, something like this fixed the issue (make sure to adjust cp310); it seems to work with torch 2.7 even though this wheel was built for torch 2.6:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
It works for me. I changed it to the python 3.11 wheel. My torch version is 2.7.1+cu12.6. At first I installed the wrong whl, the one with abi FALSE, and it showed a new error; after installing the abi TRUE whl, it works.
So does anyone have a solution for cuda 12.8 + torch 2.7 with pip?
It works for me: cuda 12.8, torch 2.7.1, transformers 4.57.0, with an RTX 4090 + 3090.
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiTRUE-cp310-cp310-linux_x86_64.whl