flash-attention
flash_attention doesn't work on Windows + WSL (RTX 5090)
I'll try to summarize: basically I need CUDA 12.8 because of the RTX 5090, so I have to use the cu128 build of torch. That is mostly sorted, but flash-attention is one of the remaining pieces. After getting linking errors I decided to build it myself, so I downloaded the source code onto WSL and followed the build steps, which are quite simple.
After that I ran the tests and, to my amusement, it doesn't work:
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto/diffusion-pipe/flash-attention$ python -m pytest tests/test_flash_attn.py
================================================= test session starts ==================================================
platform linux -- Python 3.12.9, pytest-8.3.5, pluggy-1.5.0
rootdir: /mnt/c/Users/helto/diffusion-pipe/flash-attention/tests
configfile: pyproject.toml
plugins: anyio-4.8.0, langsmith-0.3.15, docker-3.1.2, hydra-core-1.3.2
collected 508772 items
tests/test_flash_attn.py FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 0%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF^C^C
To make sure it wasn't a torch-related problem, I ran a simple test:
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto/diffusion-pipe/flash-attention$ python
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
>>> print(torch.cuda.get_arch_list())
['sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> zeros_tensor_gpu = torch.zeros((50, 50), device='cuda')
>>> zeros_tensor_gpu
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
As you can see, everything works fine: CUDA is there, and sm_120 and compute_120 appear in the arch list.
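For completeness, the compute capability of the card can be checked directly as well; a 5090 (Blackwell) should report (12, 0), i.e. sm_120, so any flash-attn build needs a kernel image for that arch:

import torch

# Compute capability of GPU 0; a 5090 (Blackwell) reports (12, 0), i.e. sm_120.
print(torch.cuda.get_device_capability(0))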
This is the type of error I get from flash_attention:
return_softmax: bool
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
> out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.fwd(
q,
k,
v,
None,
alibi_slopes,
dropout_p,
softmax_scale,
causal,
window_size_left,
window_size_right,
softcap,
return_softmax,
None,
)
E RuntimeError: CUDA error: no kernel image is available for execution on the device
E CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I can't use diffusion_pipe for the same reason: when the flash_attention step comes in, I get the same error. Here's the pip output for the installed flash_attention to show it was indeed installed:
Using /home/hdanilo/miniconda3/lib/python3.12/site-packages
Finished processing dependencies for flash-attn==2.7.4.post1
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto/diffusion-pipe/flash-attention$ python -m pip list | grep flash
DEPRECATION: Loading egg at /home/hdanilo/miniconda3/lib/python3.12/site-packages/flash_attn-2.7.4.post1-py3.12-linux-x86_64.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
flash_attn 2.7.4.post1
flash_attn 2.7.4.post1
Did anyone get around this issue? I've tried many things: pip install, setup.py, downloading someone else's whl. All of them failed hard, and it's been a bit of a blocker for me.
Can you check that this line is executed in setup.py? It sets the compiler flags to compile for the 5090 etc.
https://github.com/Dao-AILab/flash-attention/blob/2f9ef0879a0935c3ca852f7a6a7b7a9c24f41e96/setup.py#L190
this block
if CUDA_HOME is not None:
    if bare_metal_version >= Version("11.8") and "90" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_90,code=sm_90")
    if bare_metal_version >= Version("12.8") and "100" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_100,code=sm_100")
    if bare_metal_version >= Version("12.8") and "120" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_120,code=sm_120")
I can see CUDA_HOME is not None
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ python
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.utils.cpp_extension import (
...     BuildExtension,
...     CppExtension,
...     CUDAExtension,
...     CUDA_HOME,
...     ROCM_HOME,
...     IS_HIP_EXTENSION,
... )
>>>
>>> CUDA_HOME
'/usr'
>>>
Then we move on to:
def get_cuda_bare_metal_version(cuda_dir):
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
    output = raw_output.split()
    release_idx = output.index("release") + 1
    bare_metal_version = parse(output[release_idx].split(",")[0])
    return raw_output, bare_metal_version
Knowing CUDA_HOME is /usr, I have to check that the output of /usr/bin/nvcc -V contains "release" and that the version comes right after that word:
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ /usr/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ /usr/bin/nvcc -V | grep release
Cuda compilation tools, release 12.0, V12.0.140
Seems to match: it's probably getting 12.0, since it splits on "," and takes index 0.
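A quick way to reproduce that parse outside setup.py (a small sketch using the same packaging.version parse helper, with the nvcc line pasted in):

from packaging.version import parse

# The relevant line from `/usr/bin/nvcc -V` on this machine
raw_output = "Cuda compilation tools, release 12.0, V12.0.140"
output = raw_output.split()
release_idx = output.index("release") + 1
print(parse(output[release_idx].split(",")[0]))  # -> 12.0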
Now for the last part
@functools.lru_cache(maxsize=None)
def cuda_archs() -> str:
    return os.getenv("FLASH_ATTN_CUDA_ARCHS", "80;90;100;120").split(";")
This is most likely where the problem is; unless this is defined somewhere within the same script, I don't have that environment variable set!
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ echo $FLASH_ATTN_CUDA_ARCHS
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$
But then it should fall back to the default, "80;90;100;120", which, coming back to the initial block:
if CUDA_HOME is not None:
    if bare_metal_version >= Version("11.8") and "90" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_90,code=sm_90")
    if bare_metal_version >= Version("12.8") and "100" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_100,code=sm_100")
    if bare_metal_version >= Version("12.8") and "120" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_120,code=sm_120")
should append the -gencode flags for all archs.
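(As a side note: since cuda_archs() just splits that env var, FLASH_ATTN_CUDA_ARCHS can also be set explicitly, e.g. to build only the 5090's arch and cut compile time, once the toolchain supports it. A hedged sketch, not something from this thread, assuming every -gencode branch in setup.py is gated on cuda_archs() like the ones above:)

import os
import subprocess

# Hypothetical: restrict the build to sm_120 only; requires an nvcc that supports that arch.
env = dict(os.environ, FLASH_ATTN_CUDA_ARCHS="120", MAX_JOBS="4")
subprocess.run(["python", "setup.py", "install"], env=env, check=True)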
I modified it slightly to test:
if CUDA_HOME is not None:
    print("got here")
    if bare_metal_version >= Version("11.8") and "90" in cuda_archs():
        print("11.8 and 90")
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_90,code=sm_90")
    if bare_metal_version >= Version("12.8") and "100" in cuda_archs():
        print("12.8 and 100")
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_100,code=sm_100")
    if bare_metal_version >= Version("12.8") and "120" in cuda_archs():
        print("12.8 and 120")
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_120,code=sm_120")
    raise Exception("Lets halt here")
and ran it:
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto/diffusion-pipe/flash-attention$ python setup.py install
torch.__version__ = 2.8.0.dev20250327+cu128
got here
11.8 and 90
Traceback (most recent call last):
File "/mnt/c/Users/helto/diffusion-pipe/flash-attention/setup.py", line 195, in <module>
raise Exception("Lets halt here")
Exception: Lets halt here
I hope that helped somehow!
The problem seems to be that bare_metal_version is reporting 12.0, while the condition checks for >= Version("12.8").
Unlike nvcc, nvidia-smi is reporting 12.8; perhaps that's the right place to look for the CUDA version?
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ nvidia-smi
Sat Mar 29 02:21:35 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 572.70 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:0B:00.0 On | N/A |
| 80% 34C P1 84W / 600W | 4735MiB / 32607MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 24 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ which nvidia-smi
/usr/lib/wsl/lib/nvidia-smi
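Worth noting the difference between the two tools (general CUDA behaviour, not something specific to this log): nvidia-smi reports the CUDA version supported by the driver, while nvcc -V reports the toolkit that is actually installed, and the toolkit is what setup.py compiles with. A small sketch to print both the toolkit version and the CUDA version the installed torch wheel was built against:

import subprocess
import torch
from torch.utils.cpp_extension import CUDA_HOME

# CUDA version the installed PyTorch wheel was built against (12.8 for the cu128 nightly)
print("torch.version.cuda:", torch.version.cuda)
# Toolkit that setup.py will actually compile with (12.0 here, hence the failing >= 12.8 checks)
print(subprocess.check_output([CUDA_HOME + "/bin/nvcc", "-V"], text=True))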
I forced that compute_120 gencode statement into setup.py anyway, and not surprisingly I got an error because of that nvcc version.
running build_ext
/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py:479: UserWarning: The detected CUDA version (12.0) has a minor version mismatch with the version that was used to compile PyTorch (12.8). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py:489: UserWarning: There are no g++ version bounds defined for CUDA version 12.0
warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'flash_attn_2_cuda' extension
Emitting ninja build file /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/build.ninja...
Compiling objects...
Using envvar MAX_JOBS (4) as the number of workers...
[1/84] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
nvcc fatal : Unsupported gpu architecture 'compute_120'
[2/84] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
nvcc fatal : Unsupported gpu architecture 'compute_120'
[3/84] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
nvcc fatal : Unsupported gpu architecture 'compute_120'
[4/84] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
nvcc fatal : Unsupported gpu architecture 'compute_120'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2543, in _run_ninja_build
subprocess.run(
File "/home/hdanilo/miniconda3/lib/python3.12/subprocess.py", line 573, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '4']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/c/Users/helto/diffusion-pipe/flash-attention/setup.py", line 604, in <module>
setup(
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/__init__.py", line 117, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 186, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 202, in run_commands
dist.run_commands()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1002, in run_commands
self.run_command(cmd)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/dist.py", line 999, in run_command
super().run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/install.py", line 109, in run
self.do_egg_install()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/install.py", line 167, in do_egg_install
self.run_command('bdist_egg')
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/dist.py", line 999, in run_command
super().run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/bdist_egg.py", line 177, in run
cmd = self.call_command('install_lib', warn_dir=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/bdist_egg.py", line 163, in call_command
self.run_command(cmdname)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/dist.py", line 999, in run_command
super().run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/install_lib.py", line 19, in run
self.build()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/install_lib.py", line 113, in build
self.run_command('build_ext')
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/dist.py", line 999, in run_command
super().run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 99, in run
_build_ext.run(self)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 375, in run
self.build_extensions()
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1041, in build_extensions
build_ext.build_extensions(self)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 491, in build_extensions
self._build_extensions_serial()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 517, in _build_extensions_serial
self.build_extension(ext)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 264, in build_extension
_build_ext.build_extension(self, ext)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 572, in build_extension
objects = self.compiler.compile(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 825, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2196, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2560, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
Now I need to find out how WSL ended up with that version and how to get it back to 12.8.
Funnily enough, nvcc from Windows (outside WSL) is reporting correctly; WSL is version 2.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:38:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
So at some point something corrupted WSL's nvcc, not sure exactly why or where. I'm figuring out how to fix it and will keep this issue updated for future troubleshooters.
Fixing the issue:
- uninstall the cuda-toolkit from WSL; you might want to uninstall it gracefully, but I purged everything with nvidia, cuda and insightface in it
- install it again from https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=runfile_local
- verify that the versions reported by nvcc -V and nvidia-smi are both 12.8
I suspect WSL comes with an outdated version of the toolkit, because I have no memory of installing it.
Right, you need nvcc version >= 12.8 to compile for 5090.
Journey continues...
This time I made sure nvidia-smi and nvcc matched, purged torch, purged triton, downloaded everything again, and compiled flash-attention using python setup.py install.
It took about 5 hours to compile everything with MAX_JOBS=4, probably because of the extra architectures needed for the 5090. But once I ran the tests with python -m pytest tests/test_flash_attn.py, I was able to verify that I still have the same problem:
q = tensor([[[[-9.2480e-01, -4.2529e-01, -2.6445e+00, ..., 1.2852e+00,
7.5732e-01, -8.3154e-01],
[...748e-01, 4.0869e-01, ..., 1.2402e-01,
8.7256e-01, -1.4980e+00]]]], device='cuda:0', dtype=torch.float16)
k = tensor([[[[-1.4443e+00, 2.9565e-01, -2.4976e-01, ..., 5.6299e-01,
2.9443e-01, -1.0242e-01],
[...705e-01, -1.1924e+00, ..., -5.3174e-01,
-7.9297e-01, -1.4980e+00]]]], device='cuda:0', dtype=torch.float16)
v = tensor([[[[-1.2002e+00, 1.6396e+00, -4.3915e-02, ..., -7.9785e-01,
-2.2363e-01, -4.7534e-01],
[...305e+00, -1.0029e+00, ..., 6.4941e-01,
-3.3716e-01, 5.4626e-02]]]], device='cuda:0', dtype=torch.float16)
dropout_p = 0.0, softmax_scale = 0.07905694150420949, causal = True, window_size_left = 684, window_size_right = 559
softcap = 0.0, alibi_slopes = None, return_softmax = False
@_torch_custom_op_wrapper("flash_attn::_flash_attn_forward", mutates_args=(), device_types="cuda")
def _flash_attn_forward(
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
dropout_p: float,
softmax_scale: float,
causal: bool,
window_size_left: int,
window_size_right: int,
softcap: float,
alibi_slopes: Optional[torch.Tensor],
return_softmax: bool
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
> out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.fwd(
q,
k,
v,
None,
alibi_slopes,
dropout_p,
softmax_scale,
causal,
window_size_left,
window_size_right,
softcap,
return_softmax,
None,
)
E RuntimeError: CUDA error: no kernel image is available for execution on the device
E CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
flash_attn/flash_attn_interface.py:96: RuntimeError
So we're dealing with something more, aside from the nvcc and nvidia-smi mismatch.
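One way to narrow it down would be to dump which SM architectures the installed extension actually contains; if there is no sm_120 image in the binary, the "no kernel image is available" error on a 5090 is expected no matter how the rest of the environment looks. A hedged sketch (cuobjdump ships with the CUDA toolkit; flash_attn_2_cuda is the extension module setup.py builds):

import subprocess
import torch  # imported first so the extension's torch symbols resolve
import flash_attn_2_cuda

# List the ELF images (one per SM arch) embedded in the compiled extension
print(subprocess.check_output(["cuobjdump", "--list-elf", flash_attn_2_cuda.__file__], text=True))
# No sm_120 entry means the binary simply has no kernels for the 5090.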
For WSL I have been using Kijai's precompiled wheels for a month with no issues - https://huggingface.co/Kijai/PrecompiledWheels/tree/main
For Windows, I need a solution too, as I want to run Pinokio for YUE on Windows, and it won't work without Flash.
I'm going to try Kijai's to see what's up. I managed to get flash working on Windows, but it took 11 hours to compile. It seems to be working fine there, though to be honest I didn't dig in much to be sure.
Kijai's is giving me the same problem too, so it's certainly a configuration issue, but what's the key factor? Everything else besides flash_attn has been working well: I'm able to use sage_attention, pytorch3d, and pytorch. The only thing limiting me right now is flash_attention.
Can you send a wheel for Windows? :D Mine is compiling right now and I'll leave it overnight, but I have doubts.
Oh dang, I should have saved it; I lost it in the WSL madness. Is there a way to convert the installed egg into a whl?
Actually, following these instructions and building from the x64 Native Tools Command Prompt for VS 2022 helped:
https://huggingface.co/lldacing/flash-attention-windows-wheel
Coming back here to confirm: I purged my WSL2 and created a new distro from scratch, and this time I was able to compile everything correctly. There must have been some lingering .so in the old WSL2 that even reinstalling the cuda-toolkit wasn't enough to clear.
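For anyone retracing this later, a much faster sanity check than the full pytest suite is a single flash_attn_func call; a minimal sketch (shapes and dtype are arbitrary):

import torch
from flash_attn import flash_attn_func

# (batch, seqlen, nheads, headdim), fp16 on the GPU
q, k, v = (torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16) for _ in range(3))
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 128, 8, 64]) if the sm_120 kernels load correctly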