flash-attention
flash_attention doesn't work on Windows + WSL (RTX 5090)
I'll try to summarize: basically I need CUDA 12.8 because of the RTX 5090, so I have to use the cu128 build of torch. That is mostly sorted, but flash-attention is one of the remaining pieces. After getting linking errors I decided to build it myself, so I downloaded the source code onto WSL and followed the build steps, which are quite simple.
After that I ran the tests and, to my amusement, it doesn't work:
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto/diffusion-pipe/flash-attention$ python -m pytest tests/test_flash_attn.py
================================================= test session starts ==================================================
platform linux -- Python 3.12.9, pytest-8.3.5, pluggy-1.5.0
rootdir: /mnt/c/Users/helto/diffusion-pipe/flash-attention/tests
configfile: pyproject.toml
plugins: anyio-4.8.0, langsmith-0.3.15, docker-3.1.2, hydra-core-1.3.2
collected 508772 items
tests/test_flash_attn.py FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 0%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF^C^C
To make sure it wasn't a torch-related problem, I ran a simple test:
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto/diffusion-pipe/flash-attention$ python
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
>>> print(torch.cuda.get_arch_list())
['sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> zeros_tensor_gpu = torch.zeros((50, 50), device='cuda')
>>> zeros_tensor_gpu
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
As you can see, everything works fine: CUDA is there, and sm_120 and compute_120 appear in the arch list.
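For completeness, the compute capability of the card can be checked directly as well; a 5090 (Blackwell) should report (12, 0), i.e. sm_120, so any flash-attn build needs a kernel image for that arch:

import torch

# Compute capability of GPU 0; a 5090 (Blackwell) reports (12, 0), i.e. sm_120.
print(torch.cuda.get_device_capability(0))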
This is the type of error I get from flash_attention:
return_softmax: bool
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
> out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.fwd(
q,
k,
v,
None,
alibi_slopes,
dropout_p,
softmax_scale,
causal,
window_size_left,
window_size_right,
softcap,
return_softmax,
None,
)
E RuntimeError: CUDA error: no kernel image is available for execution on the device
E CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I can't use diffusion_pipe for the same reason: when the flash_attention step comes in, I get the same error. Here's the pip output for the installed flash_attention to show it was indeed installed:
Using /home/hdanilo/miniconda3/lib/python3.12/site-packages
Finished processing dependencies for flash-attn==2.7.4.post1
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto/diffusion-pipe/flash-attention$ python -m pip list | grep flash
DEPRECATION: Loading egg at /home/hdanilo/miniconda3/lib/python3.12/site-packages/flash_attn-2.7.4.post1-py3.12-linux-x86_64.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
flash_attn 2.7.4.post1
flash_attn 2.7.4.post1
Did anyone get around this issue? I've tried many things: pip install, setup.py, downloading someone else's whl. All of them failed hard, and it's been a bit of a blocker for me.
Can you check that this line is executed in setup.py? It sets the compiler flags to compile for the 5090 etc.
https://github.com/Dao-AILab/flash-attention/blob/2f9ef0879a0935c3ca852f7a6a7b7a9c24f41e96/setup.py#L190
this block
if CUDA_HOME is not None:
    if bare_metal_version >= Version("11.8") and "90" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_90,code=sm_90")
    if bare_metal_version >= Version("12.8") and "100" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_100,code=sm_100")
    if bare_metal_version >= Version("12.8") and "120" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_120,code=sm_120")
I can see CUDA_HOME is not None
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ python
Python 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from torch.utils.cpp_extension import (
...     BuildExtension,
...     CppExtension,
...     CUDAExtension,
...     CUDA_HOME,
...     ROCM_HOME,
...     IS_HIP_EXTENSION,
... )
>>>
>>> CUDA_HOME
'/usr'
>>>
Then we move on to:
def get_cuda_bare_metal_version(cuda_dir):
    raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
    output = raw_output.split()
    release_idx = output.index("release") + 1
    bare_metal_version = parse(output[release_idx].split(",")[0])
    return raw_output, bare_metal_version
Knowing CUDA_HOME is /usr, I have to check that the output of /usr/bin/nvcc -V contains "release" and that the version comes right after that word:
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ /usr/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ /usr/bin/nvcc -V | grep release
Cuda compilation tools, release 12.0, V12.0.140
Seems to match: it's probably getting 12.0, since it splits on "," and takes index 0.
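A quick way to reproduce that parse outside setup.py (a small sketch using the same packaging.version parse helper, with the nvcc line pasted in):

from packaging.version import parse

# The relevant line from `/usr/bin/nvcc -V` on this machine
raw_output = "Cuda compilation tools, release 12.0, V12.0.140"
output = raw_output.split()
release_idx = output.index("release") + 1
print(parse(output[release_idx].split(",")[0]))  # -> 12.0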
Now for the last part
@functools.lru_cache(maxsize=None)
def cuda_archs() -> str:
    return os.getenv("FLASH_ATTN_CUDA_ARCHS", "80;90;100;120").split(";")
This is most likely where the problem is; unless this is defined somewhere within the same script, I don't have that environment variable set!
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ echo $FLASH_ATTN_CUDA_ARCHS
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$
But then it should fall back to the default, "80;90;100;120", which, coming back to the initial block:
if CUDA_HOME is not None:
    if bare_metal_version >= Version("11.8") and "90" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_90,code=sm_90")
    if bare_metal_version >= Version("12.8") and "100" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_100,code=sm_100")
    if bare_metal_version >= Version("12.8") and "120" in cuda_archs():
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_120,code=sm_120")
should append the -gencode flags for all archs.
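(As a side note: since cuda_archs() just splits that env var, FLASH_ATTN_CUDA_ARCHS can also be set explicitly, e.g. to build only the 5090's arch and cut compile time, once the toolchain supports it. A hedged sketch, not something from this thread, assuming every -gencode branch in setup.py is gated on cuda_archs() like the ones above:)

import os
import subprocess

# Hypothetical: restrict the build to sm_120 only; requires an nvcc that supports that arch.
env = dict(os.environ, FLASH_ATTN_CUDA_ARCHS="120", MAX_JOBS="4")
subprocess.run(["python", "setup.py", "install"], env=env, check=True)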
I modified it slightly to test:
if CUDA_HOME is not None:
    print("got here")
    if bare_metal_version >= Version("11.8") and "90" in cuda_archs():
        print("11.8 and 90")
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_90,code=sm_90")
    if bare_metal_version >= Version("12.8") and "100" in cuda_archs():
        print("12.8 and 100")
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_100,code=sm_100")
    if bare_metal_version >= Version("12.8") and "120" in cuda_archs():
        print("12.8 and 120")
        cc_flag.append("-gencode")
        cc_flag.append("arch=compute_120,code=sm_120")
    raise Exception("Lets halt here")
and ran it:
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto/diffusion-pipe/flash-attention$ python setup.py install
torch.__version__ = 2.8.0.dev20250327+cu128
got here
11.8 and 90
Traceback (most recent call last):
File "/mnt/c/Users/helto/diffusion-pipe/flash-attention/setup.py", line 195, in <module>
raise Exception("Lets halt here")
Exception: Lets halt here
I hope that helped somehow!
The problem seems to be that bare_metal_version is reporting 12.0, while the condition checks for >= Version("12.8").
Unlike nvcc, nvidia-smi is reporting 12.8; perhaps that's the right place to look for the CUDA version?
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ nvidia-smi
Sat Mar 29 02:21:35 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 572.70 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:0B:00.0 On | N/A |
| 80% 34C P1 84W / 600W | 4735MiB / 32607MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 24 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
(base) hdanilo@DragonRollDev:/mnt/c/Users/helto$ which nvidia-smi
/usr/lib/wsl/lib/nvidia-smi
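Worth noting the difference between the two tools (general CUDA behaviour, not something specific to this log): nvidia-smi reports the CUDA version supported by the driver, while nvcc -V reports the toolkit that is actually installed, and the toolkit is what setup.py compiles with. A small sketch to print both the toolkit version and the CUDA version the installed torch wheel was built against:

import subprocess
import torch
from torch.utils.cpp_extension import CUDA_HOME

# CUDA version the installed PyTorch wheel was built against (12.8 for the cu128 nightly)
print("torch.version.cuda:", torch.version.cuda)
# Toolkit that setup.py will actually compile with (12.0 here, hence the failing >= 12.8 checks)
print(subprocess.check_output([CUDA_HOME + "/bin/nvcc", "-V"], text=True))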
I forced that compute_120 gencode statement into setup.py anyway, and not surprisingly I got an error because of that nvcc version.
running build_ext
/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py:479: UserWarning: The detected CUDA version (12.0) has a minor version mismatch with the version that was used to compile PyTorch (12.8). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py:489: UserWarning: There are no g++ version bounds defined for CUDA version 12.0
warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'flash_attn_2_cuda' extension
Emitting ninja build file /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/build.ninja...
Compiling objects...
Using envvar MAX_JOBS (4) as the number of workers...
[1/84] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
nvcc fatal : Unsupported gpu architecture 'compute_120'
[2/84] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
nvcc fatal : Unsupported gpu architecture 'compute_120'
[3/84] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
nvcc fatal : Unsupported gpu architecture 'compute_120'
[4/84] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o.d -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src -I/mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/cutlass/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include -I/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/home/hdanilo/miniconda3/include/python3.12 -c -c /mnt/c/Users/helto/diffusion-pipe/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu -o /mnt/c/Users/helto/diffusion-pipe/flash-attention/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_120,code=sm_120 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
nvcc fatal : Unsupported gpu architecture 'compute_120'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2543, in _run_ninja_build
subprocess.run(
File "/home/hdanilo/miniconda3/lib/python3.12/subprocess.py", line 573, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '4']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/c/Users/helto/diffusion-pipe/flash-attention/setup.py", line 604, in <module>
setup(
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/__init__.py", line 117, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 186, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 202, in run_commands
dist.run_commands()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1002, in run_commands
self.run_command(cmd)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/dist.py", line 999, in run_command
super().run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/install.py", line 109, in run
self.do_egg_install()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/install.py", line 167, in do_egg_install
self.run_command('bdist_egg')
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/dist.py", line 999, in run_command
super().run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/bdist_egg.py", line 177, in run
cmd = self.call_command('install_lib', warn_dir=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/bdist_egg.py", line 163, in call_command
self.run_command(cmdname)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/dist.py", line 999, in run_command
super().run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/install_lib.py", line 19, in run
self.build()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/install_lib.py", line 113, in build
self.run_command('build_ext')
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/dist.py", line 999, in run_command
super().run_command(command)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 99, in run
_build_ext.run(self)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 375, in run
self.build_extensions()
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1041, in build_extensions
build_ext.build_extensions(self)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 491, in build_extensions
self._build_extensions_serial()
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 517, in _build_extensions_serial
self.build_extension(ext)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 264, in build_extension
_build_ext.build_extension(self, ext)
File "/home/hdanilo/.local/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 572, in build_extension
objects = self.compiler.compile(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 825, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2196, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/home/hdanilo/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2560, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
Now I need to find out how WSL ended up with that version and how to get it back to 12.8.
Funnily enough, nvcc from Windows (outside WSL) is reporting correctly; WSL is version 2.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:38:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
So at some point something corrupted WSL's nvcc, not sure exactly why or where. I'm figuring out how to fix it and will keep this issue updated for future troubleshooters.
Fixing the issue:
- uninstall the cuda-toolkit from WSL; you might want to uninstall it gracefully, but I purged everything with nvidia, cuda and insightface in it
- install it again from https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=runfile_local
- verify that the versions reported by nvcc -V and nvidia-smi are both 12.8
I suspect WSL comes with an outdated version of the toolkit, because I have no memory of installing it.
Right, you need nvcc version >= 12.8 to compile for 5090.
Journey continues...
This time I made sure nvidia-smi and nvcc matched, purged torch, purged triton, downloaded everything again, and compiled flash-attention using python setup.py install.
It took about 5 hours to compile everything with MAX_JOBS=4, probably because of the extra architectures needed for the 5090. But once I ran the tests with python -m pytest tests/test_flash_attn.py, I was able to verify that I still have the same problem:
q = tensor([[[[-9.2480e-01, -4.2529e-01, -2.6445e+00, ..., 1.2852e+00,
7.5732e-01, -8.3154e-01],
[...748e-01, 4.0869e-01, ..., 1.2402e-01,
8.7256e-01, -1.4980e+00]]]], device='cuda:0', dtype=torch.float16)
k = tensor([[[[-1.4443e+00, 2.9565e-01, -2.4976e-01, ..., 5.6299e-01,
2.9443e-01, -1.0242e-01],
[...705e-01, -1.1924e+00, ..., -5.3174e-01,
-7.9297e-01, -1.4980e+00]]]], device='cuda:0', dtype=torch.float16)
v = tensor([[[[-1.2002e+00, 1.6396e+00, -4.3915e-02, ..., -7.9785e-01,
-2.2363e-01, -4.7534e-01],
[...305e+00, -1.0029e+00, ..., 6.4941e-01,
-3.3716e-01, 5.4626e-02]]]], device='cuda:0', dtype=torch.float16)
dropout_p = 0.0, softmax_scale = 0.07905694150420949, causal = True, window_size_left = 684, window_size_right = 559
softcap = 0.0, alibi_slopes = None, return_softmax = False
@_torch_custom_op_wrapper("flash_attn::_flash_attn_forward", mutates_args=(), device_types="cuda")
def _flash_attn_forward(
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
dropout_p: float,
softmax_scale: float,
causal: bool,
window_size_left: int,
window_size_right: int,
softcap: float,
alibi_slopes: Optional[torch.Tensor],
return_softmax: bool
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
> out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.fwd(
q,
k,
v,
None,
alibi_slopes,
dropout_p,
softmax_scale,
causal,
window_size_left,
window_size_right,
softcap,
return_softmax,
None,
)
E RuntimeError: CUDA error: no kernel image is available for execution on the device
E CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
E For debugging consider passing CUDA_LAUNCH_BLOCKING=1
E Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
flash_attn/flash_attn_interface.py:96: RuntimeError
So we're dealing with something more, aside from the nvcc and nvidia-smi mismatch.
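One way to narrow it down would be to dump which SM architectures the installed extension actually contains; if there is no sm_120 image in the binary, the "no kernel image is available" error on a 5090 is expected no matter how the rest of the environment looks. A hedged sketch (cuobjdump ships with the CUDA toolkit; flash_attn_2_cuda is the extension module setup.py builds):

import subprocess
import torch  # imported first so the extension's torch symbols resolve
import flash_attn_2_cuda

# List the ELF images (one per SM arch) embedded in the compiled extension
print(subprocess.check_output(["cuobjdump", "--list-elf", flash_attn_2_cuda.__file__], text=True))
# No sm_120 entry means the binary simply has no kernels for the 5090.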
For WSL I have been using Kijai's precompiled wheels for a month with no issues - https://huggingface.co/Kijai/PrecompiledWheels/tree/main
For Windows, I need a solution too, as I want to run Pinokio for YUE on Windows, and it won't work without Flash.
I'm going to try Kijai's to see what's up. I managed to get flash working on Windows, but it took 11 hours to compile. It seems to be working fine there, though to be honest I didn't dig in much to be sure.
Kijai's is giving me the same problem too, so it's certainly a configuration issue, but what's the key factor? Everything else besides flash_attn has been working well: I'm able to use sage_attention, pytorch3d, and pytorch. The only thing limiting me right now is flash_attention.
Can you send a wheel for Windows? :D Mine is compiling right now and I'll leave it overnight, but I have doubts.
Oh dang, I should have saved it; I lost it in the WSL madness. Is there a way to convert the installed egg into a whl?
Actually, following these instructions and building from the x64 Native Tools Command Prompt for VS 2022 helped:
https://huggingface.co/lldacing/flash-attention-windows-wheel
Coming back here to confirm: I purged my WSL2 and created a new distro from scratch, and this time I was able to compile everything correctly. There must have been some lingering .so in the old WSL2 that even reinstalling the cuda-toolkit wasn't enough to clear.
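For anyone retracing this later, a much faster sanity check than the full pytest suite is a single flash_attn_func call; a minimal sketch (shapes and dtype are arbitrary):

import torch
from flash_attn import flash_attn_func

# (batch, seqlen, nheads, headdim), fp16 on the GPU
q, k, v = (torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16) for _ in range(3))
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 128, 8, 64]) if the sm_120 kernels load correctly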