Problems installing flash_attn with PyTorch 2.8.0+cu128
Hello, when installing the latest version of flash_attn (2.8.2), I get the error below.
Hardware: RTX 5090, CUDA 12.8
Building wheel for flash_attn (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [341 lines of output]
torch.__version__ = 2.8.0+cu128
/venv/main/lib/python3.12/site-packages/setuptools/__init__.py:92: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
!!
********************************************************************************
Requirements should be satisfied by a PEP 517 installer.
If you are using pip, you can try `pip install --use-pep517`.
By 2025-Oct-31, you need to update your project and remove deprecated calls
or your builds will no longer be supported.
********************************************************************************
!!
dist.fetch_build_eggs(dist.setup_requires)
/venv/main/lib/python3.12/site-packages/setuptools/dist.py:759: SetuptoolsDeprecationWarning: License classifiers are deprecated.
!!
********************************************************************************
Please consider removing the following classifiers in favor of a SPDX license expression:
License :: OSI Approved :: BSD License
See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
********************************************************************************
!!
self._finalize_license_expression()
running bdist_wheel
Guessing wheel URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.2/flash_attn-2.8.2+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
Precompiled wheel not found. Building from source...
running build
running build_py
creating build/lib.linux-x86_64-cpython-312/flash_attn
copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn
copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-312/flash_attn
copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn
copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-312/flash_attn
copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-312/flash_attn
copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-312/flash_attn
copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn
copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-312/flash_attn
creating build/lib.linux-x86_64-cpython-312/hopper
copying hopper/__init__.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/benchmark_attn.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/benchmark_flash_attention_fp8.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/benchmark_mla_decode.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/benchmark_split_kv.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/generate_kernels.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/padding.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/setup.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/test_attn_kvcache.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/test_flash_attn.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/test_kvcache.py -> build/lib.linux-x86_64-cpython-312/hopper
copying hopper/test_util.py -> build/lib.linux-x86_64-cpython-312/hopper
creating build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/bench.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/bwd_prefill.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/bwd_prefill_fused.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/bwd_prefill_onekernel.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/bwd_prefill_split.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/bwd_ref.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/fp8.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/fwd_decode.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/fwd_prefill.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/fwd_ref.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/interface_fa.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/test.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/train.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
copying flash_attn/flash_attn_triton_amd/utils.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
creating build/lib.linux-x86_64-cpython-312/flash_attn/layers
copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
copying flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
creating build/lib.linux-x86_64-cpython-312/flash_attn/losses
copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses
copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses
creating build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/baichuan.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/bigcode.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/btlm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
creating build/lib.linux-x86_64-cpython-312/flash_attn/modules
copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
creating build/lib.linux-x86_64-cpython-312/flash_attn/ops
copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
creating build/lib.linux-x86_64-cpython-312/flash_attn/utils
copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
copying flash_attn/utils/library.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
copying flash_attn/utils/testing.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
copying flash_attn/utils/torch.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
creating build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
copying flash_attn/ops/triton/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
copying flash_attn/ops/triton/cross_entropy.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
copying flash_attn/ops/triton/k_activations.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
copying flash_attn/ops/triton/layer_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
copying flash_attn/ops/triton/linear.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
copying flash_attn/ops/triton/mlp.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
copying flash_attn/ops/triton/rotary.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
running build_ext
W0806 17:36:58.727000 1456 site-packages/torch/utils/cpp_extension.py:517] There are no g++ version bounds defined for CUDA version 12.8
building 'flash_attn_2_cuda' extension
creating /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn
creating /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src
[1/73] c++ -MMD -MF /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/flash_api.o.d -pthread -B /venv/main/compiler_compat -fno-strict-overflow -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /venv/main/include -fPIC -O2 -isystem /venv/main/include -fPIC -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/flash_api.cpp -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/flash_api.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
[2/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
Killed
Killed
[3/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
Killed
Killed
[4/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim32_fp16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim32_fp16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
Killed
Killed
[5/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
Killed
Killed
[6/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
Killed
Killed
[7/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
Killed
Killed
[8/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim64_bf16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_causal_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim64_bf16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
Killed
[9/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
Killed
[10/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim32_fp16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_causal_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_causal_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim32_fp16_causal_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
[20/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_bf16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim32_bf16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
[21/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_fp16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim64_fp16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_fp16_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_fp16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim64_fp16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim64_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
[22/73] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o.d -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src -I/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/cutlass/include -I/venv/main/lib/python3.12/site-packages/torch/include -I/venv/main/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/venv/main/include/python3.12 -c -c /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.cu -o /tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1018"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
Killed
Killed
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/setup.py", line 485, in run
urllib.request.urlretrieve(wheel_url, wheel_filename)
File "/venv/main/lib/python3.12/urllib/request.py", line 240, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/urllib/request.py", line 215, in urlopen
return opener.open(url, data, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/urllib/request.py", line 521, in open
response = meth(req, response)
^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/urllib/request.py", line 630, in http_response
response = self.parent.error(
^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/urllib/request.py", line 559, in error
return self._call_chain(*args)
^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/urllib/request.py", line 492, in _call_chain
result = func(*args)
^^^^^^^^^^^
File "/venv/main/lib/python3.12/urllib/request.py", line 639, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/venv/main/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2595, in _run_ninja_build
subprocess.run(
File "/venv/main/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '21']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 35, in <module>
File "/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/setup.py", line 525, in <module>
setup(
File "/venv/main/lib/python3.12/site-packages/setuptools/__init__.py", line 115, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 186, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 202, in run_commands
dist.run_commands()
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1002, in run_commands
self.run_command(cmd)
File "/venv/main/lib/python3.12/site-packages/setuptools/dist.py", line 1102, in run_command
super().run_command(command)
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/tmp/pip-install-sbtoz0v8/flash-attn_da444315718f4881b4a57c626ba9d218/setup.py", line 502, in run
super().run()
File "/venv/main/lib/python3.12/site-packages/setuptools/command/bdist_wheel.py", line 370, in run
self.run_command("build")
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "/venv/main/lib/python3.12/site-packages/setuptools/dist.py", line 1102, in run_command
super().run_command(command)
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "/venv/main/lib/python3.12/site-packages/setuptools/dist.py", line 1102, in run_command
super().run_command(command)
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in run_command
cmd_obj.run()
File "/venv/main/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 96, in run
_build_ext.run(self)
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 368, in run
self.build_extensions()
File "/venv/main/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1072, in build_extensions
build_ext.build_extensions(self)
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 484, in build_extensions
self._build_extensions_serial()
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 510, in _build_extensions_serial
self.build_extension(ext)
File "/venv/main/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 261, in build_extension
_build_ext.build_extension(self, ext)
File "/venv/main/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line 565, in build_extension
objects = self.compiler.compile(
^^^^^^^^^^^^^^^^^^^^^^
File "/venv/main/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 856, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/venv/main/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2227, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/venv/main/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2612, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash_attn
Running setup.py clean for flash_attn
Failed to build flash_attn
ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash_attn)
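The repeated `Killed` lines in the log above are the usual signature of the Linux OOM killer terminating nvcc: ninja was launched with `-j 21`, and each flash_attn compilation unit can need several gigabytes of RAM. The flash-attention README documents capping parallelism via the `MAX_JOBS` environment variable (honored by torch's cpp_extension build). A rough sketch for picking a job count, assuming roughly 4 GB per nvcc job (a rule of thumb, not a measured figure):

```shell
# Estimate available RAM in GB (Linux /proc/meminfo; falls back to 8 if unreadable)
free_gb=$(awk '/MemAvailable/ {print int($2 / 1024 / 1024)}' /proc/meminfo 2>/dev/null)
# Assume ~4 GB of RAM per parallel nvcc job; never go below 1
max_jobs=$(( ${free_gb:-8} / 4 ))
[ "$max_jobs" -ge 1 ] || max_jobs=1
echo "suggested MAX_JOBS=$max_jobs"
```

Then retry the install with that cap, e.g. `MAX_JOBS=4 pip install flash-attn --no-build-isolation`. The build takes much longer but should no longer be killed mid-compile.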
A similar problem occurs when compiling the FlashAttention-3 `hopper` directory inside the Docker image pytorch/pytorch:2.8.0-cuda12.6-cudnn9-devel:
~/flash-attention/hopper$ python setup.py install
Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path '../csrc/cutlass'
Cloning into '/root/flash-attention/csrc/cutlass'...
Submodule path '../csrc/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b'
torch.__version__ = 2.8.0+cu126
downloading and extracting https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvcc/linux-x86_64/cuda_nvcc-linux-x86_64-12.6.85-archive.tar.xz ...
copy /root/.flashattn/nvidia/nvcc/cuda_nvcc-linux-x86_64-12.6.85-archive/bin to /root/flash-attention/hopper/../third_party/nvidia/backend/bin ...
downloading and extracting https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvcc/linux-x86_64/cuda_nvcc-linux-x86_64-12.8.93-archive.tar.xz ...
copy /root/.flashattn/nvidia/ptxas/cuda_nvcc-linux-x86_64-12.8.93-archive/bin/ptxas to /root/flash-attention/hopper/../third_party/nvidia/backend/bin ...
copy /root/.flashattn/nvidia/ptxas/cuda_nvcc-linux-x86_64-12.8.93-archive/nvvm/bin to /root/flash-attention/hopper/../third_party/nvidia/backend/nvvm/bin ...
/opt/conda/lib/python3.11/site-packages/setuptools/dist.py:334: InformationOnly: Normalizing '3.0.0.b1' to '3.0.0b1'
self.metadata.version = self._normalize_version(self.metadata.version)
/opt/conda/lib/python3.11/site-packages/setuptools/dist.py:759: SetuptoolsDeprecationWarning: License classifiers are deprecated.
!!
********************************************************************************
Please consider removing the following classifiers in favor of a SPDX license expression:
License :: OSI Approved :: Apache Software License
See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
********************************************************************************
!!
self._finalize_license_expression()
running install
/opt/conda/lib/python3.11/site-packages/setuptools/_distutils/cmd.py:90: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` directly.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
********************************************************************************
!!
self.initialize_options()
/opt/conda/lib/python3.11/site-packages/setuptools/_distutils/cmd.py:90: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` and ``easy_install``.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
See https://github.com/pypa/setuptools/issues/917 for details.
********************************************************************************
!!
self.initialize_options()
running bdist_egg
running egg_info
creating flash_attn_3.egg-info
writing flash_attn_3.egg-info/PKG-INFO
writing dependency_links to flash_attn_3.egg-info/dependency_links.txt
writing requirements to flash_attn_3.egg-info/requires.txt
writing top-level names to flash_attn_3.egg-info/top_level.txt
writing manifest file 'flash_attn_3.egg-info/SOURCES.txt'
reading manifest file 'flash_attn_3.egg-info/SOURCES.txt'
writing manifest file 'flash_attn_3.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/lib.linux-x86_64-cpython-311
copying flash_attn_interface.py -> build/lib.linux-x86_64-cpython-311
running build_ext
W0807 01:33:42.344000 740 site-packages/torch/utils/cpp_extension.py:517] There are no g++ version bounds defined for CUDA version 12.6
building 'flash_attn_3._C' extension
creating /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311
creating /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations
[1/133] c++ -MMD -MF /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/flash_api.o.d -pthread -B /opt/conda/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/include -fPIC -O2 -isystem /opt/conda/include -fPIC -I/root/flash-attention/hopper -I/root/flash-attention/csrc/cutlass/include -I/opt/conda/lib/python3.11/site-packages/torch/include -I/opt/conda/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/opt/conda/include/python3.11 -c -c /root/flash-attention/hopper/flash_api.cpp -o /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/flash_api.o -O3 -std=c++17 -DPy_LIMITED_API=0x03090000 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
[2/133] /root/flash-attention/third_party/nvidia/backend/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/flash_prepare_scheduler.o.d -I/root/flash-attention/hopper -I/root/flash-attention/csrc/cutlass/include -I/opt/conda/lib/python3.11/site-packages/torch/include -I/opt/conda/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/opt/conda/include/python3.11 -c -c /root/flash-attention/hopper/flash_prepare_scheduler.cu -o /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/flash_prepare_scheduler.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --threads 2 -O3 -std=c++17 --ftemplate-backtrace-limit=0 --use_fast_math --resource-usage -lineinfo -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_GDC_FOR_SM90 -DCUTLASS_DEBUG_TRACE_LEVEL=0 -DNDEBUG -gencode arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode arch=compute_80,code=sm_80
ptxas info : 9 bytes gmem, 72 bytes cmem[4]
ptxas info : Compiling entry function '_ZN5flash32prepare_varlen_num_blocks_kernelEiiiPKiS1_S1_S1_S1_S1_iiiiiN7cutlass10FastDivmodES3_PiS4_b' for 'sm_80'
ptxas info : Function properties for _ZN5flash32prepare_varlen_num_blocks_kernelEiiiPKiS1_S1_S1_S1_S1_iiiiiN7cutlass10FastDivmodES3_PiS4_b
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, used 1 barriers, 4 bytes smem, 481 bytes cmem[0]
ptxas info : Compile time = 15.744 ms
ptxas info : 9 bytes gmem
ptxas info : Compiling entry function '_ZN5flash32prepare_varlen_num_blocks_kernelEiiiPKiS1_S1_S1_S1_S1_iiiiiN7cutlass10FastDivmodES3_PiS4_b' for 'sm_90a'
ptxas info : Function properties for _ZN5flash32prepare_varlen_num_blocks_kernelEiiiPKiS1_S1_S1_S1_S1_iiiiiN7cutlass10FastDivmodES3_PiS4_b
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 13 registers, used 1 barriers, 4 bytes smem
ptxas info : Compile time = 219.001 ms
[3/133] /root/flash-attention/third_party/nvidia/backend/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_bwd_hdim96_fp16_softcap_sm80.o.d -I/root/flash-attention/hopper -I/root/flash-attention/csrc/cutlass/include -I/opt/conda/lib/python3.11/site-packages/torch/include -I/opt/conda/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/opt/conda/include/python3.11 -c -c /root/flash-attention/hopper/instantiations/flash_bwd_hdim96_fp16_softcap_sm80.cu -o /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_bwd_hdim96_fp16_softcap_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --threads 2 -O3 -std=c++17 --ftemplate-backtrace-limit=0 --use_fast_math --resource-usage -lineinfo -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_GDC_FOR_SM90 -DCUTLASS_DEBUG_TRACE_LEVEL=0 -DNDEBUG -gencode arch=compute_80,code=sm_80 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
FAILED: /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_bwd_hdim96_fp16_softcap_sm80.o
/root/flash-attention/third_party/nvidia/backend/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_bwd_hdim96_fp16_softcap_sm80.o.d -I/root/flash-attention/hopper -I/root/flash-attention/csrc/cutlass/include -I/opt/conda/lib/python3.11/site-packages/torch/include -I/opt/conda/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/opt/conda/include/python3.11 -c -c /root/flash-attention/hopper/instantiations/flash_bwd_hdim96_fp16_softcap_sm80.cu -o /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_bwd_hdim96_fp16_softcap_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --threads 2 -O3 -std=c++17 --ftemplate-backtrace-limit=0 --use_fast_math --resource-usage -lineinfo -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_GDC_FOR_SM90 -DCUTLASS_DEBUG_TRACE_LEVEL=0 -DNDEBUG -gencode arch=compute_80,code=sm_80 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
Killed
[4/133] /root/flash-attention/third_party/nvidia/backend/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_fwd_hdimdiff_bf16_paged_sm90.o.d -I/root/flash-attention/hopper -I/root/flash-attention/csrc/cutlass/include -I/opt/conda/lib/python3.11/site-packages/torch/include -I/opt/conda/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/opt/conda/include/python3.11 -c -c /root/flash-attention/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_sm90.cu -o /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_fwd_hdimdiff_bf16_paged_sm90.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --threads 2 -O3 -std=c++17 --ftemplate-backtrace-limit=0 --use_fast_math --resource-usage -lineinfo -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_GDC_FOR_SM90 -DCUTLASS_DEBUG_TRACE_LEVEL=0 -DNDEBUG -gencode arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
FAILED: /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_fwd_hdimdiff_bf16_paged_sm90.o
/root/flash-attention/third_party/nvidia/backend/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_fwd_hdimdiff_bf16_paged_sm90.o.d -I/root/flash-attention/hopper -I/root/flash-attention/csrc/cutlass/include -I/opt/conda/lib/python3.11/site-packages/torch/include -I/opt/conda/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/opt/conda/include/python3.11 -c -c /root/flash-attention/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_sm90.cu -o /root/flash-attention/hopper/build/temp.linux-x86_64-cpython-311/instantiations/flash_fwd_hdimdiff_bf16_paged_sm90.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' --threads 2 -O3 -std=c++17 --ftemplate-backtrace-limit=0 --use_fast_math --resource-usage -lineinfo -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_GDC_FOR_SM90 -DCUTLASS_DEBUG_TRACE_LEVEL=0 -DNDEBUG -gencode arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
Killed
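A note on the `Killed` lines above: they typically mean the Linux OOM killer terminated an nvcc job, since each parallel kernel compilation can take several GB of RAM. A possible workaround (a sketch, assuming flash-attn's setup.py honors the `MAX_JOBS` environment variable as its README describes, and that `free` is available) is to cap parallelism before building:

```shell
# Rough heuristic: allow one nvcc job per ~4 GB of available RAM,
# since each flash-attn kernel instantiation is memory-hungry.
avail_gb=$(free -g | awk '/^Mem:/ {print $7}')
jobs=$(( avail_gb / 4 ))
[ "$jobs" -lt 1 ] && jobs=1
export MAX_JOBS=$jobs
echo "MAX_JOBS=$MAX_JOBS"
```

Then run `pip install flash-attn --no-build-isolation` in the same shell; the build is slower but should no longer be killed.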
I had a similar issue. It seems it could be a PyTorch version mismatch. I pinned the version to 2.7.1 like this:
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
and that solved the issue.
PyTorch 2.8 was just released and ComfyUI uses it by default, so it would be great if we could get some PyTorch 2.8 wheels for this.
+1 on this. Just upgraded to torch 2.8 / CUDA 12.8, and wheels would be greatly appreciated.
Hopefully this other repo will have some soon: https://github.com/mjun0812/flash-attention-prebuild-wheels/issues/28#issuecomment-3172385213
Same issue here: latest PyTorch (2.8), default CUDA (12.8), fresh install, and I can't install flash attention 2. Env: Python 3.12, WSL2.
uv pip install torch torchvision && uv pip install keras keras-rs && uv pip install flash-attn --no-build-isolation && apt install git-lfs
Using Python 3.12.11 environment at: /home/ddofer/anaconda3/envs/dna
Audited 2 packages in 64ms
Using Python 3.12.11 environment at: /home/ddofer/anaconda3/envs/dna
Audited 2 packages in 6ms
Using Python 3.12.11 environment at: /home/ddofer/anaconda3/envs/dna
Resolved 27 packages in 357ms
× Failed to build `flash-attn==2.8.2`
├─▶ The build backend returned an error
╰─▶ Call to `setuptools.build_meta:legacy.build_wheel` failed (exit status: 1)
[stdout]
torch.__version__ = 2.8.0+cu128
running bdist_wheel
Guessing wheel URL:
https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.2/flash_attn-2.8.2+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
Precompiled wheel not found. Building from source...
[...]
W0812 12:25:35.984000 9184 site-packages/torch/utils/cpp_extension.py:517] There are no g++ version bounds defined for
CUDA version 12.8
Traceback (most recent call last):
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2595, in
_run_ninja_build
subprocess.run(
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '4']' returned non-zero exit status 255.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 11, in <module>
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/build_meta.py", line 435, in build_wheel
return _build(['bdist_wheel'])
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/build_meta.py", line 426, in _build
return self._build_with_temp_dir(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/build_meta.py", line 407, in
_build_with_temp_dir
self.run_setup()
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/build_meta.py", line 522, in run_setup
super().run_setup(setup_script=setup_script)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/build_meta.py", line 320, in run_setup
exec(code, locals())
File "<string>", line 525, in <module>
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/__init__.py", line 117, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 186, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/core.py", line 202, in
run_commands
dist.run_commands()
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1002, in
run_commands
self.run_command(cmd)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/dist.py", line 1104, in run_command
super().run_command(command)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in
run_command
cmd_obj.run()
File "<string>", line 502, in run
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/command/bdist_wheel.py", line 370, in
run
self.run_command("build")
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in
run_command
self.distribution.run_command(command)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/dist.py", line 1104, in run_command
super().run_command(command)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in
run_command
cmd_obj.run()
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/command/build.py", line 135,
in run
self.run_command(cmd_name)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/cmd.py", line 357, in
run_command
self.distribution.run_command(command)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/dist.py", line 1104, in run_command
super().run_command(command)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/dist.py", line 1021, in
run_command
cmd_obj.run()
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 99, in run
_build_ext.run(self)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line
368, in run
self.build_extensions()
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1072, in
build_extensions
build_ext.build_extensions(self)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line
484, in build_extensions
self._build_extensions_serial()
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line
510, in _build_extensions_serial
self.build_extension(ext)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/command/build_ext.py", line 264, in
build_extension
_build_ext.build_extension(self, ext)
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py", line
565, in build_extension
objects = self.compiler.compile(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 856, in
unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2227, in
_write_ninja_file_and_compile_objects
_run_ninja_build(
File "/home/ddofer/anaconda3/envs/dna/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2612, in
_run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
hint: This usually indicates a problem with the package or the build environment.
I had a similar issue. It seems it could be a PyTorch version mismatch. I pinned the version to 2.7.1 like this:
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
and it solved the issue.
Unfortunately, this did not work for me. I get this error when running:
python3.11/site-packages/flash_attn/flash_attn_interface.py", line 15, in <module>
import flash_attn_2_cuda as flash_attn_gpu
ImportError: /lambda/nfs/rili/rl_venv/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEab
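For what it's worth, an undefined-symbol ImportError like this usually means pip reused a cached flash_attn wheel that was compiled against a different torch build than the one now installed. A sketch of a clean rebuild against the current torch (standard pip commands; the flags just disable the wheel cache and build isolation):

```shell
# Remove the stale extension and any cached wheel compiled
# against the previous torch, then rebuild from source.
pip uninstall -y flash-attn
pip cache purge
pip install flash-attn --no-build-isolation --no-cache-dir
```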
me too
One more thing to consider regarding this issue: you will most likely have problems installing flash_attn 2.8 on Ubuntu 20.04 and below because of a glibc error:
ImportError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/user/.local/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so)
Thus even if you downgrade torch to 2.7 on Ubuntu 20.04, you won't be able to run flash_attn 2.8.0 or higher.
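You can check whether this applies to your system before installing. The error above indicates the prebuilt binary wants the versioned symbol GLIBC_2.32, and Ubuntu 20.04 ships glibc 2.31, so a quick version check tells you up front:

```shell
# Print the system glibc version; the prebuilt flash_attn >= 2.8
# binaries reportedly need GLIBC_2.32 or newer.
ldd --version | head -n1
```

If the reported version is below 2.32, building from source on that machine (or upgrading the distro) is likely the only option.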
We have new wheels for torch 2.8 now
I got a similar error when installing v2.8.3 with PyTorch 2.8.0.
env:
=== System Information ===
OS: Windows-11-10.0.26100-SP0
Python: 3.12.10 (CPython)
Executable: C:\sources\Qwen1\.venv\Scripts\python.exe
=== PyTorch Information ===
PyTorch version: 2.8.0+cu129
Debug build: False
CUDA available: True
CUDA version (from torch): 12.9
cuDNN enabled: True
cuDNN version: 91002
GPU count: 1
- GPU 0: NVIDIA GeForce RTX 5080
=== pip freeze ===
accelerate==1.10.0
aiofiles==24.1.0
annotated-types==0.7.0
anyio==4.10.0
Brotli==1.1.0
certifi==2025.8.3
charset-normalizer==3.4.3
click==8.2.1
colorama==0.4.6
einops==0.8.1
fastapi==0.116.1
ffmpy==0.6.1
filelock==3.13.1
frida==16.7.19
frida-tools==13.7.1
fsspec==2024.6.1
gradio==5.42.0
gradio_client==1.11.1
groovy==0.1.2
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
huggingface-hub==0.34.4
idna==3.10
Jinja2==3.1.4
latex2mathml==3.78.0
Markdown==3.8.2
markdown-it-py==4.0.0
MarkupSafe==2.1.5
mdtex2html==1.3.1
mdurl==0.1.2
mpmath==1.3.0
networkx==3.3
numpy==2.1.2
orjson==3.11.2
packaging==25.0
pandas==2.3.1
pillow==11.0.0
psutil==7.0.0
pydantic==2.11.7
pydantic_core==2.33.2
pydub==0.25.1
Pygments==2.19.2
python-dateutil==2.9.0.post0
python-multipart==0.0.20
pytz==2025.2
PyYAML==6.0.2
regex==2025.7.34
requests==2.32.4
rich==14.1.0
ruff==0.12.9
safehttpx==0.1.6
safetensors==0.6.2
scipy==1.16.1
semantic-version==2.10.0
setuptools==80.9.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
starlette==0.47.2
sympy==1.13.3
tiktoken==0.11.0
tokenizers==0.21.4
tomlkit==0.13.3
torch==2.8.0+cu129
torchvision==0.23.0+cu129
tqdm==4.67.1
transformers==4.55.2
transformers-stream-generator==0.0.5
typer==0.16.0
typing-inspection==0.4.1
typing_extensions==4.12.2
tzdata==2025.2
urllib3==2.5.0
uvicorn==0.35.0
websockets==15.0.1
wheel==0.40.0
error:
python setup.py install
Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'csrc/composable_kernel'
Cloning into 'C:/sources/Qwen1/flash-attention/csrc/composable_kernel'...
Submodule path 'csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb'
Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'csrc/cutlass'
Cloning into 'C:/sources/Qwen1/flash-attention/csrc/cutlass'...
Submodule path 'csrc/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b'
torch.__version__ = 2.8.0+cu129
C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\__init__.py:92: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
!!
********************************************************************************
Requirements should be satisfied by a PEP 517 installer.
If you are using pip, you can try `pip install --use-pep517`.
By 2025-Oct-31, you need to update your project and remove deprecated calls
or your builds will no longer be supported.
********************************************************************************
!!
dist.fetch_build_eggs(dist.setup_requires)
C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\dist.py:759: SetuptoolsDeprecationWarning: License classifiers are deprecated.
!!
********************************************************************************
Please consider removing the following classifiers in favor of a SPDX license expression:
License :: OSI Approved :: BSD License
See https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license for details.
********************************************************************************
!!
self._finalize_license_expression()
running install
C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\cmd.py:90: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` directly.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
By 2025-Oct-31, you need to update your project and remove deprecated calls
or your builds will no longer be supported.
See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
********************************************************************************
!!
self.initialize_options()
running build
running build_py
creating build\lib.win-amd64-cpython-312\flash_attn
copying flash_attn\bert_padding.py -> build\lib.win-amd64-cpython-312\flash_attn
copying flash_attn\flash_attn_interface.py -> build\lib.win-amd64-cpython-312\flash_attn
copying flash_attn\flash_attn_triton.py -> build\lib.win-amd64-cpython-312\flash_attn
copying flash_attn\flash_attn_triton_og.py -> build\lib.win-amd64-cpython-312\flash_attn
copying flash_attn\flash_blocksparse_attention.py -> build\lib.win-amd64-cpython-312\flash_attn
copying flash_attn\flash_blocksparse_attn_interface.py -> build\lib.win-amd64-cpython-312\flash_attn
copying flash_attn\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn
creating build\lib.win-amd64-cpython-312\hopper
copying hopper\benchmark_attn.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\benchmark_flash_attention_fp8.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\benchmark_mla_decode.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\benchmark_split_kv.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\flash_attn_interface.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\generate_kernels.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\padding.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\setup.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\test_attn_kvcache.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\test_flash_attn.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\test_kvcache.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\test_util.py -> build\lib.win-amd64-cpython-312\hopper
copying hopper\__init__.py -> build\lib.win-amd64-cpython-312\hopper
creating build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\ampere_helpers.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\blackwell_helpers.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\block_info.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\fast_math.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\flash_bwd.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\flash_bwd_postprocess.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\flash_bwd_preprocess.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\flash_fwd.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\flash_fwd_sm100.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\hopper_helpers.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\interface.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\mask.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\mma_sm100_desc.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\named_barrier.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\pack_gqa.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\pipeline.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\seqlen_info.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\softmax.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\tile_scheduler.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\utils.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
copying flash_attn\cute\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn\cute
creating build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\bench.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\bwd_prefill.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\bwd_prefill_fused.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\bwd_prefill_onekernel.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\bwd_prefill_split.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\bwd_ref.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\fp8.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\fwd_decode.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\fwd_prefill.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\fwd_ref.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\interface_fa.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\test.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\train.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\utils.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
copying flash_attn\flash_attn_triton_amd\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn\flash_attn_triton_amd
creating build\lib.win-amd64-cpython-312\flash_attn\layers
copying flash_attn\layers\patch_embed.py -> build\lib.win-amd64-cpython-312\flash_attn\layers
copying flash_attn\layers\rotary.py -> build\lib.win-amd64-cpython-312\flash_attn\layers
copying flash_attn\layers\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn\layers
creating build\lib.win-amd64-cpython-312\flash_attn\losses
copying flash_attn\losses\cross_entropy.py -> build\lib.win-amd64-cpython-312\flash_attn\losses
copying flash_attn\losses\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn\losses
creating build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\baichuan.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\bert.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\bigcode.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\btlm.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\falcon.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\gpt.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\gptj.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\gpt_neox.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\llama.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\opt.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\vit.py -> build\lib.win-amd64-cpython-312\flash_attn\models
copying flash_attn\models\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn\models
creating build\lib.win-amd64-cpython-312\flash_attn\modules
copying flash_attn\modules\block.py -> build\lib.win-amd64-cpython-312\flash_attn\modules
copying flash_attn\modules\embedding.py -> build\lib.win-amd64-cpython-312\flash_attn\modules
copying flash_attn\modules\mha.py -> build\lib.win-amd64-cpython-312\flash_attn\modules
copying flash_attn\modules\mlp.py -> build\lib.win-amd64-cpython-312\flash_attn\modules
copying flash_attn\modules\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn\modules
creating build\lib.win-amd64-cpython-312\flash_attn\ops
copying flash_attn\ops\activations.py -> build\lib.win-amd64-cpython-312\flash_attn\ops
copying flash_attn\ops\fused_dense.py -> build\lib.win-amd64-cpython-312\flash_attn\ops
copying flash_attn\ops\layer_norm.py -> build\lib.win-amd64-cpython-312\flash_attn\ops
copying flash_attn\ops\rms_norm.py -> build\lib.win-amd64-cpython-312\flash_attn\ops
copying flash_attn\ops\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn\ops
creating build\lib.win-amd64-cpython-312\flash_attn\utils
copying flash_attn\utils\benchmark.py -> build\lib.win-amd64-cpython-312\flash_attn\utils
copying flash_attn\utils\distributed.py -> build\lib.win-amd64-cpython-312\flash_attn\utils
copying flash_attn\utils\generation.py -> build\lib.win-amd64-cpython-312\flash_attn\utils
copying flash_attn\utils\library.py -> build\lib.win-amd64-cpython-312\flash_attn\utils
copying flash_attn\utils\pretrained.py -> build\lib.win-amd64-cpython-312\flash_attn\utils
copying flash_attn\utils\testing.py -> build\lib.win-amd64-cpython-312\flash_attn\utils
copying flash_attn\utils\torch.py -> build\lib.win-amd64-cpython-312\flash_attn\utils
copying flash_attn\utils\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn\utils
creating build\lib.win-amd64-cpython-312\flash_attn\ops\triton
copying flash_attn\ops\triton\cross_entropy.py -> build\lib.win-amd64-cpython-312\flash_attn\ops\triton
copying flash_attn\ops\triton\k_activations.py -> build\lib.win-amd64-cpython-312\flash_attn\ops\triton
copying flash_attn\ops\triton\layer_norm.py -> build\lib.win-amd64-cpython-312\flash_attn\ops\triton
copying flash_attn\ops\triton\linear.py -> build\lib.win-amd64-cpython-312\flash_attn\ops\triton
copying flash_attn\ops\triton\mlp.py -> build\lib.win-amd64-cpython-312\flash_attn\ops\triton
copying flash_attn\ops\triton\rotary.py -> build\lib.win-amd64-cpython-312\flash_attn\ops\triton
copying flash_attn\ops\triton\__init__.py -> build\lib.win-amd64-cpython-312\flash_attn\ops\triton
running build_ext
W0816 22:43:02.535000 35588 Lib\site-packages\torch\utils\cpp_extension.py:466] Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
building 'flash_attn_2_cuda' extension
creating C:\sources\Qwen1\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn
creating C:\sources\Qwen1\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\src
W0816 22:43:03.664000 35588 Lib\site-packages\torch\utils\cpp_extension.py:466] Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\_msvccompiler.py:12: UserWarning: _get_vc_env is private; find an alternative (pypa/distutils#340)
warnings.warn(
[1/73] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin\nvcc --generate-dependencies-with-compile --dependency-output C:\sources\Qwen1\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\src\flas
h_bwd_hdim128_bf16_causal_sm80.obj.d -std=c++17 -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /wd4624 -Xcompiler /w
d4067 -Xcompiler /wd4068 -Xcompiler /EHsc --use-local-env -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_
assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\sources\Qwen1\flash-attention\csrc\flash_attn -IC:\sources\Qwen1\flash-attention\csrc\flash_attn\src -IC:\sources\Qwen1\flash-attention\csrc\cutlass\
include -IC:\sources\Qwen1\.venv\Lib\site-packages\torch\include -IC:\sources\Qwen1\.venv\Lib\site-packages\torch\include\torch\csrc\api\include "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\include" -IC:\sources\Q
wen1\.venv\include -IC:\Users\fengluo\AppData\Local\Programs\Python\Python312\include -IC:\Users\fengluo\AppData\Local\Programs\Python\Python312\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14
.38.33130\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86
)\Windows Kits\10\include\10.0.26100.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26100.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26100.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10
\\include\10.0.26100.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26100.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" -c C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\fl
ash_bwd_hdim128_bf16_causal_sm80.cu -o C:\sources\Qwen1\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\src\flash_bwd_hdim128_bf16_causal_sm80.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSION
S__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSI
ONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: [code=4294967295] C:/sources/Qwen1/flash-attention/build/temp.win-amd64-cpython-312/Release/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.obj
[same nvcc command as printed above, repeated verbatim by ninja]
flash_bwd_hdim128_bf16_causal_sm80.cu
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_OPERATORS__' with '/U__CUDA_NO_HALF_OPERATORS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF_CONVERSIONS__' with '/U__CUDA_NO_HALF_CONVERSIONS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_HALF2_OPERATORS__' with '/U__CUDA_NO_HALF2_OPERATORS__'
cl : Command line warning D9025 : overriding '/D__CUDA_NO_BFLOAT16_CONVERSIONS__' with '/U__CUDA_NO_BFLOAT16_CONVERSIONS__'
[the same four D9025 warnings repeat for each of the remaining architecture passes]
C:/sources/Qwen1/flash-attention/csrc/cutlass/include\cutlass/exmy_base.h(404): error: namespace "cutlass::platform" has no member "is_unsigned_v"
static_assert(cutlass::platform::is_unsigned_v<Storage>, "Use an unsigned integer for StorageType");
^
C:/sources/Qwen1/flash-attention/csrc/cutlass/include\cutlass/exmy_base.h(404): error: type name is not allowed
static_assert(cutlass::platform::is_unsigned_v<Storage>, "Use an unsigned integer for StorageType");
^
C:/sources/Qwen1/flash-attention/csrc/cutlass/include\cutlass/exmy_base.h(404): error: expected an expression
static_assert(cutlass::platform::is_unsigned_v<Storage>, "Use an unsigned integer for StorageType");
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\softmax.h(76): warning #221-D: floating-point value does not fit in required floating-point type
const float max_scaled = max(mi) == -((float)(1e+300)) ? 0.f : max(mi) * (Scale_max ? scale : float(1.44269504088896340736));
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\softmax.h(111): warning #221-D: floating-point value does not fit in required floating-point type
const float max_scaled = max(mi) == -((float)(1e+300)) ? 0.f : max(mi) * scale;
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\softmax.h(156): warning #221-D: floating-point value does not fit in required floating-point type
: (row_max(mi) == -((float)(1e+300)) ? 0.0f : row_max(mi));
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\softmax.h(180): warning #221-D: floating-point value does not fit in required floating-point type
lse(mi) = (sum == 0.f || sum != sum) ? (Split ? -((float)(1e+300)) : ((float)(1e+300))) : row_max(mi) * softmax_scale + __logf(sum);
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\softmax.h(180): warning #221-D: floating-point value does not fit in required floating-point type
lse(mi) = (sum == 0.f || sum != sum) ? (Split ? -((float)(1e+300)) : ((float)(1e+300))) : row_max(mi) * softmax_scale + __logf(sum);
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\mask.h(31): warning #221-D: floating-point value does not fit in required floating-point type
tensor(mi, make_coord(j, nj)) = -((float)(1e+300));
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\mask.h(62): warning #221-D: floating-point value does not fit in required floating-point type
tensor(make_coord(i, mi), make_coord(j, nj)) = -((float)(1e+300));
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\mask.h(100): warning #221-D: floating-point value does not fit in required floating-point type
tensor(mi, ni) = -((float)(1e+300));
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\mask.h(160): warning #221-D: floating-point value does not fit in required floating-point type
if (col_idx >= max_seqlen_k) { tensor(mi, make_coord(j, nj)) = -((float)(1e+300)); }
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\mask.h(190): warning #221-D: floating-point value does not fit in required floating-point type
tensor(make_coord(i, mi), make_coord(j, nj)) = -((float)(1e+300));
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\mask.h(195): warning #221-D: floating-point value does not fit in required floating-point type
tensor(make_coord(i, mi), make_coord(j, nj)) = -((float)(1e+300));
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\mask.h(201): warning #221-D: floating-point value does not fit in required floating-point type
tensor(make_coord(i, mi), make_coord(j, nj)) = -((float)(1e+300));
^
C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\flash_bwd_kernel.h(409): warning #221-D: floating-point value does not fit in required floating-point type
lse(mi) = Is_even_MN || row < binfo.actual_seqlen_q - m_block * kBlockM ? gLSE(row) : ((float)(1e+300));
^
3 errors detected in the compilation of "C:/sources/Qwen1/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu".
[2/73] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin\nvcc --generate-dependencies-with-compile --dependency-output C:\sources\Qwen1\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\src\flas
h_bwd_hdim128_bf16_sm80.obj.d -std=c++17 -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /wd4624 -Xcompiler /wd4067 -
Xcompiler /wd4068 -Xcompiler /EHsc --use-local-env -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed
-Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -IC:\sources\Qwen1\flash-attention\csrc\flash_attn -IC:\sources\Qwen1\flash-attention\csrc\flash_attn\src -IC:\sources\Qwen1\flash-attention\csrc\cutlass\include
-IC:\sources\Qwen1\.venv\Lib\site-packages\torch\include -IC:\sources\Qwen1\.venv\Lib\site-packages\torch\include\torch\csrc\api\include "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\include" -IC:\sources\Qwen1\.v
env\include -IC:\Users\fengluo\AppData\Local\Programs\Python\Python312\include -IC:\Users\fengluo\AppData\Local\Programs\Python\Python312\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.331
30\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windo
ws Kits\10\include\10.0.26100.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26100.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26100.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\inclu
de\10.0.26100.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26100.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" -c C:\sources\Qwen1\flash-attention\csrc\flash_attn\src\flash_bwd
_hdim128_bf16_sm80.cu -o C:\sources\Qwen1\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\src\flash_bwd_hdim128_bf16_sm80.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOA
T16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-
constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda
FAILED: [code=4294967295] C:/sources/Qwen1/flash-attention/build/temp.win-amd64-cpython-312/Release/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.obj
[same nvcc command as printed above, repeated verbatim by ninja]
flash_bwd_hdim128_bf16_sm80.cu
[same D9025 warnings, the same three cutlass/exmy_base.h(404) errors, and the same #221-D warnings as above, this time for flash_bwd_hdim128_bf16_sm80.cu]
3 errors detected in the compilation of "C:/sources/Qwen1/flash-attention/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu".
[3/73] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\sources\Qwen1\flash-attention\csrc\flash_attn -IC:\sources\Qwen1\flash-attention\csrc\flash_attn\src -IC:\sources\Qwen1\flash-attention\csrc\cutlass\include -IC:\sour
ces\Qwen1\.venv\Lib\site-packages\torch\include -IC:\sources\Qwen1\.venv\Lib\site-packages\torch\include\torch\csrc\api\include "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\include" -IC:\sources\Qwen1\.venv\includ
e -IC:\Users\fengluo\AppData\Local\Programs\Python\Python312\include -IC:\Users\fengluo\AppData\Local\Programs\Python\Python312\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\include
" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10
\include\10.0.26100.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26100.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26100.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26
100.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.26100.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067
/wd4068 /EHsc -c C:\sources\Qwen1\flash-attention\csrc\flash_attn\flash_api.cpp /FoC:\sources\Qwen1\flash-attention\build\temp.win-amd64-cpython-312\Release\csrc\flash_attn\flash_api.obj -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=flash_attn_2_cuda /std:c++17
FAILED: [code=2] C:/sources/Qwen1/flash-attention/build/temp.win-amd64-cpython-312/Release/csrc/flash_attn/flash_api.obj
[same cl command as printed above, repeated verbatim by ninja]
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-std=c++17'
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2039: 'is_unsigned_v': is not a member of 'cutlass::platform'
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/integer_subbyte.h(235): note: see declaration of 'cutlass::platform'
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): note: the template instantiation context (the oldest one first) is
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(1211): note: see reference to class template instantiation 'cutlass::float_exmy_base<T,Derived>' being compiled
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(950): note: see reference to function template instantiation 'auto cutlass::detail::fp_encoding_selector<cutlass::detail::FpEncoding::E8M23>(void)' being compiled
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(860): note: see reference to class template instantiation 'cutlass::detail::FpBitRepresentation<uint32_t,32,8,23,cutlass::detail::NanInfEncoding::IEEE_754,true>' being compiled
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2065: 'is_unsigned_v': undeclared identifier
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2275: 'cutlass::detail::FpBitRepresentation<uint32_t,32,8,23,cutlass::detail::NanInfEncoding::IEEE_754,true>::Storage': expected an expression instead of a type
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2059: syntax error: ','
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2238: unexpected token(s) preceding ';'
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2275: 'cutlass::detail::FpBitRepresentation<uint8_t,8,4,3,cutlass::detail::NanInfEncoding::CANONICAL_ONLY,false>::Storage': expected an expression instead of a type
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2275: 'cutlass::detail::FpBitRepresentation<uint8_t,8,8,0,cutlass::detail::NanInfEncoding::CANONICAL_ONLY,false>::Storage': expected an expression instead of a type
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2275: 'cutlass::detail::FpBitRepresentation<uint8_t,4,2,1,cutlass::detail::NanInfEncoding::NONE,true>::Storage': expected an expression instead of a type
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2275: 'cutlass::detail::FpBitRepresentation<uint8_t,6,2,3,cutlass::detail::NanInfEncoding::NONE,true>::Storage': expected an expression instead of a type
C:\sources\Qwen1\flash-attention\csrc\cutlass\include\cutlass/exmy_base.h(404): error C2275: 'cutlass::detail::FpBitRepresentation<uint8_t,6,3,2,cutlass::detail::NanInfEncoding::NONE,true>::Storage': expected an expression instead of a type
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "C:\sources\Qwen1\.venv\Lib\site-packages\torch\utils\cpp_extension.py", line 2595, in _run_ninja_build
subprocess.run(
File "C:\Users\fengluo\AppData\Local\Programs\Python\Python312\Lib\subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '3']' returned non-zero exit status 2.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\sources\Qwen1\flash-attention\setup.py", line 526, in <module>
setup(
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\__init__.py", line 115, in setup
return distutils.core.setup(**attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\core.py", line 186, in setup
return run_commands(dist)
^^^^^^^^^^^^^^^^^^
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\core.py", line 202, in run_commands
dist.run_commands()
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\dist.py", line 1002, in run_commands
self.run_command(cmd)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\dist.py", line 1102, in run_command
super().run_command(command)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\dist.py", line 1021, in run_command
cmd_obj.run()
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\command\install.py", line 689, in run
self.run_command('build')
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\dist.py", line 1102, in run_command
super().run_command(command)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\dist.py", line 1021, in run_command
cmd_obj.run()
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\command\build.py", line 135, in run
self.run_command(cmd_name)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\cmd.py", line 357, in run_command
self.distribution.run_command(command)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\dist.py", line 1102, in run_command
super().run_command(command)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\dist.py", line 1021, in run_command
cmd_obj.run()
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\command\build_ext.py", line 96, in run
_build_ext.run(self)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\command\build_ext.py", line 368, in run
self.build_extensions()
File "C:\sources\Qwen1\.venv\Lib\site-packages\torch\utils\cpp_extension.py", line 1072, in build_extensions
build_ext.build_extensions(self)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\command\build_ext.py", line 484, in build_extensions
self._build_extensions_serial()
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\command\build_ext.py", line 510, in _build_extensions_serial
self.build_extension(ext)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\command\build_ext.py", line 261, in build_extension
_build_ext.build_extension(self, ext)
File "C:\Users\fengluo\AppData\Roaming\Python\Python312\site-packages\setuptools\_distutils\command\build_ext.py", line 565, in build_extension
objects = self.compiler.compile(
^^^^^^^^^^^^^^^^^^^^^^
File "C:\sources\Qwen1\.venv\Lib\site-packages\torch\utils\cpp_extension.py", line 1041, in win_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "C:\sources\Qwen1\.venv\Lib\site-packages\torch\utils\cpp_extension.py", line 2227, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "C:\sources\Qwen1\.venv\Lib\site-packages\torch\utils\cpp_extension.py", line 2612, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
It seems this other repo might have some prebuilt wheels soon, I'm hoping: mjun0812/flash-attention-prebuild-wheels#28 (comment)
thanks, it works!
This worked for me:
flash_attn==2.7.3
torch==2.6.0
transformers==4.55.2
You can also try https://github.com/Dao-AILab/flash-attention/discussions/1838 and see if it works for you.
Using `pip install "flash_attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"` worked for me (after installing PyTorch 2.8 with CUDA 12.8 support).
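When picking one of these prebuilt wheels, the filename tags have to match your environment (CUDA series, torch version, CPython ABI). A quick sanity-check sketch — the regex below is ad hoc, written for flash-attn's release naming, not the general PEP 427 wheel grammar:

```python
import re
import sys

# Example wheel name from the release link above.
wheel = "flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"

# Pull out the CUDA/torch local-version tags and the CPython ABI tag.
m = re.match(
    r"flash_attn-(?P<ver>[\d.]+)\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
    r"cxx11abi(?P<abi>TRUE|FALSE)-(?P<py>cp\d+)-",
    wheel,
)
tags = m.groupdict()
print(tags)  # e.g. {'ver': '2.8.3', 'cuda': '12', 'torch': '2.8', ...}

# The interpreter tag must match the running Python (cp312 == CPython 3.12).
local_py = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(tags["py"], "matches this interpreter:", tags["py"] == local_py)
```

If the `cp` tag or the `torch` tag doesn't match your venv, pip will either refuse the wheel or you'll hit an ABI error at import time, which is when falling back to a source build (and this issue) happens.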
> flash_attn==2.7.3 torch==2.6.0 transformers==4.55.2

Can you share your torchvision version? The CUDA and Python versions too.
This works for me: I used the Docker image pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel from Docker Hub and built my container on top of it for use on different GPUs, including the RTX 5090. Maybe this helps.
```dockerfile
ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0;12.0+PTX" \
    CUDAARCHS="80;86;89;90" \
    FLASH_ATTENTION_FORCE_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0;12.0"
```
and, while building flash-attn from source:
```dockerfile
RUN FORCE_CUDA=1 TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}" \
    CUDAARCHS="${CUDAARCHS}" \
    FLASH_ATTENTION_FORCE_CUDA_ARCH_LIST="${FLASH_ATTENTION_FORCE_CUDA_ARCH_LIST}" \
    pip install --no-binary flash-attn flash-attn==2.8.3 --no-build-isolation
```
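The arch list above decides which GPU targets get compiled in: each plain entry produces native code for that compute capability, and a `+PTX` entry additionally embeds PTX that the driver can JIT-compile for newer GPUs. The RTX 5090 (compute capability 12.0) is covered here by the `12.0+PTX` entry. A small sketch of that coverage rule — the parsing below is my own illustration, not flash-attn's or PyTorch's actual logic:

```python
def covers(arch_list: str, cc: str) -> bool:
    """Return True if a TORCH_CUDA_ARCH_LIST-style string covers
    compute capability `cc`: either an exact arch match, or a
    lower-or-equal `+PTX` entry the driver can JIT forward."""
    for entry in (e.strip() for e in arch_list.split(";")):
        ptx = entry.endswith("+PTX")
        base = entry[:-4] if ptx else entry
        if base == cc:
            return True  # native SASS compiled for this arch
        if ptx and float(base) <= float(cc):
            return True  # PTX embedded; JIT-compilable for newer arch
    return False

arch_list = "8.0;8.6;8.9;9.0;12.0+PTX"
print(covers(arch_list, "12.0"))  # RTX 5090: True, via 12.0+PTX
print(covers(arch_list, "7.5"))   # Turing: False, not in the list
```

This is why builds that omit `12.0` (or any `+PTX` entry at or below it) produce a library that imports fine but fails at kernel launch on Blackwell cards.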
> Using `pip install "flash_attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"` worked for me (after installing pytorch 2.8 with CUDA 12.8 support)
It works for me too!