
Fixes MX formats build for Blackwell

Open syed-ahmed opened this issue 7 months ago • 6 comments

When TORCH_CUDA_ARCH_LIST doesn't have 9.0a, setup.py first compiles the files in sources and then compiles the files in cutlass_90a_sources, overwriting the _C.abi3.so that was produced before:

    [1/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [2/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [3/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/fp6_llm/fp6_linear.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/fp6_llm/fp6_linear.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/fp6_llm/fp6_linear.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [4/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 905; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
    ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 919; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
    ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 1057; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
    ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 1071; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
    [5/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [6/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    [7/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
    aarch64-linux-gnu-g++ -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -shared -Wl,-O1 -Wl,-Bsymbolic-functions /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/fp6_llm/fp6_linear.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.o -L/usr/local/lib/python3.12/dist-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib/aarch64-linux-gnu -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so
    building 'torchao._C' extension
    creating /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24
    creating /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass
    creating /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x
    [1/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [2/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24/sparsify24.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/activation24/sparsify24.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24/sparsify24.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [3/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [4/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [5/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [6/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    [7/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
    aarch64-linux-gnu-g++ -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -shared -Wl,-O1 -Wl,-Bsymbolic-functions /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24/sparsify24.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.o -L/usr/local/lib/python3.12/dist-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib/aarch64-linux-gnu -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so
    copying build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so -> torchao
    copying build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so -> torchao

As a result of the overwrite, we get a NotImplementedError for the CUDA backend in the MXFP4_CUTLASS tests. In this PR, we propose building a separate _C_cutlass_90a extension, so that _C.abi3.so is not overwritten. This also preserves the old behavior that only the files that need 90a are compiled with 90a.

Similarly, the same is done for files that need to be compiled with 100a. Currently, if we don't compile mx_fp_cutlass_kernels with 100a, we see a "CUTLASS cannot run" error on an SM 100 machine.
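To illustrate the proposed split, here is a minimal sketch of the partitioning logic (the function name `split_cuda_extensions` and the exact argument names are illustrative, not torchao's actual setup.py): base sources keep building into `_C`, while arch-specific CUTLASS kernels go into their own modules only when TORCH_CUDA_ARCH_LIST requests the matching "a" variant.

```python
# Hypothetical sketch of splitting CUDA sources into separate extension
# modules so arch-specific kernels no longer overwrite _C.abi3.so.
# Names here are illustrative; torchao's real setup.py differs in detail.

def gencode(arch: str) -> list:
    """Build the -gencode flag for a compute capability suffix, e.g. '90a'."""
    return [f"-gencode=arch=compute_{arch},code=sm_{arch}"]

def split_cuda_extensions(sources, cutlass_90a_sources,
                          cutlass_100a_sources, arch_list):
    """Return (extension_name, sources, nvcc_flags) triples.

    Base sources build into `_C`; 90a/100a CUTLASS kernels build into
    `_C_cutlass_90a` / `_C_cutlass_100a` only when the requested
    TORCH_CUDA_ARCH_LIST includes the matching "a" architecture.
    """
    exts = [("_C", list(sources), gencode("90"))]
    if "9.0a" in arch_list and cutlass_90a_sources:
        exts.append(("_C_cutlass_90a", list(cutlass_90a_sources),
                     gencode("90a")))
    if "10.0a" in arch_list and cutlass_100a_sources:
        exts.append(("_C_cutlass_100a", list(cutlass_100a_sources),
                     gencode("100a")))
    return exts
```

Each triple would then be passed to `torch.utils.cpp_extension.CUDAExtension` as a distinct module, so the linker emits distinct `.so` files instead of writing `_C.abi3.so` twice.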

CC: @drisspg @vkuzo @msaroufim

syed-ahmed avatar May 15 '25 18:05 syed-ahmed

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2214

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit f4239154a428e787864e8bf6f16b5639bb0bd028 with merge base 554cb60c750e6ef31bbcafec74bb76a4578902da (image): :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar May 15 '25 18:05 pytorch-bot[bot]

This means we end up shipping 3 more .so files, right? I wonder if there isn't a better way to do this by including the files in the extension module but allowing per-source flags.

drisspg avatar May 15 '25 18:05 drisspg

This means we end up shipping 3 more .so files, right? I wonder if there isn't a better way to do this by including the files in the extension module but allowing per-source flags.

I agree with that, but it seems there currently isn't a way to specify per-source flags using torch cpp_extensions. There is some machinery in cmake: https://github.com/pytorch/pytorch/blob/2362bd4a4c9bb909eb8265d3ca9854b5ec07867e/cmake/Codegen.cmake#L93-L126 and it would be nice if we could do something similar in CUDAExtension.
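For a sense of what "per-source flags" could look like if CUDAExtension supported it, here is a hypothetical sketch, similar in spirit to the per-file compile options in PyTorch's cmake/Codegen.cmake. The mapping and helper below are illustrative assumptions, not an existing API:

```python
# Hypothetical per-source nvcc flag lookup: a single extension whose build
# step appends extra flags keyed by source file name. CUDAExtension does not
# currently offer this; the mapping is illustrative only.

PER_SOURCE_FLAGS = {
    "mx_fp_cutlass_kernels.cu": [
        "-gencode=arch=compute_100a,code=sm_100a"
    ],
    "rowwise_scaled_linear_sparse_cutlass_f8f8.cu": [
        "-gencode=arch=compute_90a,code=sm_90a"
    ],
}

def nvcc_flags_for(source_path: str, base_flags: list) -> list:
    """Append source-specific gencode flags to the shared nvcc flag list."""
    filename = source_path.rsplit("/", 1)[-1]
    return base_flags + PER_SOURCE_FLAGS.get(filename, [])
```

With something like this, all objects could still be linked into one `_C.abi3.so`, while only the files that need sm_90a/sm_100a instructions get those gencode flags.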

syed-ahmed avatar May 15 '25 19:05 syed-ahmed

@syed-ahmed I got frustrated with cpp extensions and started working on https://github.com/pytorch/ao/pull/1659, but never got the bandwidth to drive it home.

drisspg avatar May 15 '25 19:05 drisspg

Sounds good! Thanks.

syed-ahmed avatar May 19 '25 17:05 syed-ahmed

Can you rebase this? We should land it.

drisspg avatar May 30 '25 05:05 drisspg

Closing, has been merged in https://github.com/pytorch/ao/pull/2278.

syed-ahmed avatar Jun 06 '25 20:06 syed-ahmed