Fixes MX formats build for Blackwell
When `TORCH_CUDA_ARCH_LIST` does not include 9.0a, setup.py first compiles the files in `sources` and then the files in `cutlass_90a_sources`, and the second link step overwrites the `_C.abi3.so` produced by the first:
[1/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
[2/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
[3/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/fp6_llm/fp6_linear.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/fp6_llm/fp6_linear.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/fp6_llm/fp6_linear.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
[4/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 905; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 919; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 1057; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
ptxas /tmp/tmpxft_00016e89_00000000-6_mx_fp_cutlass_kernels.compute_90.ptx, line 1071; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_100f/sm_101a/sm_101f/sm_103a/sm_103f' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures
[5/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
[6/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
[7/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90
aarch64-linux-gnu-g++ -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -shared -Wl,-O1 -Wl,-Bsymbolic-functions /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/fp6_llm/fp6_linear.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/marlin_qqq/marlin_qqq_kernel.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s4s4.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_cutlass/rowwise_scaled_linear_cutlass_s8s4.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/sparse_marlin/marlin_kernel_nm.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.o -L/usr/local/lib/python3.12/dist-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib/aarch64-linux-gnu -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so
building 'torchao._C' extension
creating /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24
creating /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass
creating /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x
[1/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
[2/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24/sparsify24.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/activation24/sparsify24.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24/sparsify24.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
[3/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
[4/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
[5/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
[6/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
[7/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.o.d -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /opt/pytorch/ao/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.cu -o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -t=0 -std=c++17 -DTORCHAO_USE_CUTLASS -I/opt/pytorch/ao/third_party/cutlass/include -I/opt/pytorch/ao/third_party/cutlass/tools/util/include -I/opt/pytorch/ao/torchao/csrc/cuda -DCUTE_USE_PACKED_TUPLE=1 -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --ftemplate-backtrace-limit=0 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_100a,code=sm_100a -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_90a,code=sm_90a -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DTORCH_EXTENSION_NAME=_C
aarch64-linux-gnu-g++ -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -shared -Wl,-O1 -Wl,-Bsymbolic-functions /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/activation24/sparsify24.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e4m3.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e4m3e5m2.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e4m3.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_e5m2e5m2.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/rowwise_scaled_linear_sparse_cutlass/rowwise_scaled_linear_sparse_cutlass_f8f8.o /opt/pytorch/ao/build/temp.linux-aarch64-cpython-312/torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/to_sparse_semi_structured_cutlass_sm9x_f8.o -L/usr/local/lib/python3.12/dist-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib/aarch64-linux-gnu -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so
copying build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so -> torchao
copying build/lib.linux-aarch64-cpython-312/torchao/_C.abi3.so -> torchao
As a result of the overwrite, the MXFP4_CUTLASS tests fail with a `NotImplementedError` for the CUDA backend. This PR proposes building a separate `_C_cutlass_90a` extension so that `_C.abi3.so` is not overwritten; it also preserves the old behavior that only the files that need 90a are compiled with 90a.
The same is done for files that need to be compiled with 100a: currently, if mx_fp_cutlass_kernels is not compiled with 100a, we see a "Cutlass cannot run" error on an SM 100 machine. A sketch of the resulting setup.py structure is shown below.
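For illustration, here is a minimal sketch of what the split could look like in setup.py. The source lists are abbreviated, the flags are illustrative, and the `_C_cutlass_100a` name is an assumption for this example; the actual PR may differ in names and details.

```python
# Sketch only (illustrative names and flags): build the arch-specific CUTLASS
# kernels as separate extension modules so each link step produces its own .so
# instead of overwriting torchao/_C.abi3.so.
from torch.utils.cpp_extension import CUDAExtension

common_nvcc_flags = ["-O3", "-std=c++17", "-DTORCHAO_USE_CUTLASS"]

# Source groupings taken from the build log above; the real lists are longer.
sources = ["torchao/csrc/cuda/fp6_llm/fp6_linear.cu"]
cutlass_90a_sources = [
    "torchao/csrc/cuda/to_sparse_semi_structured_cutlass_sm9x/"
    "to_sparse_semi_structured_cutlass_sm9x_f8.cu",
]
mx_cutlass_100a_sources = ["torchao/csrc/cuda/mx_kernels/mx_fp_cutlass_kernels.cu"]

ext_modules = [
    # Baseline kernels, built for whatever TORCH_CUDA_ARCH_LIST requests.
    CUDAExtension(
        name="torchao._C",
        sources=sources,
        extra_compile_args={"nvcc": common_nvcc_flags},
    ),
    # Kernels that need sm_90a features get their own module, so the
    # _C.abi3.so produced above is left untouched.
    CUDAExtension(
        name="torchao._C_cutlass_90a",
        sources=cutlass_90a_sources,
        extra_compile_args={
            "nvcc": common_nvcc_flags + ["-gencode=arch=compute_90a,code=sm_90a"]
        },
    ),
    # Likewise for kernels that need sm_100a (e.g. mx_fp_cutlass_kernels).
    CUDAExtension(
        name="torchao._C_cutlass_100a",
        sources=mx_cutlass_100a_sources,
        extra_compile_args={
            "nvcc": common_nvcc_flags + ["-gencode=arch=compute_100a,code=sm_100a"]
        },
    ),
]
```

Each group then links into its own shared object, so the plain `_C.abi3.so` from the first link step is no longer clobbered.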
CC: @drisspg @vkuzo @msaroufim
This means we end up shipping 3 more .so files, right? I wonder if there isn't a better way to do this by including the files in the extension module but allowing per-source flags.
I agree with that, but it seems like there currently isn't a way to specify per-source flags using torch cpp_extensions. There is some machinery in CMake: https://github.com/pytorch/pytorch/blob/2362bd4a4c9bb909eb8265d3ca9854b5ec07867e/cmake/Codegen.cmake#L93-L126, and it would be nice if we could do something similar in CUDAExtension.
@syed-ahmed I got frustrated with cpp extensions and started working on https://github.com/pytorch/ao/pull/1659, but never got the bandwidth to drive it home.
Sounds good! Thanks.
Can you rebase this? We should land it.
Closing, has been merged in https://github.com/pytorch/ao/pull/2278.