How to use float8 with SM89 hardware - i.e. NVIDIA A6000 ADA?
I am running torchao: 0.5 and torch: '2.5.0a0+b465a5843b.nv24.09' on an NVIDIA A6000 ADA card (sm89) which supports FP8.
I ran the generate.py code from the benchmark:
python generate.py --checkpoint_path $CHECKPOINT_PATH --compile --compile_prefill --write_result /root/benchmark_results__baseline.txt
Average tokens/sec: 57.01 Average Bandwidth: 855.74 GB/s Peak Memory Usage: 16.19 GB Model Size: 15.01 GB
20241011143042, tok/s= 57.01, mem/s= 855.74 GB/s, peak_mem=16.19 GB, model_size=15.01 GB quant: None, mod: Meta-Llama-3-8B, kv_quant: False, compile: True, compile_prefill: True, dtype: torch.bfloat16, device: cuda repro: python generate.py --checkpoint_path /models/Meta-Llama-3-8B/consolidated.00.pth --device cuda --precision torch.bfloat16 --compile --compile_prefill --num_samples 5 --max_new_tokens 200 --top_k 200 --temperature 0.8
python generate.py --checkpoint_path $CHECKPOINT_PATH --compile --compile_prefill --quantization float8wo --write_result /root/benchmark_results__float8wo.txt
Average tokens/sec: 57.00 Average Bandwidth: 855.62 GB/s Peak Memory Usage: 16.19 GB Model Size: 15.01 GB
20241011143316, tok/s= 57.00, mem/s= 855.62 GB/s, peak_mem=16.19 GB, model_size=15.01 GB quant: float8wo, mod: Meta-Llama-3-8B, kv_quant: False, compile: True, compile_prefill: True, dtype: torch.bfloat16, device: cuda repro: python generate.py --quantization float8wo --checkpoint_path /models/Meta-Llama-3-8B/consolidated.00.pth --device cuda --precision torch.bfloat16 --compile --compile_prefill --num_samples 5 --max_new_tokens 200 --top_k 200 --temperature 0.8
The float8wo flag does not appear to be doing anything. Am I missing a step? Thanks!
Updated torch to the nightly, same results as above...
What kind of model is this? If it's a memory-bound workload like small-batch-size inference, I wouldn't expect fp8-wo to have a dramatic impact, and it might be best to try float8 dynamic quant.
From twitter it seems like you're interested in exploring distributed training, in which case you can find instructions for running the training benchmarks at https://github.com/pytorch/ao/tree/main/torchao/float8
EDIT: It seems @vgoklani is interested in both distributed training and inference. They see a speedup at small batch size with TE, and also with torch._scaled_mm, so my best guess is that this is just poor autotuning under compile, or that the right inductor flags weren't set.
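The memory-bound point above can be checked against the two result lines: in both runs the reported bandwidth is roughly tok/s times model size, and model_size stays at 15.01 GB with float8wo. The sketch below redoes that arithmetic in plain Python. The helper names are made up for illustration, and the halved model size is a hypothetical upper bound (it assumes every weight byte is stored in 8 bits and that achieved bandwidth stays the same), not a measured result.

```python
# Figures copied from the two benchmark result lines in the original post.
runs = {
    "baseline": {"tok_s": 57.01, "bandwidth_gb_s": 855.74, "model_size_gb": 15.01},
    "float8wo": {"tok_s": 57.00, "bandwidth_gb_s": 855.62, "model_size_gb": 15.01},
}


def implied_bandwidth(tok_s: float, model_size_gb: float) -> float:
    """Bandwidth if every generated token reads the whole model once."""
    return tok_s * model_size_gb


def projected_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Tokens/sec for a purely memory-bound decode at a given bandwidth."""
    return bandwidth_gb_s / model_size_gb


for name, r in runs.items():
    bw = implied_bandwidth(r["tok_s"], r["model_size_gb"])
    print(f"{name}: implied {bw:.2f} GB/s vs reported {r['bandwidth_gb_s']:.2f} GB/s")

# Hypothetical: weights stored in 8 bits instead of 16, same achieved bandwidth.
half_size_gb = runs["baseline"]["model_size_gb"] / 2
estimate = projected_tok_s(runs["baseline"]["bandwidth_gb_s"], half_size_gb)
print(f"hypothetical tok/s at {half_size_gb:.3f} GB: {estimate:.1f}")
```

If weight-only quantization had taken effect, one would expect the reported model_size to drop; an unchanged 15.01 GB may mean the weights were never converted, though that is only a guess from these numbers.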
Now that https://github.com/pytorch/pytorch/pull/144348 has landed (thanks @alexsamardzic), we should have support for this. I am going to close this for now, but if you run into any issues, @vgoklani, feel free to reopen.
Hi @drisspg, thanks for following up! There are a few different open issues. What exactly does this solve: f8f8bf16_rowwise()?
Is this for supporting row-wise scaling on SM89 (Ada)? Is there a quick test I could run to validate? Thanks!
Yes, it should support row-wise scaling on SM89; you can try the following to validate it:
pytest -k test_tensorwise_scaling test/inductor/test_fp8.py
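Before running the test, it can help to confirm that the device actually reports compute capability 8.9 and to note the exact torch build. The snippet below is a minimal sketch: `has_fp8_tensor_cores` and `describe_device` are hypothetical helper names, and the `>= (8, 9)` rule is a simplification that assumes FP8 tensor-core support starts at Ada (sm89) and continues on later architectures. It falls back to a message if torch or a CUDA device is not available.

```python
def has_fp8_tensor_cores(major: int, minor: int) -> bool:
    """Simplified rule (assumption): FP8 matmul hardware starts at sm89."""
    return (major, minor) >= (8, 9)


def describe_device() -> str:
    """Report the compute capability of the current CUDA device, if any."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    major, minor = torch.cuda.get_device_capability()
    status = "expected" if has_fp8_tensor_cores(major, minor) else "not expected"
    return f"sm{major}{minor}, torch {torch.__version__}: FP8 support {status}"


print(describe_device())
```

Including this line of output alongside the pytest results makes it easier to compare a local SM89 run with what CI reports.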
thanks @alexsamardzic
I ran this on the latest nightly - torch.__version__ == 2.7.0.dev20250122+cu126
FAILED [1.3825s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_acceptable_input_dims_M_1024_K_16_N_16_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [2.9496s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_acceptable_input_dims_M_1024_K_16_N_2048_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.9715s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_acceptable_input_dims_M_1_K_16_N_2048_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.3421s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_acceptable_input_dims_M_257_K_16_N_16_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [3.1700s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_acceptable_input_dims_M_257_K_16_N_2048_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.3636s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_acceptable_input_dims_M_33_K_16_N_16_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [3.1565s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_acceptable_input_dims_M_33_K_16_N_2048_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.1833s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_acceptable_input_dims_M_3_K_16_N_16_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [2.0704s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_acceptable_input_dims_M_3_K_16_N_2048_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [6.3882s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_bfloat16_shape_1024,1024,512_has_bias_False_use_fast_accum_False_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [6.5399s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_bfloat16_shape_1024,1024,512_has_bias_True_use_fast_accum_False_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.1952s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_bfloat16_shape_16,16,32_has_bias_False_use_fast_accum_False_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.4239s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_bfloat16_shape_16,16,32_has_bias_False_use_fast_accum_True_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.2504s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_bfloat16_shape_16,16,32_has_bias_True_use_fast_accum_False_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.3897s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_bfloat16_shape_16,16,32_has_bias_True_use_fast_accum_True_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [6.7536s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_float32_shape_1024,1024,512_has_bias_False_use_fast_accum_False_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.1833s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_float32_shape_16,16,32_has_bias_False_use_fast_accum_False_persistent_matmul_False - AssertionError: Tensor-likes are not close!
FAILED [1.4067s] test_fp8.py::TestFP8Lowering::test_tensorwise_scaling_float32_shape_16,16,32_has_bias_False_use_fast_accum_True_persistent_matmul_False - AssertionError: Tensor-likes are not close!
Also, this still does not work: https://github.com/pytorch/pytorch/issues/130359
import torch
a_data = torch.ones(32, 128, device="cuda").to(torch.float8_e4m3fn).contiguous()
b_data = torch.ones(64, 128, device="cuda").to(torch.float8_e4m3fn).t()
a_scale = torch.ones(a_data.shape[0], 1, device="cuda", dtype=torch.float32)
b_scale = torch.ones(1, b_data.shape[1], device="cuda", dtype=torch.float32)
torch._scaled_mm(
    a_data,
    b_data,
    scale_a=a_scale,
    scale_b=b_scale,
    out_dtype=torch.bfloat16,
)
Thanks!
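For reference, here is what the repro above would be expected to compute if it ran, written in plain Python so it needs no GPU. This is only a sketch under an assumption: that row-wise scaling multiplies row i of the product by `scale_a[i]` and column j by `scale_b[j]`. The function name `scaled_mm_reference` is hypothetical and is not part of torch.

```python
def scaled_mm_reference(a, b, scale_a, scale_b):
    """Row-wise scaled matmul on nested lists.

    a: M x K, b: K x N, scale_a: M per-row scales, scale_b: N per-column scales.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            out[i][j] = acc * scale_a[i] * scale_b[j]
    return out


# Same shapes and values as the repro: a is 32 x 128, b is 128 x 64, all ones.
M, K, N = 32, 128, 64
a = [[1.0] * K for _ in range(M)]
b = [[1.0] * N for _ in range(K)]
out = scaled_mm_reference(a, b, [1.0] * M, [1.0] * N)

print(len(out), len(out[0]), out[0][0])  # prints: 32 64 128.0
```

With all-ones inputs and unit scales, every entry is the inner dimension K = 128, which is exactly representable in both float8_e4m3fn and bfloat16, so a working kernel should match this without any rounding error.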
Admittedly, I developed this PR without having access to SM89 hardware, instead relying solely on PyTorch CI for testing. There is an SM89 machine in the mix there, and CI was passing... Can you let me know what problem is reported for the code at the bottom of your message?
Hi @alexsamardzic, I'm happy to help test anything on SM89 hardware; just ping me.
Here is the output from the above script running on the latest torch nightly:
Out[3]: void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
[The same "not implemented" message is repeated many more times in the output.]
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
void cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 8, 32>, 32, cutlass::float_e4m3_t, cutlass::layout::RowMajor, cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, Operator_>::operator()(cutlass::Array<float, 4, true> &, const cutlass::Array<cutlass::float_e4m3_t, 16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Array<float, 4, true> &) const [with Operator_ = cutlass::arch::OpMultiplyAdd] not implemented
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File /usr/local/lib/python3.12/dist-packages/IPython/core/formatters.py:770, in PlainTextFormatter.__call__(self, obj)
763 stream = StringIO()
764 printer = pretty.RepresentationPrinter(stream, self.verbose,
765 self.max_width, self.newline,
766 max_seq_length=self.max_seq_length,
767 singleton_pprinters=self.singleton_printers,
768 type_pprinters=self.type_printers,
769 deferred_pprinters=self.deferred_printers)
--> 770 printer.pretty(obj)
771 printer.flush()
772 return stream.getvalue()
File /usr/local/lib/python3.12/dist-packages/IPython/lib/pretty.py:419, in RepresentationPrinter.pretty(self, obj)
408 return meth(obj, self, cycle)
409 if (
410 cls is not object
411 # check if cls defines __repr__
(...)
417 and callable(_safe_getattr(cls, "__repr__", None))
418 ):
--> 419 return _repr_pprint(obj, self, cycle)
421 return _default_pprint(obj, self, cycle)
422 finally:
File /usr/local/lib/python3.12/dist-packages/IPython/lib/pretty.py:794, in _repr_pprint(obj, p, cycle)
792 """A pprint that just redirects to the normal repr function."""
793 # Find newlines and replace them with p.break_()
--> 794 output = repr(obj)
795 lines = output.splitlines()
796 with p.group():
File /usr/local/lib/python3.12/dist-packages/torch/_tensor.py:590, in Tensor.__repr__(self, tensor_contents)
586 return handle_torch_function(
587 Tensor.__repr__, (self,), self, tensor_contents=tensor_contents
588 )
589 # All strings are unicode in Python 3.
--> 590 return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File /usr/local/lib/python3.12/dist-packages/torch/_tensor_str.py:704, in _str(self, tensor_contents)
702 with torch.no_grad(), torch.utils._python_dispatch._disable_current_modes():
703 guard = torch._C._DisableFuncTorch() # noqa: F841
--> 704 return _str_intern(self, tensor_contents=tensor_contents)
File /usr/local/lib/python3.12/dist-packages/torch/_tensor_str.py:621, in _str_intern(inp, tensor_contents)
619 tensor_str = _tensor_str(self.to_dense(), indent)
620 else:
--> 621 tensor_str = _tensor_str(self, indent)
623 if self.layout != torch.strided:
624 suffixes.append("layout=" + str(self.layout))
File /usr/local/lib/python3.12/dist-packages/torch/_tensor_str.py:353, in _tensor_str(self, indent)
349 return _tensor_str_with_formatter(
350 self, indent, summarize, real_formatter, imag_formatter
351 )
352 else:
--> 353 formatter = _Formatter(get_summarized_data(self) if summarize else self)
354 return _tensor_str_with_formatter(self, indent, summarize, formatter)
File /usr/local/lib/python3.12/dist-packages/torch/_tensor_str.py:389, in get_summarized_data(self)
387 start = [self[i] for i in range(0, PRINT_OPTS.edgeitems)]
388 end = [self[i] for i in range(len(self) - PRINT_OPTS.edgeitems, len(self))]
--> 389 return torch.stack([get_summarized_data(x) for x in (start + end)])
390 else:
391 return torch.stack([get_summarized_data(x) for x in self])
File /usr/local/lib/python3.12/dist-packages/torch/_tensor_str.py:379, in get_summarized_data(self)
377 if dim == 1:
378 if self.size(0) > 2 * PRINT_OPTS.edgeitems:
--> 379 return torch.cat(
380 (self[: PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems :])
381 )
382 else:
383 return self
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Thanks for providing the details. Unfortunately, I don't immediately see what could be causing the error.
My only guess at the moment is that the PyTorch build you are using was not compiled for SM89. Is "sm_89" in the list if you run the following in Python:
import torch
torch.cuda.get_arch_list()
If that is not the issue: would it be possible for you to provide the output of the nvidia-smi and dmesg | grep GPU commands from your shell? Also, if possible, the output of the same script when run with CUDA_LAUNCH_BLOCKING=1 python script_name.py?
Are you perhaps familiar with building PyTorch from source, so that you could try making some changes in the code and possibly pinpoint the cause of the error?
Hi @alexsamardzic
import torch
torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'compute_90']
I didn't realize that pytorch has to be specifically built for SM89 support... I'm confused as I've been running several float8 functions, e.g., float8_dynamic_activation_float8_weight which calls is_sm_at_least_89 and it runs correctly. We also use float8 for training. The server has 4x NVIDIA RTX 6000 Ada Cards which are all SM89.
Are there instructions for building PyTorch to support SM89 (or to get sm_89 to show up in the list of architectures)? Thanks!
Hmm, then I guess an issue should be opened to ask why the nightly is not built with SM89 support.
Building with SM89 support means that the GPU kernels that PyTorch itself provides, and that use SM89-specific features, get enabled; the kernel implementing torch._scaled_mm for FP8 inputs is one such kernel. is_sm_at_least_89 is just a query function, not a kernel that runs on the GPU, so it is expected to work. float8_dynamic_activation_float8_weight is a function performing quantization, and is probably implemented as a GPU kernel, but while it builds float8 values, it doesn't do calculations with them, so the GPU code built for SM80/SM86 probably gets used, and it may work fine too. In principle, there could even be some kernels provided by PyTorch that do matrix multiplication over float8 values and would work on an SM89 GPU: PyTorch uses external libraries, primarily cuBLAS, for some of its kernels, and these libraries are compiled outside of PyTorch and may have SM89-specific code enabled when built. So the problem is just that this particular kernel is provided by PyTorch itself in source form, but SM89 is not enabled as a target when the nightly is built.
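To make the distinction above concrete, here is a minimal sketch that compares what the GPU reports with what the installed build was compiled for. The helper names compiled_for and report are hypothetical, introduced only for illustration; the only PyTorch calls used are torch.cuda.get_device_capability() and torch.cuda.get_arch_list(). It only checks for a native sm_XY entry and says nothing about kernels that come from external libraries such as cuBLAS.

```python
def compiled_for(capability, arch_list):
    """Return True if arch_list has a native sm_XY entry for this capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report(capability, arch_list):
    """Describe whether the build contains native code for the given GPU."""
    target = "sm_{}{}".format(*capability)
    if compiled_for(capability, arch_list):
        return f"{target}: native kernels are included in this build"
    return (
        f"{target}: not in the build's arch list; kernels compiled from "
        "PyTorch's own sources for this target are unavailable"
    )


if __name__ == "__main__":
    # torch is optional here so the helpers can be tried without a GPU.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        print(report(torch.cuda.get_device_capability(), torch.cuda.get_arch_list()))
```

With the arch list posted above and a capability of (8, 9), compiled_for returns False, even though capability queries such as is_sm_at_least_89 succeed on the same machine.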
Instructions to build PyTorch from source are here. At some point, they mention that "Other potentially useful environment variables may be found in setup.py". The relevant environment variables to set in your case may be (written for the Bash shell):
export CMAKE_BUILD_TYPE=Release
export USE_CUDA=1
export TORCH_CUDA_ARCH_LIST="8.9"
To speed up compilation, you may also want to set:
export USE_PRECOMPILED_HEADERS=1
export USE_PER_OPERATOR_HEADERS=1
export CCACHE_COMPRESS=true
export CCACHE_SLOPPINESS=pch_defines,time_macros
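If you would rather derive the TORCH_CUDA_ARCH_LIST value from the GPUs actually installed than type it by hand, something like the following sketch could be used. The helper name arch_list_value is hypothetical; it assumes the semicolon-separated "major.minor" format that TORCH_CUDA_ARCH_LIST accepts, and uses only torch.cuda.device_count() and torch.cuda.get_device_capability().

```python
def arch_list_value(capabilities):
    """Format (major, minor) pairs as a TORCH_CUDA_ARCH_LIST string, e.g. '8.6;8.9'."""
    unique = sorted(set(capabilities))  # drop duplicates from identical cards
    return ";".join(f"{major}.{minor}" for major, minor in unique)


if __name__ == "__main__":
    # torch is optional here so the helper can be tried without a GPU.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        caps = [
            torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())
        ]
        print(f'export TORCH_CUDA_ARCH_LIST="{arch_list_value(caps)}"')
```

On a server with four SM89 cards this produces the same "8.9" value as the export line above; listing only the architectures you need also keeps the build time down.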
Thanks @alexsamardzic
I also tried the stable torch build and it didn't include sm_89.
I just asked @ptrblck: https://discuss.pytorch.org/t/sm-89-not-listed-in-the-torch-cuda-get-arch-list/215827/3
you don’t need to build for sm_89 as it’s binary compatible with sm_86/sm_80
I've seen that answer mentioned in a few places, but of course, sm86 does not support float8/FP8
I'm super-confused, and not sure what to do next... building pytorch sounds like a lot of work, and we are super-busy here...
Yeah, the statement about binary compatibility is true for everything except the FP8-related functionality on SM89.
I understand that building PyTorch seems daunting. On the other hand, I don't have any packaging experience, so I can't provide you with a package built with SM89 enabled... It probably makes sense to open a new issue here, asking for SM89 to be enabled in the stable/nightly builds, and see what the packaging people have to say.
Done! https://github.com/pytorch/pytorch/issues/145632