[BUG] WMMA should be enabled with clang.
Describe the bug: CUTLASS currently disables WMMA instructions when compiled with clang.
https://github.com/NVIDIA/cutlass/blob/e1cd8c7866dd6de02b66a89879795e7d7301aacc/include/cutlass/arch/wmma.h#L37-L38
The comment is no longer valid and the tests work fine with the condition above removed.
Steps/Code to reproduce bug: Build test/unit/gemm/warp/gemm_sm75.cu with clang and run it on an A100. Currently the test fails with:
[----------] 5 tests from SM75_warp_gemm_tensor_op_crosswise_b1
[ RUN ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_64x64x512_8x8x128
void cutlass::arch::Mma<cutlass::gemm::GemmShape<8, 8, 128>, 32, cutlass::integer_subbyte<1, false>, cutlass::layout::RowMajor, cutlass::integer_subbyte<1, false>, cutlass::layout::ColumnMajor, int, cutlass::layout::RowMajor, cutlass::arch::OpXorPopc>::operator()(FragmentC &, const FragmentA &, const FragmentB &, const FragmentC &) const not implemented
...
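The failure is the generic, unspecialized cutlass::arch::Mma fallback firing at run time: with the clang guard in place, CUTLASS_ARCH_WMMA_ENABLED is never defined, so the WMMA-backed b1 specialization is compiled out. A minimal, hypothetical analog of that pattern (plain C++, names invented for illustration, not CUTLASS source):

#include <cstdio>

struct OpXorPopc {};

// Generic fallback: reached whenever no specialization was compiled in.
template <typename Op>
struct Mma {
  void operator()() const { std::printf("%s not implemented\n", __PRETTY_FUNCTION__); }
};

// Analog of the CUTLASS_ARCH_WMMA_ENABLED-guarded specialization: when the
// macro is off (as it is under clang today), this whole block disappears.
#if defined(FEATURE_WMMA_ENABLED)
template <>
struct Mma<OpXorPopc> {
  void operator()() const { std::printf("real bmma path\n"); }
};
#endif

int main() {
  Mma<OpXorPopc>{}();  // prints "not implemented" unless built with -DFEATURE_WMMA_ENABLED
}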
Expected behavior: With the condition above removed, all five tests pass:
[----------] 5 tests from SM75_warp_gemm_tensor_op_crosswise_b1
[ RUN ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_64x64x512_8x8x128
[ OK ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_64x64x512_8x8x128 (79 ms)
[ RUN ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_64x32x512_8x8x128
[ OK ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_64x32x512_8x8x128 (58 ms)
[ RUN ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_32x32x512_8x8x128
[ OK ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_32x32x512_8x8x128 (40 ms)
[ RUN ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_32x16x512_8x8x128
[ OK ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_32x16x512_8x8x128 (53 ms)
[ RUN ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_16x16x512_8x8x128
[ OK ] SM75_warp_gemm_tensor_op_crosswise_b1.128x128x512_16x16x512_8x8x128 (38 ms)
[----------] 5 tests from SM75_warp_gemm_tensor_op_crosswise_b1 (270 ms total)
This needs to be fixed in a few more places:
diff --git a/include/cutlass/arch/wmma.h b/include/cutlass/arch/wmma.h
--- a/include/cutlass/arch/wmma.h
+++ b/include/cutlass/arch/wmma.h
@@ -34,9 +34,6 @@
#pragma once
-// CUTLASS WMMA does not support clang at present.
-#if !(defined(__clang__) && defined(__CUDA__))
-
#if (__CUDACC_VER_MAJOR__ >= 9)
#if (!defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 700))
#define CUTLASS_ARCH_WMMA_ENABLED
@@ -58,8 +55,6 @@
#endif
#endif
-#endif //!(defined(__clang__) && defined(__CUDA__))
-
#if defined(CUTLASS_ARCH_WMMA_ENABLED)
#include <mma.h>
diff --git a/include/cutlass/epilogue/warp/fragment_iterator_wmma_tensor_op.h b/include/cutlass/epilogue/warp/fragment_iterator_wmma_tensor_op.h
--- a/include/cutlass/epilogue/warp/fragment_iterator_wmma_tensor_op.h
+++ b/include/cutlass/epilogue/warp/fragment_iterator_wmma_tensor_op.h
@@ -43,8 +43,6 @@
#pragma once
-#if !(defined(__clang__) && defined(__CUDA__))
-
#include "third_party/gpus/cutlass/include/cutlass/wmma_array.h"
#include "third_party/gpus/cutlass/include/cutlass/layout/matrix.h"
@@ -158,7 +156,3 @@ public:
////////////////////////////////////////////////////////////////////////////////
-#else
-#error (defined(__clang__) && defined(__CUDA__))
-#endif // !defined(__clang__)
-
diff --git a/include/cutlass/epilogue/warp/tile_iterator_wmma_tensor_op.h b/include/cutlass/epilogue/warp/tile_iterator_wmma_tensor_op.h
--- a/include/cutlass/epilogue/warp/tile_iterator_wmma_tensor_op.h
+++ b/include/cutlass/epilogue/warp/tile_iterator_wmma_tensor_op.h
@@ -34,8 +34,6 @@
#pragma once
-#if !(defined(__clang__) && defined(__CUDA__))
-
#include "third_party/gpus/cutlass/include/cutlass/cutlass.h"
#include "third_party/gpus/cutlass/include/cutlass/wmma_array.h"
#include "third_party/gpus/cutlass/include/cutlass/layout/matrix.h"
@@ -223,5 +221,3 @@ public:
/////////////////////////////////////////////////////////////////////////////////////////////////
-#endif // !defined(__clang__)
-
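For context, what CUTLASS_ARCH_WMMA_ENABLED ultimately gates is code built on the nvcuda::wmma API from <mma.h>, which clang's CUDA support appears to handle now (see the passing tests above). A minimal standalone sketch of that API, not taken from CUTLASS, compiled for sm_70 or newer:

#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes a single 16x16x16 f16*f16+f32 tile.
// a: row-major 16x16 half, b: column-major 16x16 half, c: row-major 16x16 float.
__global__ void wmma_tile(half const *a, half const *b, float *c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
  wmma::load_matrix_sync(b_frag, b, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

If clang compiles kernels like this correctly, the blanket #if !(defined(__clang__) && defined(__CUDA__)) guard serves no remaining purpose.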
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
ping.
Another ping. Is there anybody out there?
@hwu36 @d-k-b do you folks know why this is disabled for clang?
@Artem-B do you really want WMMA for clang? AFAIK no one really uses it since it comes with a 50% perf penalty over using the PTX for Ampere/Turing directly.
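For readers following along: "using the PTX directly" means inline mma.sync instructions like those in cutlass/arch/mma_sm75.h. A rough hand-written sketch of that style for one Turing-era f16 shape; register layouts follow the PTX ISA, and distributing a real tile across the warp is omitted:

#include <cstdint>

// Hedged sketch of the "direct PTX" alternative to nvcuda::wmma: one
// m16n8k8 f16*f16+f32 mma.sync per warp (sm_75+). The helper name and the
// caller-provided fragment packing are illustrative, not CUTLASS's.
__device__ void mma_m16n8k8_f16(uint32_t const (&a)[2],  // 4 halves of A per thread
                                uint32_t const (&b)[1],  // 2 halves of B per thread
                                float (&acc)[4]) {       // 4 floats of C/D per thread
  asm volatile(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%0,%1,%2,%3};\n"
      : "+f"(acc[0]), "+f"(acc[1]), "+f"(acc[2]), "+f"(acc[3])
      : "r"(a[0]), "r"(a[1]), "r"(b[0]));
}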
We can re-enable them if clang is happy now. PR please?
@thakkarV -- as far as I recall, as @Artem-B mentioned, clang originally did not support WMMA properly, so we disabled it when adding initial clang support. If it's working now, I see no reason to keep it disabled.
@Junkai-Wu I'll re-enable them in our next release.
Fixed in CUTLASS v3.9.