
[QST][CuTeDSL] How to warp-reduce `half2/bfloat162`

Open reyoung opened this issue 5 months ago • 5 comments

What is your question?

CuTeDSL lacks half2/bfloat162 types, so it is hard to express a warp reduce that takes half2/bfloat162 as its input type.

In CUDA C++, I can use

for (...) {
   max = __hmax2(max, __shfl_down_sync(..., val, ...));
}

to perform two reductions (one per packed 16-bit half) with a single __shfl_down instruction.
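For concreteness, here is a minimal sketch of the packed pattern I mean; the function name warp_reduce_max_half2 and the full-warp mask are just illustrative choices, not from any library:

#include <cuda_fp16.h>

// Sketch: warp-level max reduction on half2, two 16-bit maxima per shuffle.
// __hmax2 requires sm_80+; the 0xffffffff mask assumes a full, converged warp.
__device__ __half2 warp_reduce_max_half2(__half2 val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        // One 32-bit shuffle moves both packed halves at once;
        // __hmax2 then takes the element-wise max of the two halves.
        val = __hmax2(val, __shfl_down_sync(0xffffffff, val, offset));
    }
    return val;  // lane 0 ends up with the per-half maxima of the warp
}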

In CuTeDSL, can I do the same thing? Or, if I write something like

for (...) {
  max_1 = __hmax(max_1, __shfl_down_sync(..., val1, ...));
  max_2 = __hmax(max_2, __shfl_down_sync(..., val2, ...));
}

will the JIT compiler or ptxas fuse the two __shfl_down/__hmax pairs into a single instruction for me?
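For comparison, a sketch of the unpacked variant above in CUDA C++ (again only illustrative; whether the backend fuses the two shuffles is exactly what I am asking about):

#include <cuda_fp16.h>

// Sketch of the unpacked variant: two scalar __half reductions, each with its
// own shuffle and max. Whether ptxas merges the two 16-bit shuffles into one
// 32-bit shuffle is not guaranteed here.
__device__ void warp_reduce_max_2x_half(__half& max_1, __half& max_2) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        max_1 = __hmax(max_1, __shfl_down_sync(0xffffffff, max_1, offset));
        max_2 = __hmax(max_2, __shfl_down_sync(0xffffffff, max_2, offset));
    }
}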

reyoung avatar Jul 18 '25 06:07 reyoung


Thanks for reporting. I think there might be two bugs:

  • res = max(res, __shfl_down(val, ..., ...)) should work for vectorized data
  • the vectorized operation should be handled by the compiler so that it generates an hmax2 instruction

@brandon-yujie-sun

fengxie avatar Jul 23 '25 03:07 fengxie

I do not think this is a bug.

I am confused about what the best practice is when we want the compiler to generate a vectorised type or operation.

Option 1: Maybe CuTeDSL can export vectorised types?
Option 2: Perhaps I can pack two bfloat16 values into a uint32, hand-write PTX instructions against that packed value, and hope ptxas reuses registers for me (see the sketch after this list).
Option 3: Maybe CuTeDSL can guarantee the vectorised operation is generated by adding some Python-level constraints?
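A rough CUDA C++ illustration of Option 2 (my own sketch, using the packed-type intrinsics instead of hand-written PTX just to show the idea; in CuTeDSL it would presumably need inline PTX or a packed type):

#include <cuda_bf16.h>

// Sketch of Option 2: pack two bfloat16 values into one 32-bit register,
// shuffle once per step, and reduce element-wise with __hmax2 on the pair.
// Requires sm_80+; the 0xffffffff mask assumes a full, converged warp.
__device__ __nv_bfloat162 warp_reduce_max_bf162(__nv_bfloat16 a, __nv_bfloat16 b) {
    __nv_bfloat162 packed = __halves2bfloat162(a, b);  // pack the two scalars
    for (int offset = 16; offset > 0; offset >>= 1) {
        // A single 32-bit shuffle moves both packed values at once.
        packed = __hmax2(packed, __shfl_down_sync(0xffffffff, packed, offset));
    }
    return packed;  // lane 0 holds both maxima
}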

reyoung avatar Jul 29 '25 11:07 reyoung

Option 1: Maybe CuTeDSL can export vectorised types?

TensorSSA is a vectorized type, so for this case using TensorSSA with res = max(res, __shfl_down(val, ..., ...)) should just work. But we currently don't handle this correctly.

fengxie avatar Jul 30 '25 15:07 fengxie

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Aug 29 '25 16:08 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Nov 27 '25 16:11 github-actions[bot]