cutlass icon indicating copy to clipboard operation
cutlass copied to clipboard

[QST] Is it possible to detect output coordinates in elementwise epilogue ?

Open ankutalev opened this issue 1 year ago • 4 comments

What is your question?

Hello! I want to implement elementwise epilogue, which depends on output matrix coordinates, i.e.

d_ij = F(alpha * sum_k(a_ik * b_kj) + c_ij, i, j)

It feels like providing own ActivationFunctor into cutlass::epilogue::thread::LinearCombinationGeneric are the simplest and intuitive way to implement own custom epilogue, but it looks like impossible to determine element coordinates inside ActivationFunctor::operator().

Am I right that there is no easy way of doing this?

Also I'm fine with any solution based on any cutlass major version (2/3).

Thanks!

ankutalev avatar Feb 05 '24 14:02 ankutalev

For CUTLASS 3.x epilogues based on CuTe, its trivial to inject the coordinate from the collective epi into the thread functor.

We already create the coordinate tensor for the purposes of in-bounds check and predication here: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp#L231

You can just write a custom epi thread op and pass the C and D coordinates into it.

I.e. extend this line:

https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp#L307

to include the coordinate tensor output as well (tDcDmn tensor in the code)

Same applies to the direct store non-optimized epi

thakkarV avatar Feb 05 '24 15:02 thakkarV

@hwu36 can help answer for 2.x API epilogues.

thakkarV avatar Feb 05 '24 15:02 thakkarV

@thakkarV thank you, I will check your proposal and response later.

But I will be pleased to have answer for 2.x API too, so response from @hwu36 will not be superfluous.

ankutalev avatar Feb 05 '24 15:02 ankutalev

I feel it needs some plumbing to get it work in cutlass 2.x. when the epilogue calls store or load, the coordinates are calculated inside the iterator: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h#L410-L411

activation function is called however in the upper level: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/epilogue.h#L374 which is right before calling the iterator to call the store. So, if you need to use cutlass 2.x, you need to use the iterator logic to compute row, col and then choose the activation func, and then calling store.

hwu36 avatar Feb 05 '24 22:02 hwu36

@ankutalev has your issues been resolved?

mnicely avatar Feb 22 '24 14:02 mnicely

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Mar 23 '24 15:03 github-actions[bot]

@mnicely , yes, sorry for late reply

ankutalev avatar Mar 25 '24 10:03 ankutalev