cutlass
[QST] Question about CUTLASS epilogue customization
What is your question? Does the CUTLASS epilogue support customization? I want to perform a bias addition after the matmul operation in CUTLASS. Additionally, I would like to apply different activation functions to different regions of the output (for example, sigmoid and tanh). Is it possible to implement this? If so, could you please show me how to do it?
Yes, you can do any elementwise operation you want. Are you using cutlass 2.x or 3.x? Which architecture?
I am using 3.x on an A100 (SM80). Could you help me?
I would like to apply different activation functions to different regions of a single A*B result (for example, sigmoid and tanh).
Can CUTLASS do this? @hwu36
@thakkarV to comment on how to do it with 3.x on A100.
cc @apuaaChen for thoughts on how to do this with CUTLASS 3.x SM80 EVT (likely would need some added ops)
This should be similar to epilogue scatter fusion, since it also needs to compute the row number.
This is pretty easy to do in the CUTLASS 3 epilogues, but it is not something you can do out of the box, so you will have to make a minor modification for a custom epilogue: one additional branch that dispatches to the activation function depending on the coordinate of the output tensor. You already have access to the coordinate tensors for the purpose of predicated stores to gmem, so just use those. https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp#L230
By the way, I do not want to discourage you from using the 3.x API on Ampere; it's totally kosher. We just recommend the 2.x API for the best-performing Ampere kernels, since they have been well tuned over the years.
The alternative solution is to just launch two different kernels on two separate streams, which will likely give you equivalent or perhaps even better perf, depending on the problem shapes and on whether the boundary between the activation functions falls within or across output tiles.
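To make the per-coordinate branch concrete, here is a minimal host-side reference in plain C++, not CUTLASS code. It assumes a row-major M x N accumulator, a per-column bias, and a split at a row index; the names `region_activation`, `reference_epilogue`, and `split_row` are hypothetical and only illustrate the logic a custom epilogue would apply per element. A reference like this can also be used to check the output of the fused kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: choose the activation from the output row.
// Rows [0, split_row) use sigmoid; rows [split_row, M) use tanh.
inline float region_activation(float x, int row, int split_row) {
    if (row < split_row) {
        return 1.0f / (1.0f + std::exp(-x));  // sigmoid
    }
    return std::tanh(x);
}

// Reference epilogue over a row-major M x N accumulator:
// add a per-column bias, then apply the activation selected by the row.
inline std::vector<float> reference_epilogue(const std::vector<float>& acc,
                                             const std::vector<float>& bias,
                                             int m, int n, int split_row) {
    assert(acc.size() == static_cast<std::size_t>(m) * n);
    assert(bias.size() == static_cast<std::size_t>(n));
    std::vector<float> out(acc.size());
    for (int row = 0; row < m; ++row) {
        for (int col = 0; col < n; ++col) {
            const std::size_t idx = static_cast<std::size_t>(row) * n + col;
            out[idx] = region_activation(acc[idx] + bias[col], row, split_row);
        }
    }
    return out;
}
```

In a real epilogue the `(row, col)` pair would come from the coordinate tensor mentioned above rather than from loop indices, and the region test could be any predicate on the coordinate.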
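As a rough sketch of the two-kernel alternative, the helper below splits one row-major GEMM into two sub-problems along a row boundary, so that each could be launched as its own kernel with its own activation. It is plain C++ bookkeeping only; `SubGemm`, `SplitPlan`, and `split_by_row` are hypothetical names, and the layout (row-major A and D, B shared by both halves) is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one sub-problem of the original GEMM.
struct SubGemm {
    int m, n, k;           // problem shape of this half
    std::size_t a_offset;  // element offset into A (row-major, leading dim lda)
    std::size_t d_offset;  // element offset into D (row-major, leading dim ldd)
};

struct SplitPlan {
    SubGemm top;     // rows [0, split_row), e.g. the sigmoid region
    SubGemm bottom;  // rows [split_row, m), e.g. the tanh region
};

// Split an M x N x K GEMM at split_row. B is read in full by both halves,
// so only the A and D base pointers need to be offset.
inline SplitPlan split_by_row(int m, int n, int k, int lda, int ldd, int split_row) {
    assert(0 <= split_row && split_row <= m);
    SplitPlan plan;
    plan.top = {split_row, n, k, 0, 0};
    plan.bottom = {m - split_row, n, k,
                   static_cast<std::size_t>(split_row) * lda,
                   static_cast<std::size_t>(split_row) * ldd};
    return plan;
}
```

Each half would then be passed to a GEMM instantiated with the matching activation and launched on its own stream; the launch itself is omitted because it depends on the kernel type you build.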
Thank you, everyone! You helped me a lot. I will try it in the morning.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
@zwshan have you resolved your issue?
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Hi! It can be supported by adding a new node. To get the row number, you can follow the example here: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/epilogue/threadblock/fusion/visitor_store.hpp#L749-L752 The coord at line 749 holds the row number, column number, and batch number.