
[BUG] Misaligned address when running GEMM with SM90 EVT-based epilogue

Open kadeng opened this issue 1 year ago • 10 comments

When running a standalone CUTLASS GEMM with a generated SM90 EVT-based epilogue that loads two auxiliary inputs (one broadcast, one with full dimensionality), I get a CUDA misaligned-address error. Since all participating tensors are aligned to at least 512 bytes, this could be a CUTLASS bug. On manual inspection, I could not find a problem on the user side of the code.

Breaking at the point of error in cuda-gdb and running the command "x/10i $errorpc" shows that the faulting instruction is a "UTMALDG.2D" SASS instruction.
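Since UTMALDG appears to be the SASS form of a Tensor Memory Accelerator (TMA) load, one quick sanity check is to verify on the host that every tensor handed to the epilogue meets the usual TMA layout constraints. The sketch below is only an illustration: `is_aligned` and `tma_2d_layout_ok` are hypothetical helper names, not CUTLASS APIs, and the 16-byte requirement on the base address and the row stride is an assumption taken from the CUDA tensor map documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `ptr` is aligned to `alignment` bytes.
// `alignment` must be a non-zero power of two.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Hypothetical helper (not part of CUTLASS): checks a row-major 2D tensor
// against the assumed TMA constraints, i.e. a 16-byte aligned base address
// and a row stride (in bytes) that is a multiple of 16.
inline bool tma_2d_layout_ok(const void* base,
                             std::size_t row_stride_elems,
                             std::size_t elem_bytes) {
    constexpr std::size_t kTmaAlign = 16;  // assumption, see lead-in
    return is_aligned(base, kTmaAlign) &&
           (row_stride_elems * elem_bytes) % kTmaAlign == 0;
}
```

If a check like this passes for A, B, C, D and both auxiliary tensors, the misalignment is more likely to come from how the tensor maps or shared-memory buffers are set up inside the kernel than from the user-provided pointers.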

One of the inputs (called X) loaded as auxiliary input is actually also operand A for the GEMM.

Code to reproduce, environment info and build / run instructions are here: https://gist.github.com/kadeng/1e44299d22ce5a11da55ad0e5f328d3f

The code is generated as part of the experimental CUTLASS backend for PyTorch's Inductor JIT compiler.

kadeng avatar Dec 06 '23 21:12 kadeng

If this example is changed such that the loaded auxiliary operand is of the same shape but not the same (pointer) as operand A, the error does not happen. So it's likely an address conflict. Is there anything that can be done to allow this? It's pretty common to have this kind of residual connection, e.g. something like "activation(a @ b) + a".
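Given that observation, one possible stopgap is to hand the epilogue a separate copy whenever the auxiliary pointer is identical to operand A. The following is a host-side sketch only: `dealias_aux` is a hypothetical helper, and a real version would copy device memory (for example with `cudaMemcpy`) into a buffer that is itself suitably aligned. It costs an extra buffer and a copy, and it does not address whatever the root cause is.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical workaround helper (not a CUTLASS API): if the auxiliary input
// is the very same pointer as operand A, copy it into `scratch` and return
// the copy; otherwise return the auxiliary pointer unchanged.
inline const void* dealias_aux(const void* operand_a,
                               const void* aux,
                               std::size_t bytes,
                               std::vector<unsigned char>& scratch) {
    assert(aux != nullptr);
    if (aux != operand_a || bytes == 0) {
        return aux;  // no aliasing (or nothing to copy): use the input as-is
    }
    scratch.resize(bytes);
    std::memcpy(scratch.data(), aux, bytes);  // duplicate A into scratch
    return scratch.data();
}
```

The caller keeps `scratch` alive until the GEMM has finished and passes the returned pointer as the auxiliary load pointer in the epilogue arguments.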

kadeng avatar Dec 06 '23 21:12 kadeng

@thakkarV @richardmcai

hwu36 avatar Dec 06 '23 22:12 hwu36

Just as additional info: Adding -DNDEBUG and -O3 and removing -g and -lineinfo from the build flags does not make a difference here in my tests.

kadeng avatar Dec 12 '23 19:12 kadeng

@kadeng can you clarify what the shapes of these operands are supposed to be? EVT currently only supports loading of MNL-shape tensors, or broadcasting scalars/vectors to MNL-shape tensors. It's not clear to me what the result of activation(a @ b) + a is supposed to be since the activation(a @ b) is shape MNL and a is shape MKL, unless we assume that N == K here.
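To make the shape constraint concrete, here is a small host-side reference for "activation(a @ b) + a". It is a sketch under stated assumptions: ReLU stands in for the unspecified activation, the batch dimension L is omitted, all matrices are row-major, and `gemm_relu_residual` is a hypothetical name. The residual add indexes A with MxN coordinates, which is only valid when N == K.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference for D = relu(A @ B) + A with row-major storage.
// A is M x K, B is K x N, D is M x N. The residual term reads A at (m, n),
// so A must also be M x N, which requires N == K.
std::vector<float> gemm_relu_residual(const std::vector<float>& A,
                                      const std::vector<float>& B,
                                      std::size_t M, std::size_t N,
                                      std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    assert(N == K && "residual add needs A to have the output shape M x N");
    std::vector<float> D(M * N);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            const float act = acc > 0.0f ? acc : 0.0f;  // ReLU placeholder
            D[m * N + n] = act + A[m * N + n];          // residual connection
        }
    }
    return D;
}
```

For example, with A = [[1, 2], [3, 4]] and B = [[1, 0], [0, -1]], the product is [[1, -2], [3, -4]], ReLU gives [[1, 0], [3, 0]], and adding A yields [[2, 2], [6, 4]].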

> If this example is changed such that the loaded auxiliary operand is of the same shape but not the same (pointer) as operand A, the error does not happen.

Do you mean not of the same shape as A?

richardmcai avatar Dec 12 '23 20:12 richardmcai

If I remember correctly, A and B are both square matrices of the same shape here. You can find the details in the gist, which contains a standalone source code example to reproduce it, including all shapes.

kadeng avatar Dec 12 '23 20:12 kadeng

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Jan 11 '24 21:01 github-actions[bot]

@kadeng did you resolve your issue?

mnicely avatar Feb 22 '24 15:02 mnicely

No, this is still a bug as far as I can tell. It's not urgent, though, since we're not using auxiliary inputs anymore.

kadeng avatar Feb 22 '24 15:02 kadeng

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Mar 23 '24 16:03 github-actions[bot]

@apuaaChen, your first assignment :)

hwu36 avatar Apr 18 '24 17:04 hwu36

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar May 18 '24 18:05 github-actions[bot]