dbcsr
CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue
This PR seems to cause:
CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue.
(tested on an H100 device)
Originally posted by @hfp in https://github.com/cp2k/dbcsr/issues/767#issuecomment-2034752764
According to the CUDA description:
cudaLimitPrintfFifoSize controls the size in bytes of the shared FIFO used by the printf() device system call. Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned.
But we don't call printf anywhere (all calls are masked), and I don't understand why we see this problem only on H100...
I do not understand it either. I have not root-caused the issue yet, let alone reported the software versions such as CUDA (or HPCSDK). I am currently retrying with this change.
It makes sense...
Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.
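To illustrate why NDEBUG matters here, below is a minimal sketch in plain C. It is not the actual DBCSR macro and it does not call the CUDA runtime: `SKETCH_API_CALL`, `fakeDeviceSetLimit`, `fake_error_t`, `kernel_with_printf_launched`, and `set_printf_fifo_size` are hypothetical names made up for this example. The stub only mimics the rule quoted above (setting the printf FIFO size after a kernel that uses printf() was launched returns an invalid-value error), and the macro assumes that the error check is compiled out when NDEBUG is defined. The real ACC_API_CALL may behave differently, for example by terminating the program instead of counting errors.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the CUDA runtime types; NOT the real API. */
typedef enum { fakeSuccess = 0, fakeErrorInvalidValue = 1 } fake_error_t;

/* Set to 1 to pretend that a kernel using printf() was already launched. */
static int kernel_with_printf_launched = 0;

/* Stub mimicking the documented rule: changing the printf FIFO size after
 * a kernel that uses printf() has been launched yields an invalid-value error. */
static fake_error_t fakeDeviceSetLimit(size_t bytes) {
  (void)bytes; /* the size itself is irrelevant for this sketch */
  return kernel_with_printf_launched ? fakeErrorInvalidValue : fakeSuccess;
}

/* Number of failures reported by the checking variant of the macro. */
static int reported_errors = 0;

#if defined(NDEBUG)
/* Assumption: with NDEBUG the call is still made, but the result is ignored,
 * so the failure is silent and the issue cannot be reproduced. */
# define SKETCH_API_CALL(func, args) ((void)func args)
#else
/* Without NDEBUG the result is checked and a failure is reported. */
# define SKETCH_API_CALL(func, args) \
  do { \
    fake_error_t result_ = func args; \
    if (fakeSuccess != result_) { \
      fprintf(stderr, "API error: %s failed with error %d\n", #func, (int)result_); \
      ++reported_errors; \
    } \
  } while (0)
#endif

/* Wraps the limit call; returns the number of errors reported so far. */
static int set_printf_fifo_size(size_t bytes) {
  SKETCH_API_CALL(fakeDeviceSetLimit, (bytes));
  return reported_errors;
}
```

In a build without NDEBUG, calling `set_printf_fifo_size` after setting `kernel_with_printf_launched = 1` reports the failure. With `-DNDEBUG` the same call goes through silently, which would be consistent with the error only showing up in builds where NDEBUG is not defined.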
Let's leave this ticket open... I think the issue here is when the RT fails to build a kernel, but I'm not sure...
(Taking over from https://github.com/cp2k/dbcsr/pull/777#issuecomment-2059160289)
I think we can move the call to a more convenient place...
What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.
I wonder if the code in question should be removed entirely?
I am starting to think this is the right solution... But I need more time to investigate it (see my previous comment).