CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue

Open hfp opened this issue 10 months ago • 6 comments

This PR seems to cause:

CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue.

( tested on H100 device )

Originally posted by @hfp in https://github.com/cp2k/dbcsr/issues/767#issuecomment-2034752764

hfp avatar Apr 03 '24 14:04 hfp

According to the CUDA description:

cudaLimitPrintfFifoSize controls the size in bytes of the shared FIFO used by the printf() device system call. Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned.

But then we don't call any printf (all are masked). And I don't understand why we see this problem only on H100...

alazzaro avatar Apr 03 '24 14:04 alazzaro

I do not understand it either. I have not root-caused the issue yet, let alone reported the software versions such as CUDA (or HPCSDK). I am currently retrying with this change.

hfp avatar Apr 03 '24 14:04 hfp

I do not understand it either. I have not root-caused the issue yet, let alone reported the software versions such as CUDA (or HPCSDK). I am currently retrying with this change.

it makes sense...

alazzaro avatar Apr 03 '24 14:04 alazzaro

Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.
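
To illustrate why NDEBUG matters here, a minimal sketch follows. It is not DBCSR's actual ACC_API_CALL definition: `CHECKED_CALL`, `set_limit_stub`, `Status`, and `run_demo` are hypothetical names, and the sketch assumes that with NDEBUG defined the call is still issued and only the status check is dropped. The real macro may behave differently.

```cpp
#include <cstdio>

// Hypothetical status codes; stand-ins for a runtime's error enum.
enum Status { kSuccess = 0, kInvalidValue = 1 };

// Number of failures reported by the checking macro.
static int g_reported = 0;

// Hypothetical stub that always fails, mimicking a call that
// returns an "invalid value" error. Not a real CUDA function.
static Status set_limit_stub() { return kInvalidValue; }

// Hypothetical checking macro (assumption, not the DBCSR macro):
// with NDEBUG defined the status is ignored, so a failure goes unnoticed;
// without NDEBUG the failure is reported.
#ifdef NDEBUG
#define CHECKED_CALL(call) ((void)(call))
#else
#define CHECKED_CALL(call)                                          \
  do {                                                              \
    Status s_ = (call);                                             \
    if (s_ != kSuccess) {                                           \
      std::fprintf(stderr, "API error: %s failed with status %d\n", \
                   #call, static_cast<int>(s_));                    \
      ++g_reported;                                                 \
    }                                                               \
  } while (0)
#endif

// Issues one failing call through the macro and returns how many
// failures were reported.
int run_demo() {
  g_reported = 0;
  CHECKED_CALL(set_limit_stub());
  return g_reported;
}
```

Under these assumptions, a build with `-DNDEBUG` never reports the failing call, which would be consistent with the error only showing up in builds where NDEBUG is not defined.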

hfp avatar Apr 03 '24 14:04 hfp

Let's leave this ticket open... I think the issue here is when the RT fails to build a kernel, but I'm not sure...

alazzaro avatar Apr 16 '24 14:04 alazzaro

(Taking over from https://github.com/cp2k/dbcsr/pull/777#issuecomment-2059160289)

I think we can move the call to a more convenient place...

What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.

I wonder if the code in question should be removed entirely?

I am starting to think this is the right solution... But I need more time to investigate it (see my previous comment).

alazzaro avatar Apr 16 '24 14:04 alazzaro