dbcsr CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue

Open hfp opened this issue 10 months ago • 6 comments

This PR seems to cause:

CUDA RUNTIME API error: DeviceSetLimit failed with error cudaErrorInvalidValue.

(tested on an H100 device)

Originally posted by @hfp in https://github.com/cp2k/dbcsr/issues/767#issuecomment-2034752764

Apr 03 '24 14:04 hfp

According to the CUDA description:

cudaLimitPrintfFifoSize controls the size in bytes of the shared FIFO used by the printf() device system call. Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned.

But then we don't call any printf (all are masked). And I don't understand why we see this problem only on H100...

Apr 03 '24 14:04 alazzaro

I do not understand it either; I have simply not root-caused the issue yet, let alone reported the software versions such as CUDA (or HPC SDK). I am currently retrying with this change.

Apr 03 '24 14:04 hfp

I do not understand it either; I have simply not root-caused the issue yet, let alone reported the software versions such as CUDA (or HPC SDK). I am currently retrying with this change.

It makes sense...

Apr 03 '24 14:04 alazzaro

Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined in order to reproduce the issue.

Apr 03 '24 14:04 hfp
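
As an illustration of the NDEBUG remark above, here is a minimal sketch in C of an error-checking wrapper whose reporting is compiled out when NDEBUG is defined. It is not DBCSR's actual ACC_API_CALL definition. The names CHECKED_CALL, fake_set_limit, try_set_limit, and the status codes are hypothetical stand-ins chosen so that the example builds without a GPU or the CUDA toolkit. Whether the real macro still issues the call under NDEBUG is an assumption made here.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical status codes standing in for a runtime API's error enum. */
enum { FAKE_SUCCESS = 0, FAKE_INVALID_VALUE = 1 };

/* Number of failures the wrapper has reported so far. */
static int reported_failures = 0;

/* Hypothetical API call that always fails, standing in for a
 * DeviceSetLimit call that returns an invalid-value error. */
static int fake_set_limit(int limit, size_t value) {
  (void)limit;
  (void)value;
  return FAKE_INVALID_VALUE;
}

#if defined(NDEBUG)
/* Assumed release behaviour: the call is made, its result is ignored. */
#define CHECKED_CALL(func, args) \
  do { \
    (void)func args; \
  } while (0)
#else
/* Assumed debug behaviour: a failing call is reported
 * (a real wrapper might also abort at this point). */
#define CHECKED_CALL(func, args) \
  do { \
    int result_ = func args; \
    if (result_ != FAKE_SUCCESS) { \
      fprintf(stderr, "API error: %s failed with error %d\n", #func, result_); \
      ++reported_failures; \
    } \
  } while (0)
#endif

/* Issues the failing call once and returns how many failures were
 * reported: 1 when NDEBUG is not defined, 0 when it is. */
int try_set_limit(void) {
  reported_failures = 0;
  CHECKED_CALL(fake_set_limit, (0, (size_t)1024));
  return reported_failures;
}
```

Calling try_set_limit() from a program built without -DNDEBUG prints the error line and returns 1. The same program built with -DNDEBUG stays silent and returns 0, which is why a build with NDEBUG defined would not show the reported message.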

Let's leave this ticket open... I think the issue arises when the RT fails to build a kernel, but I'm not sure...

Apr 16 '24 14:04 alazzaro

(Taking over from https://github.com/cp2k/dbcsr/pull/777#issuecomment-2059160289)

I think we can move the call to a more convenient place...

What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.

I wonder if the code in question should be removed entirely?

I am starting to think this is the right solution... but I need more time to investigate it (see my previous comment).

Apr 16 '24 14:04 alazzaro
