dbcsr

Fixed #776

Open hfp opened this issue 10 months ago • 7 comments

  • Citation: "Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned."
  • Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.

hfp avatar Apr 04 '24 13:04 hfp

Issue #776 was discovered when testing with assertions enabled, i.e., DBCSR's CUDA tests may be running with assertions removed. Perhaps it would be valuable to test with assertions enabled.

hfp avatar Apr 04 '24 14:04 hfp

@hfp for my understanding:

  • We don't see the issue because NDEBUG is defined? But we really don't define it in CP2K tests...
  • We really never call printf on the GPU, do we?
  • I don't understand the relation between NDEBUG and ACC_API_CALL

In any case, your change makes sense to me. I think the entire assumption was that the first call to the ACC part was c_dbcsr_acc_set_active_device, assuming we call it only once, which is clearly not the case... I think we can move the call to a more convenient place...

alazzaro avatar Apr 04 '24 14:04 alazzaro

(BTW, trying to recover the Daint-CI output...)

alazzaro avatar Apr 04 '24 14:04 alazzaro

CSCS CI seems broken on their side:

+ sbatch --wait --time=0:15:00 --account=g90 --job-name=DBCSR.gnu.build --output=sbatch.jenkins-g90-DBCSR-1116.gnu.build.out .ci/daint.cscs.ch/gnu.build.sh
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
java.nio.file.NoSuchFileException: /users/jenkg90/workspace/g90/DBCSR/sbatch.jenkins-g90-DBCSR-1116.gnu.build.out

but we have budget... Please ignore it for the moment.

alazzaro avatar Apr 04 '24 15:04 alazzaro

The cp2k regression tests on Piz Daint are also disabled, because the project g90 has expired. sbatch returns

project "g90" expired on 2024-03-31

@juerghutter could you have a look?

mkrack avatar Apr 05 '24 08:04 mkrack

Project g90 is open again (until 2025-03-31).



juerghutter avatar Apr 05 '24 09:04 juerghutter

> @hfp for my understanding:
>
> * We don't see the issue because NDEBUG is defined? But we really don't define it in CP2K tests...

OK, I just assumed this because the issue came up when I removed NDEBUG. I will grep through CP2K just to make sure there is nothing else.

> * We really never call printf on the GPU, do we?

No, we don't. Perhaps someone did so during development and wanted to keep this setting.

> * I don't understand the relation between NDEBUG and [ACC_API_CALL](https://github.com/cp2k/dbcsr/blob/6db5b28d236de28e7293f783a3c2cc672d202f6b/src/acc/cuda/acc_cuda.h#L29)

ACK; see above (me neither ;-).

> In any case, your change makes sense to me. I think the entire assumption was that the first call to the ACC part was c_dbcsr_acc_set_active_device, assuming we call it only once, which is clearly not the case... I think we can move the call to a more convenient place...

OK, this is good to go in principle. However, I will move the call into the init function.

hfp avatar Apr 10 '24 12:04 hfp

> I think we can move the call to a more convenient place...

What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.

I wonder if the code in question should be removed entirely?

hfp avatar Apr 16 '24 13:04 hfp

I rebased the PR and if it's green (let's hope for Daint-CI), I will merge it. Removing (or moving) the code in question might be another PR.

hfp avatar Apr 16 '24 14:04 hfp

ACK.

alazzaro avatar Apr 16 '24 14:04 alazzaro