dbcsr
dbcsr copied to clipboard
Fixed #776
- Citation: "Setting cudaLimitPrintfFifoSize must not be performed after launching any kernel that uses the printf() device system call - in such case cudaErrorInvalidValue will be returned."
- Since DeviceSetLimit is governed by ACC_API_CALL, the symbol NDEBUG must not be defined for reproducing the issue.
Issue #776 was discovered when testing with enabled assertions, i.e., DBCSR's CUDA tests may have assertions removed. Perhaps it is valuable to test with enabled assertions.
@hfp for my understanding:
- We don't see the issue since we NDEBUG is defined? But we really don't define it in CP2K tests...
- We really never call printf on the GPU, do we?
- I don't understand the relation between NDEBUG and ACC_API_CALL
In any case, your change makes sense to me. I think the entire assumption was that the first call to the ACC part was c_dbcsr_acc_set_active_device, assuming we call it only once, which is clearly not that case... I think we can move the call to a more convenient place...
(BTW, trying to recover the Daint-CI output...)
CSCS CI seems broken on their side:
+ sbatch --wait --time=0:15:00 --account=g90 --job-name=DBCSR.gnu.build --output=sbatch.jenkins-g90-DBCSR-1116.gnu.build.out .ci/daint.cscs.ch/gnu.build.sh
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
java.nio.file.NoSuchFileException: /users/jenkg90/workspace/g90/DBCSR/sbatch.jenkins-g90-DBCSR-1116.gnu.build.out
but we have budget... Please ignore it for the moment.
The cp2k regression tests on Piz Daint are also disable, because the project g90
has expired. sbatch
returns
project "g90" expired on 2024-03-31
@juerghutter could you have a look?
Project g90 is open again (until 2025-03-31).
From: Matthias Krack @.***> Sent: Friday, April 5, 2024 10:13 AM To: cp2k/dbcsr Cc: Jürg Hutter; Mention Subject: Re: [cp2k/dbcsr] Fixed #776 (PR #777)
The cp2k regression tests on Piz Dainthttps://dashboard.cp2k.org/index.html are also disable, because the project g90 has expired. sbatch returns
project "g90" expired on 2024-03-31
@juerghutterhttps://github.com/juerghutter could you have a look?
— Reply to this email directly, view it on GitHubhttps://github.com/cp2k/dbcsr/pull/777#issuecomment-2039219944, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AD2WEURGQJDACA44SJJ3L7TY3ZMMNAVCNFSM6AAAAABFXKHJWWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMZZGIYTSOJUGQ. You are receiving this because you were mentioned.Message ID: @.***>
@hfp for my understanding:
* We don't see the issue since we NDEBUG is defined? But we really don't define it in CP2K tests...
Ok, I just assumed this because the issue came up when I removed NDEBUG
. I will check grep through CP2K just to make sure there is nothing else.
* We really never call printf on the GPU, do we?
No, we don't. Perhaps someone did so during development and wanted to keep this setting.
* I don't understand the relation between NDEBUG and [ACC_API_CALL](https://github.com/cp2k/dbcsr/blob/6db5b28d236de28e7293f783a3c2cc672d202f6b/src/acc/cuda/acc_cuda.h#L29)
ACK; see above (me neither ;-).
In any case, your change makes sense to me. I think the entire assumption was that the first call to the ACC part was c_dbcsr_acc_set_active_device, assuming we call it only once, which is clearly not that case... I think we can move the call to a more convenient place...
OK, this is good to go in principle. However, I will move the call into the init function.
I think we can move the call to a more convenient place...
What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.
I wonder if the code in question should be removed entirely?
I rebased the PR and if it's green (let's hope for Daint-CI), I will merge it. Removing (or moving) the code in question might be another PR.
ACK.