qmcpack
intermittent failures of converter_test_* on crusher
Describe the bug
Can converter_test_* be run in parallel on crusher? With ctest -R converter_test_ --output-on-failure everything passes, but with ctest -R converter_test_ -j 64 --output-on-failure I get a bunch of test failures.
For example:
19/20 Test #65: converter_test_aldet1 ............***Failed Error regular expression found in output. Regex=[ FAIL] 15.99 sec
Stderr not empty
srun: Job 180229 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 180229
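A rough way to check whether these srun retry messages are the only thing on stderr is to filter them out and see what remains. The sketch below is Python; the helper name strip_srun_noise and the regular expression are assumptions based only on the lines shown above, not part of QMCPACK, ctest, or Slurm.

```python
import re

# Pattern derived from the srun messages in the log above (an assumption:
# other srun versions or sites may word these messages differently).
SRUN_NOISE = re.compile(
    r"^srun: (Job \d+ step creation (temporarily|still) disabled, retrying \(.*\)"
    r"|Step created for job \d+)$"
)

def strip_srun_noise(stderr_text: str) -> str:
    """Return stderr_text with blank lines and srun step-creation messages removed."""
    kept = [
        line
        for line in stderr_text.splitlines()
        if line.strip() and not SRUN_NOISE.match(line.strip())
    ]
    return "\n".join(kept)

# Sample taken from the failing test output above.
sample = """srun: Job 180229 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 180229
"""

print(repr(strip_srun_noise(sample)))
```

If the filtered result is empty, the "Stderr not empty" condition would be explained by the srun step-creation messages alone.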
To Reproduce Steps to reproduce the behavior:
- release version or git commit hash being built commit 863c1397eea6f257c8592df36e57915de7d7c695
- cmake command
cmake -DMPIEXEC_EXECUTABLE=`which srun` -DBOOST_ROOT=~/boost_1_78_0/ -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DENABLE_OFFLOAD=ON -DENABLE_CUDA=ON -DQMC_CUDA2HIP=ON -DHIP_ARCH=gfx90a ../
- full program/test invocation command
ctest -R converter_test_ -j 64 --output-on-failure
- additional steps
Expected behavior
All deterministic tests pass on crusher
System: crusher
- modules loaded [e.g. output of module list]
https://github.com/QMCPACK/qmcpack/blob/863c1397eea6f257c8592df36e57915de7d7c695/config/build_olcf_crusher_afar.sh
- other systems where this is reproducible [e.g. "my laptop", "none"]
Additional context
What is the error message in stderr file?