intermittent failures of converter_test_* on crusher

Open quantumsteve opened this issue 3 years ago • 1 comment

Describe the bug

Can converter_test_* be run in parallel on crusher? With ctest -R converter_test_ --output-on-failure everything passes, but with ctest -R converter_test_ -j 64 --output-on-failure I get a bunch of test failures.

For example:

19/20 Test #65: converter_test_aldet1 ............***Failed  Error regular expression found in output. Regex=[  FAIL] 15.99 sec
Stderr not empty
srun: Job 180229 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 180229 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 180229
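
To check whether the non-empty stderr holds anything besides srun's step-creation notices, the captured stderr can be filtered for the two message shapes seen in the log above. This is only a minimal sketch: the function name filter_srun_notices is hypothetical, the patterns are copied from this log rather than from Slurm documentation, and the issue does not confirm that these notices are the only stderr content.

```shell
# Hypothetical helper (not part of QMCPACK or Slurm): read captured stderr on
# stdin and drop the srun step-creation notices seen in the log above, so any
# other stderr lines stand out.
filter_srun_notices() {
  # grep -v exits with status 1 when every line is filtered out; "|| true"
  # keeps that case from being treated as an error.
  grep -v -E '^srun: (Job [0-9]+ step creation (temporarily|still) disabled, retrying|Step created for job [0-9]+)' || true
}

# Example with two lines taken from the log above; both are filtered out.
printf '%s\n' \
  'srun: Job 180229 step creation temporarily disabled, retrying (Requested nodes are busy)' \
  'srun: Step created for job 180229' \
  | filter_srun_notices
```

If nothing is left after filtering, the stderr check is most likely reacting to srun's own messages rather than to converter output; any line that survives would be worth posting in full.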

To Reproduce Steps to reproduce the behavior:

  1. release version or git commit hash being built: commit 863c1397eea6f257c8592df36e57915de7d7c695
  2. cmake command: cmake -DMPIEXEC_EXECUTABLE=`which srun` -DBOOST_ROOT=~/boost_1_78_0/ -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DENABLE_OFFLOAD=ON -DENABLE_CUDA=ON -DQMC_CUDA2HIP=ON -DHIP_ARCH=gfx90a ../
  3. full program/test invocation command: ctest -R converter_test_ -j 64 --output-on-failure

Expected behavior

All deterministic tests pass on crusher

System: crusher

  • modules loaded: https://github.com/QMCPACK/qmcpack/blob/863c1397eea6f257c8592df36e57915de7d7c695/config/build_olcf_crusher_afar.sh

quantumsteve, Sep 13 '22 14:09

What is the error message in stderr file?

ye-luo, Sep 13 '22 15:09