GROMACS (GROMACS-2020-fosscuda-2019b.eb) fails to install due to failed HardwareTopologyTest
Hi all,
We are encountering an error on our HPC cluster (Bessemer) while compiling GROMACS using easybuild with the GROMACS-2020-fosscuda-2019b.eb easyconfig (and others).
The details for this cluster can be found here: https://docs.hpc.shef.ac.uk/en/latest/bessemer/cluster_specs.html
The docs should contain all the info needed; the EasyBuild version is 4.3.1.
The failure seems to be limited to hwloc, but the log output is not especially useful. I will attach our compile log file - could you assist?
In the short term I will likely attempt to bypass the testing portion of the compile and see what happens.
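For reference, the bypass I have in mind is the standard easyconfig skipsteps parameter - a minimal sketch of the amendment, assuming it is acceptable to skip the whole test step rather than just the one failing test:

# Sketch: amendment to GROMACS-2020-fosscuda-2019b.eb to skip the test step.
# skipsteps is a standard easyconfig parameter; note this disables the entire
# 'make check' run, not only the failing HardwareTopologyTest.
skipsteps = ['test']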
Pertinent bit:
[==========] Running 5 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from CpuInfoTest
[ RUN      ] CpuInfoTest.SupportLevel
[       OK ] CpuInfoTest.SupportLevel (0 ms)
[----------] 1 test from CpuInfoTest (0 ms total)
[----------] 4 tests from HardwareTopologyTest
[ RUN      ] HardwareTopologyTest.Execute
[       OK ] HardwareTopologyTest.Execute (116 ms)
[ RUN      ] HardwareTopologyTest.HwlocExecute
/scratch/1526855/slurm_build_1526855/GROMACS/2020/fosscuda-2019b/gromacs-2020/src/gromacs/hardware/tests/hardwaretopology.cpp:88: Failure
Expected: (hwTop.supportLevel()) >= (gmx::HardwareTopology::SupportLevel::Basic), actual: 4-byte object <01-00 00-00> vs 4-byte object <02-00 00-00>
Cannot determine basic hardware topology from hwloc. GROMACS will still
work, but it might affect your performance for large nodes.
Please mail [email protected] so we can try to fix it.
[  FAILED  ] HardwareTopologyTest.HwlocExecute (64 ms)
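For context, the failing assertion requires hwloc to report at least SupportLevel::Basic, and the <01> vs <02> bytes indicate the detected level came out one step below that (if the enum order matches the 2020 sources, a bare logical processor count rather than sockets/cores/threads). One way to see what hwloc itself detects on a compute node is a diagnostic sketch like the following, assuming hwloc's command-line utilities are available on the node:

# Diagnostic sketch (assumption: hwloc's CLI tools are on PATH on the
# compute node). If these cannot report packages/cores, GROMACS's
# hwloc-based detection will fail in the same way as the test above.
import subprocess

for cmd in (["hwloc-info"], ["lstopo-no-graphics", "--no-io"]):
    print("$ " + " ".join(cmd))
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout or result.stderr)
    except OSError as err:
        print("could not run:", err)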
Has anyone seen this before? Given that we're an x86-64 cluster and errors related to this typically involve other architectures, I am a bit stumped. OS: CentOS Linux release 7.9.2009 (Core)
Attaching the log: gromacscompilefailure.log
Check the GROMACS mailing list - I know I've seen it mentioned before, I just can't remember where.
Aye, I saw the following: https://www.mail-archive.com/[email protected]/msg40764.html
which matches my plan to compile while skipping that test.
Having now skipped the hwloc test, we are running into OpenMPI issues rather than thread-MPI ones (given that usempi: false is working): during the second iteration section of the build it fails to complete the MPI tests correctly - line 29043:
PMI2_Init failed to intialize. Return code: 14
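The PMI2_Init failure suggests srun and the MPI library are not agreeing on a process-management interface. One quick sanity check is whether the toolchain's OpenMPI was actually built with Slurm/PMI support - a small sketch, assuming ompi_info from the loaded OpenMPI module is on PATH:

# Sketch: list the PMI-related components of the loaded Open MPI
# (assumption: `ompi_info` from the OpenMPI module is on PATH).
import subprocess

out = subprocess.run(["ompi_info", "--parsable"],
                     capture_output=True, text=True, check=True)
pmi = [line for line in out.stdout.splitlines() if "pmi" in line.lower()]
print("\n".join(pmi) if pmi else "no PMI-related components found")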
Our config for the OpenMPI in question, as we are using Slurm: https://github.com/RSE-Sheffield/bessemer-eb-cfg/blob/staging/etc/bessemer/easyconfigs/o/OpenMPI/OpenMPI-3.1.4-gcccuda-2019b.eb
Given that the hwloc version installed on the cluster differs from the one used in the build, that mismatch may be the cause of the hwloc issue, but we are clearly also having issues with MPI.
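For comparison, the Slurm/PMI/hwloc-related part of an OpenMPI easyconfig usually boils down to a few configure options. A sketch of the relevant lines follows; the /usr prefix for Slurm's PMI files is an assumption, and our actual values are in the file linked above:

# Sketch of Slurm/PMI/hwloc-related configure options in an OpenMPI
# easyconfig (assumed paths; see the linked easyconfig for real values).
configopts = '--with-slurm '                # enable Slurm resource manager support
configopts += '--with-pmi=/usr '            # Slurm's PMI/PMI2 headers and libs (assumed prefix)
configopts += '--with-hwloc=$EBROOTHWLOC '  # build against EasyBuild's hwloc, not the OS copy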
We are also seeing similar issues when building Rmpi, although these may be unrelated - https://www.mail-archive.com/[email protected]/msg05558.html
Our build log comes from forcing a rebuild with: srun --export=all -c 4 eb -drl --rebuild GROMACS-2020-fosscuda-2019b.eb