rcps-buildscripts
Install Request: OpenMPI 4.1.1
[IN:04819158], RCE-934.
Requested for Orca 5 binaries. (Could then also install those).
I thought this would be quick, but OpenMPI 4.1.x needs a newer libpsm2 than the one in our image to be able to run on the OmniPath clusters. See #409, where we went back to 4.0.x instead.
Going to see if a source build of PSM2 will work: it is supposed to be buildable from source on RHEL 7.2 onwards.
OpenMPI config looking promising!
configure:333774: checking if MCA component mtl:psm2 can compile
configure:333776: result: yes
The result before was:
configure:333631: WARNING: PSM2 needs to be version 11.2.173 or later. Disabling MTL.
configure:334237: checking if MCA component mtl:psm2 can compile
configure:334239: result: no
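To compare the two configure outcomes above without scrolling through the whole log, something like the following could be used. This is only a sketch: `psm2_mtl_result` is a made-up helper name, and it assumes the result line directly follows the "checking" line in `config.log`, as it does in the excerpts here.

```shell
# psm2_mtl_result is a hypothetical helper, not part of Open MPI.
# It prints the last "result:" value that follows the mtl:psm2 compile
# check in a configure log (default: ./config.log).
psm2_mtl_result() {
    local log="${1:-config.log}"
    grep -A1 'checking if MCA component mtl:psm2 can compile' "$log" \
        | sed -n 's/.*result: //p' \
        | tail -n 1
}
```

Calling `psm2_mtl_result path/to/config.log` in the OpenMPI build directory should then print `yes` or `no`.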
Working across two nodes on Thomas!
Install PSM2 on OmniPath clusters (on Myriad we're using UCX)
- [x] Young
- [x] Kathleen
- [x] Thomas
- [x] modulefile
Install OpenMPI 4.1.1 everywhere
- [x] Young
- [x] Kathleen
- [x] Myriad
- [x] Thomas
- [x] modulefile
Myriad needs UCX 1.9.0 for OpenMPI 4.1.1 to be able to run multi-node (there is a bug in 1.8.0), so changing to that.
Now running fine multi-node on Myriad too.
These modules are needed on the clusters other than Myriad:
module unload -f compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load psm2/11.2.185/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2
These are needed on Myriad:
module unload compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load binutils/2.29.1/gnu-4.9.2
module load ucx/1.9.0/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2
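For job scripts that have to run on more than one cluster, the two module lists above could be wrapped in a small function. This is a rough sketch only: `openmpi411_modules` is a hypothetical name, and the cluster is passed in as an argument because how a script should detect which cluster it is on is not covered here.

```shell
# openmpi411_modules is a hypothetical helper. It only prints the module
# commands listed above for the named cluster; it does not run them.
openmpi411_modules() {
    case "$1" in
        myriad)
            # Myriad uses UCX rather than PSM2.
            echo "module unload compilers mpi"
            echo "module load compilers/gnu/4.9.2"
            echo "module load numactl/2.0.12"
            echo "module load binutils/2.29.1/gnu-4.9.2"
            echo "module load ucx/1.9.0/gnu-4.9.2"
            ;;
        young|kathleen|thomas)
            # OmniPath clusters use the source-built PSM2.
            echo "module unload -f compilers mpi"
            echo "module load compilers/gnu/4.9.2"
            echo "module load numactl/2.0.12"
            echo "module load psm2/11.2.185/gnu-4.9.2"
            ;;
        *)
            echo "unknown cluster: $1" >&2
            return 1
            ;;
    esac
    echo "module load mpi/openmpi/4.1.1/gnu-4.9.2"
}
```

In a job script the output could be applied with `eval "$(openmpi411_modules myriad)"`.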
It is not working across two nodes on Young, though:
node-c12m-005.22538PSM2 can't open hfi unit: -1 (err=23)
node-c12l-008.62402PSM2 can't open hfi unit: -1 (err=23)
node-c12m-005.22538hfi_userinit_internal: assign_context command failed: Device or resource busy
node-c12m-005.22538hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
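When the job output is long, it can help to pull out just the nodes that report this failure. A minimal sketch, assuming the output lines have the same `hostname.pid` prefix as the ones above (`psm2_failed_nodes` is a made-up helper name):

```shell
# psm2_failed_nodes is a hypothetical helper. It reads job output from the
# given files (or stdin) and prints each hostname that logged
# "PSM2 can't open hfi unit", once per host.
psm2_failed_nodes() {
    sed -n "s/^\([^.]*\)\.[0-9]*PSM2 can't open hfi unit.*/\1/p" "$@" | sort -u
}
```

For example, `psm2_failed_nodes job.out`, where `job.out` stands in for whatever file the job's stderr was written to.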
For now we have set OMPI_MCA_btl=vader in the modulefile for mpi/openmpi/4.1.1/gnu-4.9.2 on the OmniPath clusters so that it works multi-node, even if it is a bit slower than it would be with a different transport layer.
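For testing outside the modulefile, the same setting can be applied to a single shell session. A minimal sketch: Open MPI picks up MCA parameters from environment variables named `OMPI_MCA_<param>`, so exporting the variable by hand should have the same effect as the modulefile entry.

```shell
# Per-session equivalent of the modulefile setting described above.
# Open MPI reads MCA parameters from OMPI_MCA_<param> environment variables.
export OMPI_MCA_btl=vader

# Show what is currently set, as a quick sanity check before launching a job.
echo "OMPI_MCA_btl=${OMPI_MCA_btl}"
```

The same parameter can also be given for one run on the command line, e.g. `mpirun --mca btl vader ./my_mpi_program` (the program name is a placeholder).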