rcps-buildscripts

Install Request: OpenMPI 4.1.1

Open heatherkellyucl opened this issue 3 years ago • 7 comments

[IN:04819158], RCE-934.

Requested for Orca 5 binaries. (Could then also install those).

Thought this would be quick, but OpenMPI 4.1.x needs a newer libpsm2 than the one in our image to run on the OmniPath clusters. See #409, where we went back to 4.0.x instead.

Going to see whether a source build of PSM2 will work; it is supposed to be buildable from source on RHEL 7.2 onwards.

heatherkellyucl, Aug 17 '21 15:08

OpenMPI config looking promising!

configure:333774: checking if MCA component mtl:psm2 can compile
configure:333776: result: yes

The result before was

configure:333631: WARNING: PSM2 needs to be version 11.2.173 or later. Disabling MTL.
configure:334237: checking if MCA component mtl:psm2 can compile
configure:334239: result: no
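
To repeat this check after a rebuild, something like the following can pull the `mtl:psm2` result out of the configure log. This is a minimal sketch: the function name `psm2_mtl_result` is made up for illustration, and it assumes the log has the same two-line "checking ... / result: ..." layout as the excerpts above.

```shell
# Hypothetical helper (not part of the buildscripts): print "yes" or "no"
# depending on whether configure decided the PSM2 MTL can compile.
# Usage: psm2_mtl_result path/to/config.log
psm2_mtl_result() {
    log="${1:-config.log}"
    # The "result:" line is assumed to directly follow the "checking" line,
    # as in the log excerpts in this thread.
    awk '/checking if MCA component mtl:psm2 can compile/ {
             getline
             sub(/.*result: /, "")
             print
         }' "$log"
}
```

Run it from the OpenMPI build directory, for example `psm2_mtl_result config.log`.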

heatherkellyucl, Aug 17 '21 15:08

Working across two nodes on Thomas!

Install PSM2 on OmniPath clusters (on Myriad we're using UCX)

  • [x] Young
  • [x] Kathleen
  • [x] Thomas
  • [x] modulefile

Install OpenMPI 4.1.1 everywhere

  • [x] Young
  • [x] Kathleen
  • [x] Myriad
  • [x] Thomas
  • [x] modulefile

heatherkellyucl, Aug 18 '21 08:08

Myriad needs UCX 1.9.0 for OpenMPI 4.1.1 (bug in 1.8.0) to be able to run multi-node, changing to that.
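
A version guard along these lines could catch an older UCX before a multi-node job is attempted. It is only a sketch: `version_at_least` is a hypothetical helper, it relies on `sort -V` (GNU coreutils), and treating anything at or above 1.9.0 as acceptable is an assumption, since only 1.8.0 and 1.9.0 are mentioned here.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on the version-sort option of GNU sort.
version_at_least() {
    # The smaller of the two versions sorts first; if that is $2, then $1 >= $2.
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example: the version string would be supplied by hand or taken from the
# loaded ucx module name; it is hard-coded here for illustration.
if version_at_least "1.9.0" "1.9.0"; then
    echo "UCX version is new enough for OpenMPI 4.1.1"
fi
```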

heatherkellyucl, Aug 18 '21 10:08

Now running fine multi-node on Myriad too.

heatherkellyucl, Aug 18 '21 12:08

These modules are needed on the non-Myriad clusters:

module unload -f compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load psm2/11.2.185/gnu-4.9.2
module load mpi/openmpi/4.1.1/gnu-4.9.2

These are needed on Myriad:

module unload compilers mpi
module load compilers/gnu/4.9.2
module load numactl/2.0.12
module load binutils/2.29.1/gnu-4.9.2 
module load ucx/1.9.0/gnu-4.9.2 
module load mpi/openmpi/4.1.1/gnu-4.9.2
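
The two module lists could be folded into one helper for job scripts that run on more than one cluster. This is a sketch under assumptions: the function name `openmpi_411_modules` and the lowercase cluster-name argument are invented for illustration, and "non-Myriad" is taken to mean the OmniPath clusters named earlier (Young, Kathleen, Thomas).

```shell
# Hypothetical helper: print, in load order, the modules needed for
# mpi/openmpi/4.1.1/gnu-4.9.2 on the named cluster.
openmpi_411_modules() {
    case "$1" in
        myriad)
            # Myriad uses UCX rather than PSM2.
            transport="binutils/2.29.1/gnu-4.9.2 ucx/1.9.0/gnu-4.9.2" ;;
        young|kathleen|thomas)
            # OmniPath clusters use the source-built PSM2.
            transport="psm2/11.2.185/gnu-4.9.2" ;;
        *)
            echo "unknown cluster: $1" >&2
            return 1 ;;
    esac
    echo "compilers/gnu/4.9.2 numactl/2.0.12 $transport mpi/openmpi/4.1.1/gnu-4.9.2"
}

# Usage (commented out because it needs the environment-modules command):
#   module unload compilers mpi
#   for m in $(openmpi_411_modules myriad); do module load "$m"; done
```

Note that the non-Myriad list above uses `module unload -f`, so the unload step would need adjusting per cluster.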

heatherkellyucl, Aug 18 '21 13:08

Is not working across two nodes on Young...

node-c12m-005.22538PSM2 can't open hfi unit: -1 (err=23)
node-c12l-008.62402PSM2 can't open hfi unit: -1 (err=23)
node-c12m-005.22538hfi_userinit_internal: assign_context command failed: Device or resource busy
node-c12m-005.22538hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)

heatherkellyucl, Sep 02 '21 14:09

For now we have set OMPI_MCA_btl=vader in the modulefile for mpi/openmpi/4.1.1/gnu-4.9.2 on the OmniPath clusters so that it works multi-node, even if it is a bit slower than it would be with a different transport layer.
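
For testing outside the modulefile, the same setting can be applied in a shell session or job script. This is a sketch rather than the modulefile's actual contents; Open MPI reads MCA parameters from `OMPI_MCA_<name>` environment variables, and `./my_mpi_program` is a placeholder.

```shell
# Equivalent of the modulefile setting for the current shell session.
export OMPI_MCA_btl=vader

# Per-run alternative using mpirun's --mca option (placeholder program name):
#   mpirun --mca btl vader ./my_mpi_program
```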

heatherkellyucl, Jul 24 '23 16:07