rcps-buildscripts
MPI and RMPISNOW alterations for R 4.1.1
IN: 04932896
- In the modulefile for r/4.1.1-openblas/gnu-10.2.0, this line needs to be changed:
setenv OMPI_MCA_btl tcp,sm,self
to
setenv OMPI_MCA_btl tcp,vader,self
(there is no sm BTL in OpenMPI 4). This setting may be able to be dropped entirely, because the mpirun wrapper for OpenMPI ought to be setting it correctly already. This was a problem on Myriad, with MPI within a single node.
- Check if RMPISNOW is giving this error for a user with no existing R profile and fix:
GERun: GErun command being run:
GERun: mpirun RMPISNOW
Loading required package: utils
Error: cannot add binding of '.MPIrun' to the base environment
Execution halted
and see https://stackoverflow.com/questions/68198277/error-cannot-add-binding-of-first-to-the-base-environment for the reason and the fix syntax. This was reported on Kathleen, in a 2-node job. It was fixed by copying RMPISNOW and RMPISNOW_profile to the user's home directory and modifying them based on that link.
According to https://cran.r-project.org/doc/manuals/r-devel/NEWS.html, R 4.1.0 had:
"The base environment and its namespace are now locked (so one can no longer add bindings to these or remove from these)."
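If I'm reading that answer and the NEWS entry right, the problem is that RMPISNOW_profile is evaluated as the (site) profile in the base environment, which R >= 4.1.0 locks, so its top-level assignments fail; the change is to create the binding in the global environment instead. A minimal sketch (the function body below is a placeholder, not the real snow code):
# Fails on R >= 4.1.0 when the profile is evaluated in the locked base environment:
# .MPIrun <- function() { ... }
# Works: create the binding in the global environment instead.
assign(".MPIrun", function() {
    # placeholder for the real start-up code in RMPISNOW_profile
    invisible(NULL)
}, envir = globalenv())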
Test jobs have been submitted with our RMPISNOW example to confirm the failure.
On Myriad with one node, I do get this error:
GERun: mpirun RMPISNOW
Loading required package: utils
Error: cannot add binding of '.MPIrun' to the base environment
I do not get a btl error at this stage. (Going by the comment in the module, that ought only to be a problem with single-node Rmpi on its own, which calls OpenMPI as a library and ignores mpirun - but the setting still needs updating.)
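(For reference, the Rmpi-only test referred to as test_rmpi.r below isn't reproduced here; it is along these lines - a sketch only, using the standard Rmpi calls:)
# Minimal Rmpi-only check: spawn workers, then report rank and size from each.
library(Rmpi)
mpi.spawn.Rslaves(nslaves = mpi.universe.size() - 1)  # one slot is the master under mpirun -np 1
print(mpi.remote.exec(paste("rank", mpi.comm.rank(), "of", mpi.comm.size())))
mpi.close.Rslaves()
mpi.quit()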
I'm checking if R 4.0.2 has this problem using the same example files that we've used before for testing.
R 4.0.2 (which uses Open MPI 3.1.5 and GNU 9.2.0) snow jobs on Myriad (12 cores) and Kathleen (80 cores/2 nodes) work without modifications.
R 4.1.1: if I run Rmpi on its own, with no mpirun, I get this error:
As of version 3.0.0, the "sm" BTL is no longer available in Open MPI.
Efficient, high-speed same-node shared memory communication support in
Open MPI is available in the "vader" BTL. To use the vader BTL, you
can re-run your job with:
mpirun --mca btl vader,self,... your_mpi_application
R 4.1.1: I get the error below when setting
export OMPI_MCA_btl='tcp,vader,self'
and running
mpirun -np 1 R CMD BATCH test_rmpi.r test_rmpi_${NSLOTS}_${JOB_ID}.out
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: node-f00a-001
Local PID: 291503
Peer hostname: node-f00a-001 ([[21106,1],0])
Source IP of socket: 10.128.17.101
Known IPs of peer:
10.34.4.101
10.128.25.101
192.168.122.1
Note this is on one node. It seems to be this issue, https://github.com/open-mpi/ompi/issues/5818, with tcp when there are multiple interfaces available, so it isn't sending and receiving to itself on the same one...
Setting btl_tcp_if_include to a particular interface or subnet might work.
It gives the error and then hangs until it runs out of wallclock time.
On that node:
10.128.17.101 is ib0
10.34.4.101 is eno1
10.128.25.101 is ib0:0
192.168.122.1 is virbr0
Maybe we want to tell it to use eno1, since IB is the storage network? (Again, this is local.)
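For example, something like this in the job script (a sketch; the subnet form is a guess from the eno1 address above):
# Restrict the TCP BTL to one interface so Open MPI doesn't pick a different
# interface when a rank connects back to itself on the same node.
export OMPI_MCA_btl='tcp,vader,self'
export OMPI_MCA_btl_tcp_if_include='eno1'
# or by subnet in CIDR form rather than interface name, e.g.:
# export OMPI_MCA_btl_tcp_if_include='10.34.0.0/16'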
I had better check what happens with 2 nodes on the Economics bit of Myriad.
Fails init:
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node-d97a-006:00462] *** An error occurred in MPI_Init
[node-d97a-006:00462] *** reported by process [2134441986,50]
[node-d97a-006:00462] *** on a NULL communicator
[node-d97a-006:00462] *** Unknown error
[node-d97a-006:00462] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-d97a-006:00462] *** and potentially your MPI job)
The 2-node Rmpi processes output these errors:
On the head node:
[1635856407.612162] [node-d97a-005:224742:0] wireup.c:910 UCX ERROR old: am_lane 0 wireup_lane 255 reachable_mds 0x1ab
[1635856407.791953] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR old: lane[0]: 0:posix/memory.0 md[0] <proxy> -> md[0]/posix am am_bw#0
[1635856407.830051] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR old: lane[1]: 12:cma/memory.0 md[7] -> md[7]/cma rma_bw#0
[1635856407.859858] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR old: lane[2]: 8:dc_mlx5/mlx5_0:1.0 md[5] -> md[5]/ib rma_bw#1
[1635856407.894855] [node-d97a-005:224742:0] wireup.c:910 UCX ERROR new: am_lane 0 wireup_lane 2 reachable_mds 0x1ab
[1635856407.954865] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR new: lane[0]: 0:posix/memory.0 md[0] <proxy> -> md[0]/posix am am_bw#0
[1635856407.987926] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR new: lane[1]: 13:knem/memory.0 md[8] -> md[8]/knem rma_bw#0
[1635856408.023856] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR new: lane[2]: 7:rc_mlx5/mlx5_0:1.0 md[5] -> md[5]/ib rma_bw#1 wireup
[node-d97a-005:224742:0:225496] wireup.c:1055 Fatal: endpoint reconfiguration not supported yet
On the second node:
[node-d97a-006.myriad.ucl.ac.uk:00596] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[node-d97a-006.myriad.ucl.ac.uk:00596] [[32569,2],57] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
I think the 2-node problem is a similar device-choice issue.
For within-node Rmpi, setting these makes it happy and working:
export OMPI_MCA_btl='tcp,vader,self'
export OMPI_MCA_btl_tcp_if_include='eno1'
This does not fix it for the 2-node case.
Oh, the R module also sets
setenv OMPI_MCA_pml ob1
which ought to be ucx here, I am pretty sure. (ob1 is fine for single node; we set
--mca mtl '^psm2' -mca pml ucx -mca btl ^usnic
for multi-node OpenMPI 4 on Myriad.) I think the ob1 setting was just copied from the R 4.0.2 module file.
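For reference, those mpirun flags map onto the OMPI_MCA_* environment-variable form that the module and job scripts use like this - just the equivalent spelling, nothing new:
# Equivalent of --mca mtl '^psm2' -mca pml ucx -mca btl ^usnic as environment variables.
export OMPI_MCA_mtl='^psm2'
export OMPI_MCA_pml='ucx'
export OMPI_MCA_btl='^usnic'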
Still fails 2 node with ucx anyway.
We'll fix the RMPISNOW issues and come back to Rmpi alone at a later date.
Last bit of diagnosis for Rmpi, this time on Kathleen. I set the following (the first two lines so that nothing would be specified for those to start with):
export OMPI_MCA_btl=""
export OMPI_MCA_pml=""
export OMPI_MCA_plm_base_verbose=10
export OMPI_MCA_mtl_base_verbose=10
export OMPI_MCA_btl_base_verbose=10
BTL and MTL chosen:
[node-c11a-120:07530] select: initializing btl component usnic
[node-c11a-120:07530] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[node-c11a-120:07530] select: init of component usnic returned failure
[node-c11a-120:07530] mca: base: close: component usnic closed
[node-c11a-120:07530] mca: base: close: unloading component usnic
[node-c11a-120:07530] select: initializing btl component tcp
[node-c11a-120:07530] select: init of component tcp returned success
[node-c11a-120:07530] select: initializing btl component vader
[node-c11a-120:07530] select: init of component vader returned failure
[node-c11a-120:07530] mca: base: close: component vader closed
[node-c11a-120:07530] mca: base: close: unloading component vader
[node-c11a-120:07530] select: initializing btl component self
[node-c11a-120:07530] select: init of component self returned success
[node-c11a-120:07530] mca: base: components_register: registering framework mtl components
[node-c11a-120:07530] mca: base: components_register: found loaded component psm2
[node-c11a-120:07530] mca: base: components_register: component psm2 register function successful
[node-c11a-120:07530] mca: base: components_register: found loaded component ofi
[node-c11a-120:07530] mca: base: components_register: component ofi register function successful
[node-c11a-120:07530] mca: base: components_open: opening mtl components
[node-c11a-120:07530] mca: base: components_open: found loaded component psm2
[node-c11a-120:07530] mca: base: components_open: component psm2 open function successful
[node-c11a-120:07530] mca: base: components_open: found loaded component ofi
[node-c11a-120:07530] mca: base: components_open: component ofi open function successful
[node-c11a-120:07530] mca:base:select: Auto-selecting mtl components
[node-c11a-120:07530] mca:base:select:( mtl) Querying component [psm2]
[node-c11a-120:07530] mca:base:select:( mtl) Query of component [psm2] set priority to 40
[node-c11a-120:07530] mca:base:select:( mtl) Querying component [ofi]
[node-c11a-120:07530] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[node-c11a-120:07530] mca:base:select:( mtl) Selected component [psm2]
[node-c11a-120:07530] mca: base: close: component ofi closed
[node-c11a-120:07530] mca: base: close: unloading component ofi
[node-c11a-120:07530] select: initializing mtl component psm2
[node-c11a-120:07530] select: init returned success
[node-c11a-120:07530] select: component psm2 selected
The second node has this extra line:
[node-c11a-133:50941] [[34508,2],39] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
The job's .e file has:
[node-c11a-120:05027] [[34508,0],0] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
[node-c11a-120:05027] [[34508,0],0] plm:rsh: final template argv:
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template> PATH=/shared/ucl/apps/openmpi/4.0.5/gnu-10.2.0/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/shared/ucl/apps/openmpi/4.0.5/gnu-10.2.0/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/shared/ucl/apps/openmpi/4.0.5/gnu-10.2.0/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /shared/ucl/apps/openmpi/4.0.5/gnu-10.2.0/bin/orted -mca ess "env" -mca ess_base_jobid "2261516288" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "node-c[2:11]a-120,node-c[2:11]a-133@0(2)" -mca orte_hnp_uri "2261516288.0;tcp://10.128.80.144,10.128.96.144:38336" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "2261516288.0;tcp://10.128.80.144,10.128.96.144:38336" -mca btl_base_verbose "10" -mca pml "" -mca mtl_base_verbose "10" -mca plm_base_verbose "10" -mca btl "" -mca pmix "^s1,s2,cray,isolated"
Starting server daemon at host "node-c11a-133"
Server daemon successfully started with task id "1.node-c11a-133"
Establishing /opt/geassist/bin/rshcommand session to host node-c11a-133.kathleen.ucl.ac.uk ...
[node-c11a-133:48613] mca: base: components_register: registering framework plm components
[node-c11a-133:48613] mca: base: components_register: found loaded component rsh
[node-c11a-133:48613] mca: base: components_register: component rsh register function successful
[node-c11a-133:48613] mca: base: components_open: opening plm components
[node-c11a-133:48613] mca: base: components_open: found loaded component rsh
[node-c11a-133:48613] mca: base: components_open: component rsh open function successful
[node-c11a-133:48613] mca:base:select: Auto-selecting plm components
[node-c11a-133:48613] mca:base:select:( plm) Querying component [rsh]
[node-c11a-133:48613] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node-c11a-133:48613] mca:base:select:( plm) Selected component [rsh]
[node-c11a-133:48613] [[34508,0],1] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
[node-c11a-120:05027] [[34508,0],0] complete_setup on job [34508,1]
[node-c11a-120:05027] [[34508,0],0] complete_setup on job [34508,2]
[node-c11a-120:05027] [[34508,0],0] plm:base:receive update proc state command from [[34508,0],1]
[node-c11a-120:05027] [[34508,0],0] plm:base:receive got update_proc_state for job [34508,2]
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node-c11a-133:51211] *** An error occurred in MPI_Init
[node-c11a-133:51211] *** reported by process [2261516290,76]
[node-c11a-133:51211] *** on a NULL communicator
The spawn on the second node is failing, I think.
(I keep finding this search result, https://github.com/open-mpi/ompi/issues/8938, which is for mpi4py and on Mellanox whereas Kathleen is OmniPath, but they decided to give up and use sockets instead of MPI spawn - which is what RMPISNOW does to set up Rmpi, if I recall.)
The build script that needs to be updated to include the RMPISNOW fixes is
../build_scripts/R-4.1.1_MPI_install
in the build scripts repo.
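In outline, the kind of step being added is to install patched copies of the two snow launcher files as part of the build - purely a hypothetical sketch here; the paths, variable names and mechanism in the real script may well differ:
# Hypothetical sketch only: drop modified copies of RMPISNOW and RMPISNOW_profile
# into the installed snow library. $R_INSTALL and patches/ are illustrative names.
SNOW_DIR="$R_INSTALL/lib64/R/library/snow"
install -m 755 patches/RMPISNOW "$SNOW_DIR/RMPISNOW"
install -m 644 patches/RMPISNOW_profile "$SNOW_DIR/RMPISNOW_profile"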
I'm now adding the fixes ...
The fixes have been added to the build scripts repo. Downloading to Kathleen and testing ...
I've run the updated R-4.1.1_MPI_install build script on Kathleen and submitted a snow test job on 80 cores.
My test job still fails, so I'm investigating what I've done wrong!