rcps-buildscripts
MPI and RMPISNOW alterations for R 4.1.1
IN: 04932896
- In the modulefile for r/4.1.1-openblas/gnu-10.2.0, this line needs to be changed:
setenv OMPI_MCA_btl tcp,sm,self
to
setenv OMPI_MCA_btl tcp,vader,self
(there is no sm BTL in OpenMPI 4). This setting may be able to be dropped entirely, because the mpirun wrapper for OpenMPI ought to be setting it correctly already. This was a problem on Myriad, with MPI within a single node.
- Check if RMPISNOW is giving this error for a user with no existing R profile and fix:
GERun: GErun command being run:
GERun: mpirun RMPISNOW
Loading required package: utils
Error: cannot add binding of '.MPIrun' to the base environment
Execution halted
and see https://stackoverflow.com/questions/68198277/error-cannot-add-binding-of-first-to-the-base-environment for the reason and the fix syntax. This was reported on Kathleen, in a 2-node job. It was fixed by copying RMPISNOW and RMPISNOW_profile to the user's home directory and modifying them based on that link.
According to https://cran.r-project.org/doc/manuals/r-devel/NEWS.html, R 4.1.0 had:
"The base environment and its namespace are now locked (so one can no longer add bindings to these or remove from these)."
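If I'm reading that answer and the NEWS entry right, the problem is that RMPISNOW_profile is evaluated as the (site) profile in the base environment, which R >= 4.1.0 locks, so its top-level assignments fail; the change is to create the binding in the global environment instead. A minimal sketch (the function body below is a placeholder, not the real snow code):
# Fails on R >= 4.1.0 when the profile is evaluated in the locked base environment:
# .MPIrun <- function() { ... }
# Works: create the binding in the global environment instead.
assign(".MPIrun", function() {
    # placeholder for the real start-up code in RMPISNOW_profile
    invisible(NULL)
}, envir = globalenv())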
Test jobs have been submitted with our RMPISNOW example to confirm the failure.
On Myriad with one node, I do get this error:
GERun: mpirun RMPISNOW
Loading required package: utils
Error: cannot add binding of '.MPIrun' to the base environment
I do not get a btl error at this stage. (Going by the comment in the module, that ought only to be a problem with single-node Rmpi on its own, which calls OpenMPI as a library and ignores mpirun - but the setting still needs updating.)
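(For reference, the Rmpi-only test referred to as test_rmpi.r below isn't reproduced here; it is along these lines - a sketch only, using the standard Rmpi calls:)
# Minimal Rmpi-only check: spawn workers, then report rank and size from each.
library(Rmpi)
mpi.spawn.Rslaves(nslaves = mpi.universe.size() - 1)  # one slot is the master under mpirun -np 1
print(mpi.remote.exec(paste("rank", mpi.comm.rank(), "of", mpi.comm.size())))
mpi.close.Rslaves()
mpi.quit()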
I'm checking if R 4.0.2 has this problem using the same example files that we've used before for testing.
R 4.0.2 (which uses Open MPI 3.1.5 and GNU 9.2.0) snow jobs on Myriad (12 cores) and Kathleen (80 cores/2 nodes) work without modifications.
R 4.1.1: if I run Rmpi on its own, with no mpirun, I get this error:
As of version 3.0.0, the "sm" BTL is no longer available in Open MPI.
Efficient, high-speed same-node shared memory communication support in
Open MPI is available in the "vader" BTL. To use the vader BTL, you
can re-run your job with:
mpirun --mca btl vader,self,... your_mpi_application
R 4.1.1: I get the error below when setting
export OMPI_MCA_btl='tcp,vader,self'
and running
mpirun -np 1 R CMD BATCH test_rmpi.r test_rmpi_${NSLOTS}_${JOB_ID}.out
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: node-f00a-001
Local PID: 291503
Peer hostname: node-f00a-001 ([[21106,1],0])
Source IP of socket: 10.128.17.101
Known IPs of peer:
10.34.4.101
10.128.25.101
192.168.122.1
Note this is on one node. It seems to be this issue, https://github.com/open-mpi/ompi/issues/5818, with tcp when there are multiple interfaces available, so it isn't sending and receiving to itself on the same one...
Setting btl_tcp_if_include to a particular interface or subnet might work.
It gives the error and then hangs until it runs out of wallclock time.
On that node:
10.128.17.101 is ib0
10.34.4.101 is eno1
10.128.25.101 is ib0:0
192.168.122.1 is virbr0
Maybe we want to tell it to use eno1, since IB is the storage network? (Again, this is local.)
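For example, something like this in the job script (a sketch; the subnet form is a guess from the eno1 address above):
# Restrict the TCP BTL to one interface so Open MPI doesn't pick a different
# interface when a rank connects back to itself on the same node.
export OMPI_MCA_btl='tcp,vader,self'
export OMPI_MCA_btl_tcp_if_include='eno1'
# or by subnet in CIDR form rather than interface name, e.g.:
# export OMPI_MCA_btl_tcp_if_include='10.34.0.0/16'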
I had better check what happens with 2 nodes on the Economics bit of Myriad.
Fails init:
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node-d97a-006:00462] *** An error occurred in MPI_Init
[node-d97a-006:00462] *** reported by process [2134441986,50]
[node-d97a-006:00462] *** on a NULL communicator
[node-d97a-006:00462] *** Unknown error
[node-d97a-006:00462] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-d97a-006:00462] *** and potentially your MPI job)
The 2-node Rmpi processes output these errors:
On the head node:
[1635856407.612162] [node-d97a-005:224742:0] wireup.c:910 UCX ERROR old: am_lane 0 wireup_lane 255 reachable_mds 0x1ab
[1635856407.791953] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR old: lane[0]: 0:posix/memory.0 md[0] <proxy> -> md[0]/posix am am_bw#0
[1635856407.830051] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR old: lane[1]: 12:cma/memory.0 md[7] -> md[7]/cma rma_bw#0
[1635856407.859858] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR old: lane[2]: 8:dc_mlx5/mlx5_0:1.0 md[5] -> md[5]/ib rma_bw#1
[1635856407.894855] [node-d97a-005:224742:0] wireup.c:910 UCX ERROR new: am_lane 0 wireup_lane 2 reachable_mds 0x1ab
[1635856407.954865] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR new: lane[0]: 0:posix/memory.0 md[0] <proxy> -> md[0]/posix am am_bw#0
[1635856407.987926] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR new: lane[1]: 13:knem/memory.0 md[8] -> md[8]/knem rma_bw#0
[1635856408.023856] [node-d97a-005:224742:0] wireup.c:918 UCX ERROR new: lane[2]: 7:rc_mlx5/mlx5_0:1.0 md[5] -> md[5]/ib rma_bw#1 wireup
[node-d97a-005:224742:0:225496] wireup.c:1055 Fatal: endpoint reconfiguration not supported yet
On the second node:
[node-d97a-006.myriad.ucl.ac.uk:00596] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[node-d97a-006.myriad.ucl.ac.uk:00596] [[32569,2],57] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
I think the 2-node problem is a similar device-choice issue.
For within-node Rmpi, setting these makes it happy and working:
export OMPI_MCA_btl='tcp,vader,self'
export OMPI_MCA_btl_tcp_if_include='eno1'
This does not fix it for the 2-node case.
Oh, the R module also sets
setenv OMPI_MCA_pml ob1
which ought to be ucx here, I am pretty sure. (ob1 is fine for single node; we set
--mca mtl '^psm2' -mca pml ucx -mca btl ^usnic
for multi-node OpenMPI 4 on Myriad.) I think the ob1 setting was just copied from the R 4.0.2 module file.
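For reference, those mpirun flags map onto the OMPI_MCA_* environment-variable form that the module and job scripts use like this - just the equivalent spelling, nothing new:
# Equivalent of --mca mtl '^psm2' -mca pml ucx -mca btl ^usnic as environment variables.
export OMPI_MCA_mtl='^psm2'
export OMPI_MCA_pml='ucx'
export OMPI_MCA_btl='^usnic'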
Still fails 2 node with ucx anyway.
We'll fix the RMPISNOW issues and come back to Rmpi alone at a later date.
Last bit of diagnosis for Rmpi, this time on Kathleen. I set the following (the first two lines so that nothing would be specified for those to start with):
export OMPI_MCA_btl=""
export OMPI_MCA_pml=""
export OMPI_MCA_plm_base_verbose=10
export OMPI_MCA_mtl_base_verbose=10
export OMPI_MCA_btl_base_verbose=10
BTL and MTL chosen:
[node-c11a-120:07530] select: initializing btl component usnic
[node-c11a-120:07530] btl:usnic: disqualifiying myself due to fi_getinfo(3) failure: No data available (-61)
[node-c11a-120:07530] select: init of component usnic returned failure
[node-c11a-120:07530] mca: base: close: component usnic closed
[node-c11a-120:07530] mca: base: close: unloading component usnic
[node-c11a-120:07530] select: initializing btl component tcp
[node-c11a-120:07530] select: init of component tcp returned success
[node-c11a-120:07530] select: initializing btl component vader
[node-c11a-120:07530] select: init of component vader returned failure
[node-c11a-120:07530] mca: base: close: component vader closed
[node-c11a-120:07530] mca: base: close: unloading component vader
[node-c11a-120:07530] select: initializing btl component self
[node-c11a-120:07530] select: init of component self returned success
[node-c11a-120:07530] mca: base: components_register: registering framework mtl components
[node-c11a-120:07530] mca: base: components_register: found loaded component psm2
[node-c11a-120:07530] mca: base: components_register: component psm2 register function successful
[node-c11a-120:07530] mca: base: components_register: found loaded component ofi
[node-c11a-120:07530] mca: base: components_register: component ofi register function successful
[node-c11a-120:07530] mca: base: components_open: opening mtl components
[node-c11a-120:07530] mca: base: components_open: found loaded component psm2
[node-c11a-120:07530] mca: base: components_open: component psm2 open function successful
[node-c11a-120:07530] mca: base: components_open: found loaded component ofi
[node-c11a-120:07530] mca: base: components_open: component ofi open function successful
[node-c11a-120:07530] mca:base:select: Auto-selecting mtl components
[node-c11a-120:07530] mca:base:select:( mtl) Querying component [psm2]
[node-c11a-120:07530] mca:base:select:( mtl) Query of component [psm2] set priority to 40
[node-c11a-120:07530] mca:base:select:( mtl) Querying component [ofi]
[node-c11a-120:07530] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[node-c11a-120:07530] mca:base:select:( mtl) Selected component [psm2]
[node-c11a-120:07530] mca: base: close: component ofi closed
[node-c11a-120:07530] mca: base: close: unloading component ofi
[node-c11a-120:07530] select: initializing mtl component psm2
[node-c11a-120:07530] select: init returned success
[node-c11a-120:07530] select: component psm2 selected
The second node has this extra line:
[node-c11a-133:50941] [[34508,2],39] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
The job's .e file has:
[node-c11a-120:05027] [[34508,0],0] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
[node-c11a-120:05027] [[34508,0],0] plm:rsh: final template argv:
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template> PATH=/shared/ucl/apps/openmpi/4.0.5/gnu-10.2.0/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/shared/ucl/apps/openmpi/4.0.5/gnu-10.2.0/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/shared/ucl/apps/openmpi/4.0.5/gnu-10.2.0/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /shared/ucl/apps/openmpi/4.0.5/gnu-10.2.0/bin/orted -mca ess "env" -mca ess_base_jobid "2261516288" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "node-c[2:11]a-120,node-c[2:11]a-133@0(2)" -mca orte_hnp_uri "2261516288.0;tcp://10.128.80.144,10.128.96.144:38336" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "2261516288.0;tcp://10.128.80.144,10.128.96.144:38336" -mca btl_base_verbose "10" -mca pml "" -mca mtl_base_verbose "10" -mca plm_base_verbose "10" -mca btl "" -mca pmix "^s1,s2,cray,isolated"
Starting server daemon at host "node-c11a-133"
Server daemon successfully started with task id "1.node-c11a-133"
Establishing /opt/geassist/bin/rshcommand session to host node-c11a-133.kathleen.ucl.ac.uk ...
[node-c11a-133:48613] mca: base: components_register: registering framework plm components
[node-c11a-133:48613] mca: base: components_register: found loaded component rsh
[node-c11a-133:48613] mca: base: components_register: component rsh register function successful
[node-c11a-133:48613] mca: base: components_open: opening plm components
[node-c11a-133:48613] mca: base: components_open: found loaded component rsh
[node-c11a-133:48613] mca: base: components_open: component rsh open function successful
[node-c11a-133:48613] mca:base:select: Auto-selecting plm components
[node-c11a-133:48613] mca:base:select:( plm) Querying component [rsh]
[node-c11a-133:48613] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node-c11a-133:48613] mca:base:select:( plm) Selected component [rsh]
[node-c11a-133:48613] [[34508,0],1] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
[node-c11a-120:05027] [[34508,0],0] complete_setup on job [34508,1]
[node-c11a-120:05027] [[34508,0],0] complete_setup on job [34508,2]
[node-c11a-120:05027] [[34508,0],0] plm:base:receive update proc state command from [[34508,0],1]
[node-c11a-120:05027] [[34508,0],0] plm:base:receive got update_proc_state for job [34508,2]
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node-c11a-133:51211] *** An error occurred in MPI_Init
[node-c11a-133:51211] *** reported by process [2261516290,76]
[node-c11a-133:51211] *** on a NULL communicator
The spawn on the second node is failing, I think.
(I keep finding this search result, https://github.com/open-mpi/ompi/issues/8938, which is for mpi4py and on Mellanox whereas Kathleen is OmniPath, but they decided to give up and use sockets instead of MPI spawn - which is what RMPISNOW does to set up Rmpi, if I recall.)
The build script that needs to be updated to include the RMPISNOW fixes is
../build_scripts/R-4.1.1_MPI_install
in the build scripts repo.
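In outline, the kind of step being added is to install patched copies of the two snow launcher files as part of the build - purely a hypothetical sketch here; the paths, variable names and mechanism in the real script may well differ:
# Hypothetical sketch only: drop modified copies of RMPISNOW and RMPISNOW_profile
# into the installed snow library. $R_INSTALL and patches/ are illustrative names.
SNOW_DIR="$R_INSTALL/lib64/R/library/snow"
install -m 755 patches/RMPISNOW "$SNOW_DIR/RMPISNOW"
install -m 644 patches/RMPISNOW_profile "$SNOW_DIR/RMPISNOW_profile"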
I'm now adding the fixes ...
The fixes have been added to the build scripts repo. Downloading to Kathleen and testing ...
I've run the updated R-4.1.1_MPI_install build script on Kathleen and submitted a snow test job on 80 cores.
My test job still fails, so I'm investigating what I've done wrong!