
v5.x failure beyond 94 nodes

tonycurtis opened this issue on Oct 19 '24

Thank you for taking the time to submit an issue!

Background information

Running installation tests on a cluster: v5 (release or built from GitHub) works on up to 94 nodes, then fails instantly beyond that. v4 works fine. N.B. this is running a job from a login node via SLURM (salloc + mpiexec).

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

main @ 448c3ba2d1b8dced090e5aefb7ccb07588613bcd

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

source / git

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 e32e0179bc6bd1637f92690511ce6091719fa046 3rd-party/openpmix (v1.1.3-4036-ge32e0179)
 0f0a90006cbc880d499b2356d6076e785e7868ba 3rd-party/prrte (psrvr-v2.0.0rc1-4819-g0f0a90006c)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main-1-gdfff675)

Please describe the system on which you are running

  • Operating system/version: Rocky 8.4
  • Computer hardware: aarch64
  • Network type: IB

Details of the problem

Beyond 94 nodes, the job fails with:

--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-login2-2232463@0,0] on node login2
  Remote daemon: [prterun-login2-2232463@0,28] on node fj094

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

tonycurtis (Oct 19 '24)

Could you try running

mpirun --debug-daemons -np 95 -N 1 hostname

that may help provide some more info for triaging.

hppritcha (Oct 19 '24)

Er, Howard, do you mean run 95 ranks on 1 node? Or run hostname on 1 rank on 95 nodes?

Anyway:

salloc -p all-nodes -N 95  mpirun --debug-daemons hostname

out.txt

tonycurtis (Oct 19 '24)

The mpirun cmd line will execute one instance of hostname on each of 95 nodes. The --debug-daemons flag will hold the stdout/stderr connections open between the daemons so you can see any error messages. I'm not sure if that salloc command will do the same thing, but will take a gander at the output.

rhc54 (Oct 20 '24)
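For reference, the two-step interactive form of what Howard asked for would look roughly like the following sketch (partition name taken from the salloc line above; exact flags may vary by site):

salloc -p all-nodes -N 95
mpirun --debug-daemons -np 95 -N 1 hostname

The first line grabs a 95-node allocation and drops you into a shell on the login node; the second runs one instance of hostname per node, with the daemons' stdio held open so their error messages are visible.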

Looks like the daemons are unable to send a message back to mpirun - maybe there is an issue with their choice of transport? You might need to specify the network they should use.

rhc54 (Oct 20 '24)

Well, it's an mpirun inside a salloc. Note this means mpirun is running on the launch (in this case, login) node, not inside the job per se, i.e. an interactive test, not batch. I tried setting UCX_TLS=rc,self explicitly but got the same result. 4.1.6 has no issues.

If I run an interactive job and then mpirun from inside a compute node, it works with v5.

tonycurtis (Oct 21 '24)

UCX has nothing to do with it - not used by the runtime. Might be a difference in how we select transports between OMPI v4 and v5, but I can't say for sure. I'll have to ponder what might be going on - not hearing of scaling issues elsewhere.

rhc54 (Oct 21 '24)

Well, I didn't think it was UCX related, but did due diligence. I'm pretty sure I have reported this same issue either for ompi or pmix in the past.

tonycurtis (Oct 21 '24)

FWIW: the reason I suspect a network connection issue is due to your observations that all works fine if you execute from a compute node, but you hit a problem if launching from the login node. The RTE doesn't know or care about that difference, but we do see things frequently getting into trouble because the login node requires that you use a "management interface" and the daemons on the compute node don't know they should preferentially select it. So the daemon can't find a way to "phone home" and things fail.

Your output looks exactly like that situation, and your reports tend to support it. Are you setting "if_include" or "if_exclude" params somewhere (default param files, environment)? There was a bug in PRRTE that made it ignore those settings until a recent commit, which would explain why OMPI v4.x might work but v5.x doesn't. You might update the PMIx and PRRTE submodules to see if that fixes the problem.

One catch: the PRRTE submodule is pointing at an OMPI fork of the upstream PRRTE and typically runs some distance behind the upstream repo. I would recommend re-targeting it at the upstream and then "pull" to get a full update.
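Roughly, that re-targeting might look like the following sketch (upstream URL and branch name are assumptions; check the openpmix/prrte repo for the branch you actually want):

cd 3rd-party/prrte
git remote add upstream https://github.com/openpmix/prrte.git   # assumed upstream location
git fetch upstream
git checkout upstream/master                                     # branch name may differ
cd ../..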

rhc54 (Oct 21 '24)
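For anyone hitting the same wall, a minimal sketch of pinning the runtime's TCP wire-up to one interface. The parameter names assume PRRTE's oob/tcp component, and the interface names are placeholders for whatever the cluster actually has:

mpirun --prtemca oob_tcp_if_include ib0 -np 95 -N 1 hostname      # only allow the cluster-internal interface
mpirun --prtemca oob_tcp_if_exclude eno1,lo -np 95 -N 1 hostname  # or drop the external/management one instead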

I'll give the if_include etc a go to see if that identifies the issue. Seems possible.

tonycurtis (Oct 21 '24)

Yeah, that seems to be it. Excluded the external interface, and the program runs.

tonycurtis (Oct 21 '24)
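If the exclusion needs to stick for every run, one option is to export it before launching (a sketch: the PRTE_MCA_ prefix follows PRRTE's MCA environment-variable convention, and the interface names are placeholders):

export PRTE_MCA_oob_tcp_if_exclude=eno1,lo

or put the equivalent setting in a default MCA params file.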

All submodules are up to date, BTW

tonycurtis (Oct 21 '24)

Hooray!! Glad it solved the problem!

rhc54 (Oct 21 '24)

This is one of those "obviously an interface choice" in hindsight things. Testing the same procedure with the 5.0.5 release.

tonycurtis (Oct 21 '24)

I'm afraid that won't work - the "if_include/exclude" patch isn't in any OMPI release yet.

rhc54 (Oct 21 '24)

Ah. Well. Fortunately I/we have kept our ompi release / default version for users at 4.1.6. So this isn't going to cause issues for 99.9% of them. I'll let things percolate through and catch up before moving the cluster install of ompi to v5.

thanks!

tonycurtis (Oct 21 '24)

I'll close this issue, then.

tonycurtis (Oct 21 '24)