v5.x failure beyond 94 nodes
Background information
Running installation tests on a cluster: v5 (release or built from GitHub) works at up to 94 nodes, then fails instantly beyond that. v4 works fine. N.B. this is running a job from a login node via SLURM (salloc + mpiexec).
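The launch pattern is roughly as follows (partition name, node count, and the launched program are illustrative, not the exact commands used):
# allocate from the login node, then launch; mpiexec itself runs on the login node
salloc -p all-nodes -N 95
mpiexec -N 1 hostname   # one process per node; fails once the allocation exceeds 94 nodes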
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
main @ 448c3ba2d1b8dced090e5aefb7ccb07588613bcd
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
source / git
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
e32e0179bc6bd1637f92690511ce6091719fa046 3rd-party/openpmix (v1.1.3-4036-ge32e0179)
0f0a90006cbc880d499b2356d6076e785e7868ba 3rd-party/prrte (psrvr-v2.0.0rc1-4819-g0f0a90006c)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main-1-gdfff675)
Please describe the system on which you are running
- Operating system/version: Rocky 8.4
- Computer hardware: aarch64
- Network type: IB
Details of the problem
Beyond 94 nodes:
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.
HNP daemon : [prterun-login2-2232463@0,0] on node login2
Remote daemon: [prterun-login2-2232463@0,28] on node fj094
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
Could you try running
mpirun --debug-daemons -np 95 -N 1 hostname
that may help provide some more info for triaging.
Er, Howard, do you mean run 95 ranks on 1 node? Or run hostname on 1 rank on 95 nodes?
Anyway:
salloc -p all-nodes -N 95 mpirun --debug-daemons hostname
The mpirun cmd line will execute one instance of hostname on each of 95 nodes. The --debug-daemons flag will hold the stdout/stderr connections open between the daemons so you can see any error messages. I'm not sure if that salloc command will do the same thing, but I'll take a gander at the output.
Looks like the daemons are unable to send a message back to mpirun - maybe there is an issue with their choice of transport? You might need to specify the network they should use.
Well, it's an mpirun inside a salloc. Note this means mpirun is running on the launch (in this case, login) node, not inside the job per se, i.e. an interactive test, not batch. Tried setting UCX_TLS=rc,self explicitly, but same result. 4.1.6 has no issues.
If I run an interactive job and then mpirun from inside a compute node, it works with v5.
UCX has nothing to do with it - not used by the runtime. Might be a difference in how we select transports between OMPI v4 and v5, but I can't say for sure. I'll have to ponder what might be going on - not hearing of scaling issues elsewhere.
Well, I didn't think it was UCX related, but did due diligence. I'm pretty sure I have reported this same issue either for ompi or pmix in the past.
FWIW: the reason I suspect a network connection issue is due to your observations that all works fine if you execute from a compute node, but you hit a problem if launching from the login node. The RTE doesn't know or care about that difference, but we do see things frequently getting into trouble because the login node requires that you use a "management interface" and the daemons on the compute node don't know they should preferentially select it. So the daemon can't find a way to "phone home" and things fail.
Your output looks exactly like that situation, and your reports tend to support it. Are you setting "if_include" or "if_exclude" params somewhere (default param files, environment)? There was a bug in PRRTE that made it ignore those settings until a recent commit, which would explain why OMPI v4.x might work but v5.x doesn't. You might update the PMIx and PRRTE submodules to see if that fixes the problem.
One catch: the PRRTE submodule is pointing at an OMPI fork of the upstream PRRTE and typically runs some distance behind the upstream repo. I would recommend re-targeting it at the upstream and then "pull" to get a full update.
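A rough sketch of that re-targeting, assuming upstream PRRTE lives at github.com/openpmix/prrte and the fix is on its default branch (adjust as needed, and rebuild afterwards):
cd 3rd-party/prrte
git remote add upstream https://github.com/openpmix/prrte.git   # assumed upstream location
git fetch upstream
git checkout upstream/master        # or whichever branch carries the if_include/if_exclude fix
cd ../..
./autogen.pl    # then re-run configure and make install as usual to rebuild against the updated submodule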
I'll give the if_include etc a go to see if that identifies the issue. Seems possible.
Yeah, that seems to be it. Excluded the external interface and the program runs.
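For anyone else who lands here, the exclusion looks roughly like this (the interface name eno1 and the exact param name are placeholders, not what was actually used; check prte_info --all on your build for the real names):
mpirun --prtemca oob_tcp_if_exclude eno1 -N 1 hostname
# or set it via the environment / a default param file:
export PRTE_MCA_oob_tcp_if_exclude=eno1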
All submodules are up to date, BTW
Hooray!! Glad it solved the problem!
This is one of those "obviously an interface choice in hindsight" things. Testing the same procedure with the 5.0.5 release.
I'm afraid that won't work - the "if_include/exclude" patch isn't in any OMPI release yet.
Ah. Well. Fortunately I/we have kept our ompi release / default version for users at 4.1.6. So this isn't going to cause issues for 99.9% of them. I'll let things percolate through and catch up before moving the cluster install of ompi to v5.
thanks!
I'll close this issue, then.