
MPI 4.0.5 hangs

Open TerryInFab opened this issue 2 years ago • 3 comments

Hi,

I've installed Open MPI 4.0.5 on KylinOS (x86_64). The output of "uname -a" is: Linux iZ2zee1iwih8aq4osgv4z6Z 4.19.90-24.4.v2101.ky10.x86_64 #1 SMP Mon May 24 12:14:55 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

When I run mpirun --allow-run-as-root --bind-to none -n 3 -host 101.200.199.85:3 ls /tmp, it hangs for about half an hour and then successfully connects to the remote host and prints the result of "ls /tmp".

Attached is the output of: strace mpirun --allow-run-as-root --bind-to none -n 3 -host 101.200.199.85:3 ls /tmp

Could you see if there is any way to solve this? Thanks in advance.


logs_of_strace.txt

TerryInFab avatar Aug 14 '23 13:08 TerryInFab

The last line of the txt file is exactly where it hangs for a long time.

TerryInFab avatar Aug 14 '23 13:08 TerryInFab

You are stracing your mpirun process, not your 3 application processes, so it is difficult to see why it deadlocks. The last line is a blocking poll, which is the normal call where a process waits for communication from other processes, so waiting there is expected behavior. The question is what happened to the expected message.

The fact that your run completes after 30 minutes indicates that a timeout is triggering. If your host system has multiple IP interfaces, it is possible that a connection message sent on one of them waits for an answer until a timeout expires. Since you run ls /tmp, you do not start any MPI processes, so I would look at the ORTE layer first. Pick your main ethernet device (usually eth0 or en0) and force the OOB to use it by doing mpirun -np 3 -host .... --mca oob_tcp_if_include eth0 <*the rest of your arguments*>.
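As a rough sketch of the suggestion above, the snippet below assembles the original command with the OOB channel pinned to one interface and prints it for review. The helper name oob_pinned_cmd is hypothetical, and eth0 is only an assumption; the actual interface name should be taken from something like `ip -o -4 addr show` on the node running mpirun.

```shell
#!/bin/sh
# Hypothetical helper: print an mpirun command line that restricts the
# out-of-band (OOB) TCP channel to a single network interface.
#   $1   interface name (assumed here to be eth0; verify on your system)
#   $2.. the rest of the mpirun arguments, passed through unchanged
oob_pinned_cmd() {
    iface="$1"
    shift
    printf 'mpirun --mca oob_tcp_if_include %s %s\n' "$iface" "$*"
}

# Print the command from this issue with the OOB pinned to eth0.
oob_pinned_cmd eth0 --allow-run-as-root --bind-to none -n 3 -host 101.200.199.85:3 ls /tmp
```

Printing the command first makes it easy to check the interface name before launching; once it looks right, the printed line can be pasted into the shell and run as-is.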

bosilca avatar Aug 14 '23 14:08 bosilca

@TerryInFab If you need further help, please send the information described here: https://docs.open-mpi.org/en/v5.0.x/getting-help.html#for-problems-launching-mpi-or-openshmem-applications

Technically, that's a help page for v5.0.x; you need to change one thing for the v4.0.x/v4.1.x series: in the commands shown, change --prtemca to --mca.
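To illustrate the flag substitution, here is a minimal sketch that rewrites a v5.0.x-style command into its v4.x form. The command string is an assumed example, not copied from the help page; plm_base_verbose is used only as a representative parameter.

```shell
#!/bin/sh
# Assumed v5.0.x-style command (illustrative only).
v5_cmd='mpirun --prtemca plm_base_verbose 100 -n 3 hostname'

# For the v4.0.x/v4.1.x series, the same parameter is passed with --mca.
v4_cmd=$(printf '%s\n' "$v5_cmd" | sed 's/--prtemca/--mca/g')

printf '%s\n' "$v4_cmd"
```

The remaining arguments are left untouched; only the option name changes.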

jsquyres avatar Aug 17 '23 14:08 jsquyres

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] avatar Mar 05 '24 17:03 github-actions[bot]

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions[bot] avatar Mar 19 '24 21:03 github-actions[bot]