
A process or daemon was unable to complete a TCP connection

Open · yangbinma opened this issue 3 years ago • 1 comment


What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

mpirun (Open MPI) 4.1.2rc4

Describe how Open MPI was installed

from source

Please describe the system on which you are running

  • Operating system/version: Ubuntu 20.04
  • Computer hardware:
  • Network type:

Details of the problem

ubuntu@master:~$ mpirun --mca plm_base_verbose 10 --host worker1 hostname
[master:1660837] mca: base: components_register: registering framework plm components
[master:1660837] mca: base: components_register: found loaded component rsh
[master:1660837] mca: base: components_register: component rsh register function successful
[master:1660837] mca: base: components_register: found loaded component isolated
[master:1660837] mca: base: components_register: component isolated has no register or open function
[master:1660837] mca: base: components_register: found loaded component slurm
[master:1660837] mca: base: components_register: component slurm register function successful
[master:1660837] mca: base: components_open: opening plm components
[master:1660837] mca: base: components_open: found loaded component rsh
[master:1660837] mca: base: components_open: component rsh open function successful
[master:1660837] mca: base: components_open: found loaded component isolated
[master:1660837] mca: base: components_open: component isolated open function successful
[master:1660837] mca: base: components_open: found loaded component slurm
[master:1660837] mca: base: components_open: component slurm open function successful
[master:1660837] mca:base:select: Auto-selecting plm components
[master:1660837] mca:base:select:(  plm) Querying component [rsh]
[master:1660837] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[master:1660837] mca:base:select:(  plm) Querying component [isolated]
[master:1660837] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[master:1660837] mca:base:select:(  plm) Querying component [slurm]
[master:1660837] mca:base:select:(  plm) Selected component [rsh]
[master:1660837] mca: base: close: component isolated closed
[master:1660837] mca: base: close: unloading component isolated
[master:1660837] mca: base: close: component slurm closed
[master:1660837] mca: base: close: unloading component slurm
[master:1660837] [[3347,0],0] plm:rsh: final template argv:
	/usr/bin/ssh <template>           PATH=/home/ubuntu/mpi_shared_apps/ompi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/ubuntu/mpi_shared_apps/ompi/lib:${LD_LIBRARY_PATH:-} ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/ubuntu/mpi_shared_apps/ompi/lib:${DYLD_LIBRARY_PATH:-} ; export DYLD_LIBRARY_PATH ;   /home/ubuntu/mpi_shared_apps/ompi/bin/orted -mca ess "env" -mca ess_base_jobid "219348992" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "master,worker[1:1]@0(2)" -mca orte_hnp_uri "219348992.0;tcp://127.0.0.1:41995" --mca plm_base_verbose "10" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "219348992.0;tcp://127.0.0.1:41995" -mca pmix "^s1,s2,cray,isolated"
[worker1:17189] mca: base: components_register: registering framework plm components
[worker1:17189] mca: base: components_register: found loaded component rsh
[worker1:17189] mca: base: components_register: component rsh register function successful
[worker1:17189] mca: base: components_open: opening plm components
[worker1:17189] mca: base: components_open: found loaded component rsh
[worker1:17189] mca: base: components_open: component rsh open function successful
[worker1:17189] mca:base:select: Auto-selecting plm components
[worker1:17189] mca:base:select:(  plm) Querying component [rsh]
[worker1:17189] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[worker1:17189] mca:base:select:(  plm) Selected component [rsh]
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    worker1
  Remote host:   master
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
[worker1:17189] mca: base: close: component rsh closed
[worker1:17189] mca: base: close: unloading component rsh
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
[master:1660837] mca: base: close: component rsh closed
[master:1660837] mca: base: close: unloading component rsh
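
One detail in the log above may be worth checking: the orted command line passes orte_hnp_uri "219348992.0;tcp://127.0.0.1:41995", so the only address worker1 is given for reaching mpirun is loopback, which on worker1 points back at worker1 itself. Whether that is the actual cause here is not confirmed in this thread. The Python sketch below pulls the advertised endpoints out of a pasted log and flags the loopback-only case. The function names are hypothetical, and the comma-separated multi-address form is an assumption; only the single-address form appears in the log.

```python
import ipaddress
import re


def hnp_endpoints(log_text):
    """Return (ip, port) pairs advertised in the orte_hnp_uri argument."""
    match = re.search(r'orte_hnp_uri\s+"([^"]+)"', log_text)
    if match is None:
        return []
    endpoints = []
    # Shape seen in the log: "<jobid>.<vpid>;tcp://<ip>:<port>".
    # Assumption: several IPs may be listed as "tcp://<ip>,<ip>:<port>".
    for part in match.group(1).split(";")[1:]:
        m = re.fullmatch(r"tcp://([0-9.,]+):(\d+)", part)
        if m:
            for ip in m.group(1).split(","):
                endpoints.append((ip, int(m.group(2))))
    return endpoints


def only_loopback(endpoints):
    """True when at least one endpoint exists and all of them are loopback."""
    return bool(endpoints) and all(
        ipaddress.ip_address(ip).is_loopback for ip, _ in endpoints
    )


# Fragment copied from the "final template argv" line above.
sample = '-mca orte_hnp_uri "219348992.0;tcp://127.0.0.1:41995" --mca plm_base_verbose "10"'
endpoints = hnp_endpoints(sample)
print(endpoints, only_loopback(endpoints))
```

If this reports loopback only, the interfaces and hostname resolution on master (for example, what the name master resolves to on master itself) are a reasonable next place to look.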

yangbinma · Jun 29 '22 20:06

If I install Open MPI with apt install openmpi-bin libopenmpi-dev, there is no issue at all, even though the firewall setup is the same. So I think I configured something wrong in the source build above.

mpirun --mca plm_base_verbose 10 --host worker1 hostname
[master:12357] mca: base: components_register: registering framework plm components
[master:12357] mca: base: components_register: found loaded component rsh
[master:12357] mca: base: components_register: component rsh register function successful
[master:12357] mca: base: components_register: found loaded component isolated
[master:12357] mca: base: components_register: component isolated has no register or open function
[master:12357] mca: base: components_register: found loaded component slurm
[master:12357] mca: base: components_register: component slurm register function successful
[master:12357] mca: base: components_open: opening plm components
[master:12357] mca: base: components_open: found loaded component rsh
[master:12357] mca: base: components_open: component rsh open function successful
[master:12357] mca: base: components_open: found loaded component isolated
[master:12357] mca: base: components_open: component isolated open function successful
[master:12357] mca: base: components_open: found loaded component slurm
[master:12357] mca: base: components_open: component slurm open function successful
[master:12357] mca:base:select: Auto-selecting plm components
[master:12357] mca:base:select:( plm) Querying component [rsh]
[master:12357] mca:base:select:( plm) Query of component [rsh] set priority to 10
[master:12357] mca:base:select:( plm) Querying component [isolated]
[master:12357] mca:base:select:( plm) Query of component [isolated] set priority to 0
[master:12357] mca:base:select:( plm) Querying component [slurm]
[master:12357] mca:base:select:( plm) Selected component [rsh]
[master:12357] mca: base: close: component isolated closed
[master:12357] mca: base: close: unloading component isolated
[master:12357] mca: base: close: component slurm closed
[master:12357] mca: base: close: unloading component slurm
[master:12357] [[27370,0],0] plm:rsh: final template argv:
	/usr/bin/ssh
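
Since the firewall setup is reported to be the same for both installs, a direct probe from worker1 toward master can help separate "blocked by a firewall" from "connecting to the wrong address". The following is a minimal Python sketch, not part of Open MPI; can_connect is a hypothetical helper. It assumes something is listening on the chosen port on master at the time of the probe, and that the port mpirun listens on can differ between runs (it was 41995 in the first log).

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Attempt a single TCP connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python3 probe.py master 41995   (run on worker1)
    host, port = sys.argv[1], int(sys.argv[2])
    print(host, port, can_connect(host, port))
```

A refused connection on an address that should be reachable suggests nothing is listening there, while a timeout is more typical of packets being dropped along the way.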

yangbinma · Jun 30 '22 06:06