
Need help with PS3 Cluster.

asanchez500 opened this issue 7 years ago • 5 comments

Thank you for taking the time to submit an issue!

Background information

mpirun.openmpi (OpenRTE) 2.0.2

Git clone.

  • Operating system/version: Debian 9
  • Computer hardware: 64 bit
  • Network type: Ethernet

Details of the problem

root@debianz:~/Desktop/nfs/mpi-hello-world# mpirun.openmpi --allow-run-as-root -np 2 mpi_hello_world
ssh: connect to host 192.168.43.57 port 22: Connection refused
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
ssh: connect to host 192.168.43.179 port 22: No route to host
ssh: connect to host 192.168.1.136 port 22: No route to host

I am simply trying to run a test program locally and cannot even get it to start. I have already set up the NFS and SSH servers on my PS3 cluster, which, according to the guide I followed, is all that should be needed. I just can't get my master node to work, either locally or over the network. Something is wrong with ORTE. Any help or guidance would be appreciated. Please and thank you.

asanchez500 commented Sep 01 '18 00:09

First, you should never run as root; it's easy to cause problems in your system. You also tend to hit things like ssh refusing to connect because of root-access restrictions.
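
For example, on Debian you could create and switch to an unprivileged account (the name mpiuser here is just a placeholder, not something from this thread):

adduser mpiuser    # create a matching regular user on each node
su - mpiuser       # switch to it before building and launching MPI jobs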

Second, it appears you have a default hostfile defined somewhere? mpirun is picking up those hosts from something; it won't just make them up, and it is trying to launch daemons on them without success. One way to check is to add --display-allocation to the command line and see which nodes it thinks it has been given.

Finally, you can ensure it runs only locally by adding -H foo:N to the command line, where foo is the hostname of the local host and N is the number of slots you want to allocate to it. A combined example is sketched below.
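
Putting those two suggestions together, a sketch of the command might look like this (run as a regular user; the hostname debianz is taken from your shell prompt, and 2 slots is just an assumption):

mpirun --display-allocation -H debianz:2 ./mpi_hello_world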

rhc54 commented Sep 01 '18 03:09

@asanchez500 it seems you are not running under a resource manager (such as Slurm, PBSPro, or others). In that case, the requirement is that you be able to ssh between the nodes without a password.

You need to authorize your public keys, allow ssh as root if that is really what you want (once again, we strongly advise you to use a regular user account), and accept the hosts' public keys; a typical key-setup sequence is sketched after the commands below. Then you should be able to run

ssh 192.168.43.179 true
ssh 192.168.1.136 true

and then you can try Open MPI again.
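
As a minimal sketch of the key setup mentioned above (run it as the account you will launch mpirun from; the IP addresses are taken from your output):

ssh-keygen -t rsa              # generate a key pair on the master node (accepting the defaults)
ssh-copy-id 192.168.43.179     # install the public key on each node and accept its host key
ssh-copy-id 192.168.1.136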

ggouaillardet commented Sep 04 '18 00:09

Welp. I'm still stuck trying to run this locally, mostly just to see whether my setup works at all. I know the program I am running works on another machine; I just can't figure out how to run it on the master node I am setting up. Here are my errors. Any help would be appreciated. Thank you.

root@debianz:~/Desktop/nfs/mpi-hello-world#  mpirun --allow-run-as-root -H debianz:2 ./mpi_hello_world
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[debianz:6853] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[debianz:6854] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[60537,1],0]
  Exit code:    1
--------------------------------------------------------------------------

asanchez500 commented Sep 05 '18 00:09

I notice a key line in the output of your first error message:

ssh: connect to host 192.168.43.57 port 22: Connection refused

This means that some machine was unable to ssh over to 192.168.43.57. Without being able to do that, Open MPI will fail. Does the .57 machine have an ssh daemon running, and does it allow incoming connections?
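
One way to check, assuming the .57 machine is also running Debian with systemd (an assumption, since this thread does not say what that node runs):

systemctl status ssh            # is the OpenSSH server installed and running?
systemctl enable --now ssh      # if not, start it now and on every boot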

jsquyres commented Sep 12 '18 23:09

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] commented Feb 16 '24 21:02

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions[bot] commented Mar 01 '24 21:03