
Delay Launcher Until Workers Are Running (MPI)

KeithBallard opened this issue 3 years ago · 5 comments

With MPI jobs, it is preferred that the launcher container does not start until the workers are up and running. Otherwise, the MPI call is guaranteed to fail. Our workaround has been adding:

# Read the list of worker hostnames and try to ssh into each, forcing this container
# to wait until all workers are up and running before continuing.
for host in $(cat /etc/volcano/worker.host); do
  ssh $host date
  while test $? -gt 0; do
    echo "Could not reach host '$host', trying again in 5s."
    sleep 5
    ssh $host date
  done
  echo "$host is ready."
done

to the launcher's command string, which keeps trying to ssh into the workers one by one until all succeed. This essentially prevents any subsequent commands from running until the workers are available. However, if Volcano wants to cater to MPI jobs, it would be convenient for this to happen automatically in the background.
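For context, a rough sketch of how such a wait loop can be embedded in the launcher's command of a Job that uses the ssh and svc plugins is shown below; the job name, image, application binary (my_mpi_app), and replica counts are illustrative placeholders rather than the exact spec from this report:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-wait-example                         # placeholder name
spec:
  minAvailable: 3                                # launcher + 2 workers
  schedulerName: volcano
  plugins:
    ssh: []                                      # passwordless ssh between the pods
    svc: []                                      # generates /etc/volcano/<task>.host files
  tasks:
    - replicas: 1
      name: launcher
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: launcher
              image: volcanosh/example-mpi:0.0.1 # placeholder image with ssh + Open MPI
              command:
                - /bin/sh
                - -c
                - |
                  # Block until every worker accepts ssh, then start the MPI run.
                  for host in $(cat /etc/volcano/worker.host); do
                    until ssh $host date; do
                      echo "Could not reach host '$host', trying again in 5s."
                      sleep 5
                    done
                  done
                  mpiexec --allow-run-as-root --hostfile /etc/volcano/worker.host ./my_mpi_app
    - replicas: 2
      name: worker                               # task name determines the worker.host filename
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: volcanosh/example-mpi:0.0.1 # placeholder image
              command: ["/usr/sbin/sshd", "-D"]  # workers just run sshd and wait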

KeithBallard avatar Aug 01 '22 19:08 KeithBallard

@KeithBallard Hey. Please give the MPI plugin and task dependency a try.
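For anyone following along, those two features are the job-level mpi plugin and the task-level dependsOn field; a rough fragment of how they appear in a Job spec is sketched below (task names follow the doc example discussed in the next comment, pod templates are omitted, and the dependsOn form shown is the corrected one described later in this thread):

spec:
  plugins:
    # mpi plugin: declares which task is the launcher, which is the worker, and the ssh port
    mpi: ["--master=mpimaster", "--worker=mpiworker", "--port=22"]
  tasks:
    - replicas: 1
      name: mpimaster
      dependsOn:                  # task dependency: launcher starts only after the worker task
        name:
          - mpiworker
      # pod template omitted
    - replicas: 2
      name: mpiworker
      # pod template omitted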

Thor-wl avatar Aug 02 '22 00:08 Thor-wl

I have volcano scheduling mpi jobs in my cluster. The shipped example/integrations/mpi/mpi-example.yaml works well. I tried running with the mpi plugin instead and hit issues with the master pod failing name resolution of the worker pods. The YAML used was the one in docs/user-guide/how_to_use_mpi_plugin.md. The worker pods are running, but the master cannot resolve their names.

ssh: Could not resolve hostname lm-mpi-job-mpiworker-0: Name or service not known
ssh: Could not resolve hostname lm-mpi-job-mpiworker-1: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   lm-mpi-job-mpimaster-0
  target node:  lm-mpi-job-mpiworker-1

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

The intent was to check whether task dependency works with the mpi plugin, as it doesn't when manually using the ssh and svc plugins. I'm using the 1.6 release.

nysal avatar Sep 06 '22 15:09 nysal

I had some similar issues but eventually found workarounds.

First, regarding the mpi plugin, @nysal: it worked fine for me with:

plugins:
    mpi: ["--master=launcher","--worker=worker","--port=22"]  ## MPI plugin register

And ensuring the task names are exactly "launcher" and "worker", matching the --master and --worker arguments.

Second, there are two issues with the dependency syntax:

  1. The syntax is wrong in the documentation. It should be:

     dependsOn:
       name:
         - worker

  2. The minAvailable value for the spec creates issues. It must be set to exactly the number of workers (NOT including the launcher task); otherwise the job hangs forever and the workers stay pending. This should be rethought, since you would also want enough slots for the launcher. Workers running without a launcher are a lot of wasted resources. Both points are illustrated in the sketch after this list.
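Concretely, with two workers, a sketch of how the working plugin arguments and both fixes fit together (replica counts are only an example; pod templates omitted):

spec:
  minAvailable: 2          # = worker replicas only; counting the launcher too makes the job hang
  plugins:
    mpi: ["--master=launcher", "--worker=worker", "--port=22"]
  tasks:
    - replicas: 1
      name: launcher       # matches --master
      dependsOn:
        name:
          - worker         # corrected dependsOn syntax
      # pod template omitted
    - replicas: 2
      name: worker         # matches --worker
      # pod template omitted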

@Thor-wl sorry it took me so long to test it, but the two issues above need to be addressed at some point if the wider community is to adopt this functionality.

KeithBallard avatar Sep 06 '22 15:09 KeithBallard

@KeithBallard are you using v1.6 or a recent master branch build?

nysal avatar Sep 06 '22 16:09 nysal

I am using v1.6

KeithBallard avatar Sep 06 '22 16:09 KeithBallard

Hello 👋 Looks like there was no activity on this issue for the last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there is no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Dec 21 '22 00:12 stale[bot]

Closing for now as there was no activity for the last 60 days after being marked as stale; let us know if you need this to be reopened! 🤗

stale[bot] avatar Mar 23 '23 04:03 stale[bot]