
another elasticluster+slurm+gce issue

Open jrieffel opened this issue 6 years ago • 7 comments

I've been impressed with elasticluster's seamless ability to create slurm clusters on Google Compute Engine. However, I have run across a pair of issues:

  1. when I log into the frontend node for the first time, I have to start the slurmctld.

  2. Afterwards, I can grab nodes via

srun -N 4 --pty bash

and then run MPI jobs from the interactive session with mpirun -np 4 ./a.out

However, some unspecified amount of time later, this same set of steps fails, with the error:

[fall18cluster-compute001:26162] [[13577,1],0] usock_peer_recv_connect_ack: received unexpected process identifier 
[[13577,0],0] from [[13577,0],1]

Any tips on how to approach debugging this issue? It seems as if, at some point, the hostnames of the compute nodes are changing from compute001 to fall18cluster-compute001, and so on.

Thanks!

jrieffel avatar Sep 20 '18 13:09 jrieffel

when I log into the frontend node for the first time, I have to start the slurmctld.

This is definitely a bug. Does this happen when you first spin up the cluster, or only after a resize?

some unspecified amount of time later, this same set of steps fails, with the error:

The error message seems to come from OpenMPI; see https://github.com/open-mpi/ompi/issues/2328. From the comments there, it looks like the issue is indeed due to a host name mismatch.

On OpenStack, the host name is only changed at boot time and ElastiCluster already takes care of it; but maybe GCE behaves differently and resets the host name every time the DHCP lease is renegotiated.
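
A quick way to check whether the DHCP client is indeed rewriting the name (just a sketch, assuming the Debian/Ubuntu dhclient hook layout used by the stock GCE images; compute001 is an example node name):

# on a compute node, after putting the short name back by hand:
sudo hostnamectl set-hostname compute001
hostname                                        # -> compute001
# see whether any dhclient exit hook touches the host name:
grep -rn hostname /etc/dhcp/dhclient-exit-hooks.d/
# after the next lease renewal, the long name is back if such a hook is at work:
hostname                                        # -> fall18cluster-compute001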

riccardomurri avatar Sep 20 '18 14:09 riccardomurri

when I log into the frontend node for the first time, I have to start the slurmctld.

This is definitely a bug. Does this happen when you first spin up the cluster, or only after a resize?

After spinup

some unspecified amount of time later, this same set of steps fails, with the error:

The error message seems to come from OpenMPI; see open-mpi/ompi#2328. From the comments there, it looks like the issue is indeed due to a host name mismatch.

On OpenStack, the host name is only changed at boot time and ElastiCluster already takes care of it; but maybe GCE behaves differently and resets the host name every time the DHCP lease is renegotiated.

This does seem to be an issue with GCE. I fixed the problem temporarily by sshing to each node and resetting its hostname via hostnamectl. I was subsequently able to run MPI jobs. I think I can make this fix permanent by adding a corresponding script to /etc/dhcp/
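
For reference, such a hook might look roughly like the sketch below. The file name and the hard-coded node name are placeholders; scripts in /etc/dhcp/dhclient-exit-hooks.d/ are sourced in lexical order, so a name sorting last runs after Google's hook and can undo its change:

# /etc/dhcp/dhclient-exit-hooks.d/zz-keep-short-hostname   (hypothetical file name)
# Re-apply the short host name after every DHCP lease event.
wanted="compute001"        # placeholder: the short name this node should keep
if [ "$(hostname)" != "$wanted" ]; then
  hostname "$wanted"
  # keep systemd's stored host name in sync as well, if hostnamectl is available
  command -v hostnamectl >/dev/null 2>&1 && hostnamectl set-hostname "$wanted"
fi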

jrieffel avatar Sep 20 '18 15:09 jrieffel

Re: restarting slurmctld: I cannot reproduce this issue; if I start a cluster on GCP, SLURM is functional and I can successfully submit a test job without touching slurmctld. What base OS are you using? What does your config look like?
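
If you hit it again, the daemon's state and logs on the frontend would also help; roughly (a sketch, assuming a systemd-based image):

# on the frontend node:
systemctl status slurmctld
journalctl -u slurmctld --no-pager | tail -n 50
sinfo      # should list the compute nodes once slurmctld is up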

riccardomurri avatar Sep 20 '18 19:09 riccardomurri

Re: nodes changing the host name: it looks to me like the culprit is (on Google's official Ubuntu 16.04 "xenial" image) the script /etc/dhcp/dhclient-exit-hooks.d/google_set_hostname, which contains the following remark and code (lines 36--48):

# As a result, we set the host name in all circumstances here, to the truncated
# unqualified domain name.

if [ -n "$new_host_name" ]; then
  hostname "${new_host_name%%.*}"

  # If NetworkManager is installed set the hostname with nmcli.
  # to resolve issues with NetworkManager resetting the hostname
  # to the FQDN on DHCP renew.
  if ...
    nmcli general hostname "${new_host_name%%.*}"
  fi
...

I'll check whether this is an issue that we can solve by configuring SLURM or OpenMPI slightly differently, or if we need to overwrite/patch this script from Google.

riccardomurri avatar Sep 20 '18 19:09 riccardomurri

Yes - other people have reported this behavior with GCE instances. I can "temporarily" fix things by calling hostnamectl on all the affected nodes, but as soon as they grab a new DHCP lease the hostname resets.
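
(Concretely, the temporary fix is just a loop like this, run from the frontend; a sketch that assumes the login user can ssh to the nodes and run sudo without a password, node names as in the /etc/hosts excerpt below:)

for n in compute00{1..8}; do
  ssh "$n" sudo hostnamectl set-hostname "$n"
done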

Although the fault is clearly GCE's, I wonder if there is some brittleness on the MPI side of things. The error MPI users get, when the hostnames have changed, is:

usock_peer_recv_connect_ack: received unexpected process identifier

The /etc/hosts lines look like this:

10.142.0.5 compute001 fall18cluster-compute001.c.csc333-f18.internal fall18cluster-compute001
10.142.0.6 compute002 fall18cluster-compute002.c.csc333-f18.internal fall18cluster-compute002
10.142.0.4 compute003 fall18cluster-compute003.c.csc333-f18.internal fall18cluster-compute003
10.142.0.3 compute004 fall18cluster-compute004.c.csc333-f18.internal fall18cluster-compute004
etc

And the salient portions of my elasticluster config are:

[setup/ansible]
ansible_forks=20
ansible_timeout=200

[setup/ansible-slurm]
provider=ansible
frontend_groups=slurm_master
compute_groups=slurm_worker

[cluster/fall18cluster]
cloud=google
login=google
setup=ansible-slurm
security_group=default
image_id=debian-9-stretch-v20180716
flavor=g1-small
frontend_nodes=1
compute_nodes=8
ssh_to=frontend
boot_disk_size=50

[cluster/fall18cluster/frontend]
boot_disk_type=pd-standard
boot_disk_size=50

jrieffel avatar Sep 24 '18 21:09 jrieffel

I have a tentative fix for this in https://github.com/gc3-uzh-ch/elasticluster/pull/594 -- would you be able to test that branch of code?
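
(One way to try it, as a sketch: install ElastiCluster from the PR branch into a throwaway virtualenv; BRANCH-NAME below is a placeholder for whatever branch #594 points at.)

virtualenv elasticluster-test && . elasticluster-test/bin/activate
pip install "git+https://github.com/gc3-uzh-ch/elasticluster.git@BRANCH-NAME"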

(Sorry for the long delay in getting to fix this; I've had a very busy fall semester, and also somehow this issue fell off my radar.)

riccardomurri avatar Dec 27 '18 16:12 riccardomurri

Yes, I probably could, this week or next. I'll have to re-up my Google Compute credits.

jrieffel avatar Dec 27 '18 17:12 jrieffel