another elasticluster+slurm+gce issue
I've been impressed with elasticluster's seamless ability to create SLURM clusters on Google Compute Engine (GCE). However, I have run across a pair of issues:
- When I log into the frontend node for the first time, I have to start the slurmctld.
- Afterwards, I can grab nodes via `srun -N 4 --pty bash` and then run MPI jobs from the interactive session with `mpirun -np 4 ./a.out`.
However, some unspecified amount of time later, this same set of steps fails, with the error:

```
[fall18cluster-compute001:26162] [[13577,1],0] usock_peer_recv_connect_ack: received unexpected process identifier [[13577,0],0] from [[13577,0],1]
```
Any tips on how to approach debugging this issue? It seems as if, at some point, the hostname of the compute nodes is changing from compute001 to fall18cluster-compute001, etc.
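For reference, a quick way to check for the mismatch from the frontend (just a sketch; it assumes the usual passwordless SSH between frontend and compute nodes, and the node list should be adjusted to your cluster):

```sh
# Compare the short names SLURM knows against what each node currently reports.
for h in compute001 compute002 compute003 compute004; do
  echo -n "$h -> "; ssh "$h" hostname
done
```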
Thanks!
> when I log into the frontend node for the first time, I have to start the slurmctld.
This is definitely a bug. Does this happen when you first spin up the cluster, or only after a resize?
> some unspecified amount of time later, this same set of steps fails, with the error:

The error message seems to come from OpenMPI; see https://github.com/open-mpi/ompi/issues/2328. From the comments there, it looks like the issue is indeed due to a hostname mismatch.
On OpenStack, the hostname is only changed at boot time and ElastiCluster already takes care of it; but maybe GCE behaves differently and resets the hostname every time the DHCP lease is renewed.
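As a quick check of that theory (a sketch only; the syslog identifier may differ on other images), one could watch a compute node and see whether the name flips back in step with DHCP client activity:

```sh
# On a compute node: watch the hostname over time...
watch -n 60 hostname
# ...and, in another terminal, follow the DHCP client log to correlate
# hostname changes with lease renewals (identifier may vary by image):
sudo journalctl -f -t dhclient
```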
> > when I log into the frontend node for the first time, I have to start the slurmctld.
>
> This is definitely a bug. Does this happen when you first spin up the cluster, or only after a resize?
After spinup
> > some unspecified amount of time later, this same set of steps fails, with the error:
>
> The error message seems to come from OpenMPI; see open-mpi/ompi#2328. From the comments there, it looks like the issue is indeed due to a hostname mismatch.
> On OpenStack, the hostname is only changed at boot time and ElastiCluster already takes care of it; but maybe GCE behaves differently and resets the hostname every time the DHCP lease is renewed.
This does seem to be an issue with GCE. I fixed the problem temporarily by sshing to each node and resetting its hostname via `hostnamectl`. I was subsequently able to run MPI jobs. I think I can make this fix permanent by adding a corresponding script to /etc/dhcp/.
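A minimal sketch of the kind of hook I have in mind (hypothetical file name, untested; it assumes /etc/hostname still holds the short node name that ElastiCluster assigned, and relies on dhclient sourcing the scripts in this directory in lexical order, so a name sorting after google_set_hostname runs last):

```sh
# /etc/dhcp/dhclient-exit-hooks.d/zzz_restore_short_hostname  (hypothetical)
# Runs after Google's google_set_hostname hook and puts back the statically
# configured short name (e.g. "compute001") if the DHCP renewal changed it.
if [ -s /etc/hostname ]; then
    wanted="$(cat /etc/hostname)"
    if [ "$(hostname)" != "$wanted" ]; then
        hostname "$wanted"
    fi
fi
```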
Re: restarting `slurmctld`: I cannot reproduce this issue; if I start a cluster on GCP, SLURM is functional and I can successfully submit a test job without touching `slurmctld`. What base OS are you using? What does your config look like?
Re: nodes changing the hostname: it looks to me that the culprit (on Google's official Ubuntu 16.04 "xenial" image) is the script /etc/dhcp/dhclient-exit-hooks.d/google_set_hostname, which contains the following remark and code (lines 36-48):
```sh
# As a result, we set the host name in all circumstances here, to the truncated
# unqualified domain name.
if [ -n "$new_host_name" ]; then
  hostname "${new_host_name%%.*}"

  # If NetworkManager is installed set the hostname with nmcli.
  # to resolve issues with NetworkManager resetting the hostname
  # to the FQDN on DHCP renew.
  if ...
    nmcli general hostname "${new_host_name%%.*}"
  fi
  ...
```
I'll check whether this is an issue that we can solve by configuring SLURM or OpenMPI slightly differently, or if we need to overwrite/patch this script from Google.
Yes - other people have reported this behavior with GCE instances. I can "temporarily" fix things by calling `hostnamectl` on all the affected nodes, but as soon as they grab a new DHCP lease the hostname resets.
Although the fault is clearly GCE's, I wonder if there is some brittleness on the MPI side of things. The error MPI users get when the hostnames have changed is:

```
usock_peer_recv_connect_ack: received unexpected process identifier
```
The /etc/hosts lines look like this:
```
10.142.0.5 compute001 fall18cluster-compute001.c.csc333-f18.internal fall18cluster-compute001
10.142.0.6 compute002 fall18cluster-compute002.c.csc333-f18.internal fall18cluster-compute002
10.142.0.4 compute003 fall18cluster-compute003.c.csc333-f18.internal fall18cluster-compute003
10.142.0.3 compute004 fall18cluster-compute004.c.csc333-f18.internal fall18cluster-compute004
etc.
```
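For completeness, these are plain diagnostics (nothing elasticluster-specific) that can be run on a compute node to compare what it reports against those entries:

```sh
# What the node currently calls itself, its FQDN, and how that name resolves.
hostname
hostname --fqdn
getent hosts "$(hostname)"
```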
And the salient portions of my elasticluster config are:
```ini
[setup/ansible]
ansible_forks=20
ansible_timeout=200

[setup/ansible-slurm]
provider=ansible
frontend_groups=slurm_master
compute_groups=slurm_worker

[cluster/fall18cluster]
cloud=google
login=google
setup=ansible-slurm
security_group=default
image_id=debian-9-stretch-v20180716
flavor=g1-small
frontend_nodes=1
compute_nodes=8
ssh_to=frontend
boot_disk_size=50

[cluster/fall18cluster/frontend]
boot_disk_type=pd-standard
boot_disk_size=50
```
I have a tentative fix for this in https://github.com/gc3-uzh-ch/elasticluster/pull/594 -- would you be able to test that branch of code?
(Sorry for the long delay in getting to fix this; I've had a very busy fall semester, and also somehow this issue fell off my radar.)
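In case it helps, one way to try out the fix is to install ElastiCluster from that branch into a dedicated virtualenv (a sketch; `BRANCH` is a placeholder for the actual branch name shown on the PR, and `virtualenv` is assumed to be available):

```sh
# Install ElastiCluster from the PR's branch into a throwaway virtualenv.
virtualenv elasticluster-594 && . elasticluster-594/bin/activate
pip install "git+https://github.com/gc3-uzh-ch/elasticluster.git@BRANCH#egg=elasticluster"
elasticluster --version   # sanity check that the branch install is active
```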
Yes, I probably could, this week or next. I'll have to re-up my Google Compute credits.