Unnecessary 15+ minutes spent on Ray SSH attempts to a dead VM on GCP during spot recovery
I've seen multiple times that the Ray autoscaler wastes 15+ minutes during spot recovery on SSH login attempts to a dead VM on GCP. This adds a significant delay to spot recovery (from ~15 minutes to ~30 minutes).
The reason is that Ray wrongly assumes the VM is still alive.
According to https://cloud.google.com/compute/docs/instances/preemptible#preemption-process
Compute Engine performs the following steps to preempt an instance:
1. Compute Engine sends a preemption notice to the instance in the form of an ACPI G2 Soft Off signal. You can use a shutdown script to handle the preemption notice and complete cleanup actions before the instance stops.
2. If the instance does not stop after 30 seconds, Compute Engine sends an ACPI G3 Mechanical Off signal to the operating system.
3. Compute Engine transitions the instance to a TERMINATED state.
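As an aside on step 1: a shutdown script can confirm that the preemption notice has arrived by polling the instance metadata server. Below is a minimal sketch; the `instance/preempted` key and the required `Metadata-Flavor: Google` header are GCP's documented metadata API, while the helper names are mine:

```python
import urllib.request

# GCP's documented metadata key for preemption notices; it returns the
# literal string "TRUE" once the preemption process has begun.
PREEMPTED_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                 "instance/preempted")

def parse_preempted(body: str) -> bool:
    # Parsing is split out from the HTTP call so it is testable off-GCP.
    return body.strip().upper() == "TRUE"

def instance_preempted() -> bool:
    # Only reachable from inside a GCP VM; the Metadata-Flavor header
    # is required or the metadata server rejects the request.
    req = urllib.request.Request(
        PREEMPTED_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return parse_preempted(resp.read().decode())
```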
However, there is a significant time gap (~1 minute, observed empirically) between steps 2 and 3. Hence the Ray autoscaler wrongly thought the VM was still in the RUNNING state and tried to SSH in. It kept trying for 15 minutes and then failed:
```
ssh: connect to host 35.222.69.89 port 22: Connection timed out
ssh: connect to host 35.222.69.89 port 22: Connection timed out
ssh: connect to host 35.222.69.89 port 22: Connection timed out
ssh: connect to host 35.222.69.89 port 22: Connection timed out
ssh: connect to host 35.222.69.89 port 22: Connection timed out
ssh: connect to host 35.222.69.89 port 22: Connection timed out
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 134, in run
    self.do_update()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 299, in do_update
    self.wait_ready(deadline)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 248, in wait_ready
    raise Exception("wait_ready timeout exceeded.")
Exception: wait_ready timeout exceeded.
2022-05-16 22:22:04,680 PANIC commands.py:724 -- Failed to setup head node.
Error: Failed to setup head node.
```
In fact, the autoscaler already implements similar logic (code here) to detect the VM's state before each SSH attempt. However, the VM state it checks is cached (code here) and can be out of date: it does not refresh the VM's state by calling GCP's API.
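To make the staleness concrete, here is a minimal self-contained sketch. `FakeProvider` is a hypothetical stand-in, not Ray's actual GCP node provider (which caches node objects rather than status strings), but the cache/refresh relationship is the same:

```python
class FakeProvider:
    """Hypothetical stand-in for a cloud node provider (illustration only)."""

    def __init__(self, cloud_state):
        self.cloud_state = cloud_state  # authoritative, like GCP's API
        self.cached_nodes = {}          # what is_terminated() consults

    def non_terminated_nodes(self, tag_filters):
        # Queries the "cloud API" and refreshes the cache.
        self.cached_nodes = {
            node_id: status
            for node_id, status in self.cloud_state.items()
            if status != "TERMINATED"
        }
        return list(self.cached_nodes)

    def is_terminated(self, node_id):
        # Answers only from the cache -- no API call.
        return node_id not in self.cached_nodes

cloud = {"head": "RUNNING"}
provider = FakeProvider(cloud)
provider.non_terminated_nodes({})      # initial refresh
cloud["head"] = "TERMINATED"           # spot VM is preempted

print(provider.is_terminated("head"))  # False: the cache is stale
provider.non_terminated_nodes({})      # only an explicit refresh...
print(provider.is_terminated("head"))  # ...reveals the termination: True
```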
I discussed possible solutions with @Michaelvll. We think a simple fix is to patch Ray's autoscaler by adding `self.provider.non_terminated_nodes({})` before https://github.com/ray-project/ray/blob/6d978ab10ec65da1018790f8605b5b8946e838e5/python/ray/autoscaler/_private/updater.py#L272. `non_terminated_nodes()` refreshes `self.cached_nodes`, so Ray can then detect whether the VM was terminated during the SSH retries.