skypilot icon indicating copy to clipboard operation
skypilot copied to clipboard

Unnecessary 15+ mins is spent on Ray ssh attempts to a dead VM on GCP during spot recovery

Open infwinston opened this issue 2 years ago • 0 comments

I've seen multiple times that Ray autoscaler wasting 15+mins during spot recovery on ssh login to a dead VM on GCP. This brings a significant delay for spot recovery (from 15mins to 30mins).

The reason is Ray wrongly assumed the VM is still alive.

According to https://cloud.google.com/compute/docs/instances/preemptible#preemption-process

Compute Engine performs the following steps to preempt an instance:

1. Compute Engine sends a preemption notice to the instance in the form of an ACPI G2 Soft Off signal. You can use a shutdown script to handle the preemption notice and complete cleanup actions before the instance stops.
2. If the instance does not stop after 30 seconds, Compute Engine sends an ACPI G3 Mechanical Off signal to the operating system.
3. Compute Engine transitions the instance to a TERMINATED state.

However, there is a significant time gap (~1 min, observed empirically) between step 2 and 3. Hence ray autoscaler wrongly thought that the VM is still in RUNNING state and tried to ssh login. It kept trying for 15mins and failed.

ssh: connect to host 35.222.69.89 port 22: Connection timed out                                                                                                                             ssh: connect to host 35.222.69.89 port 22: Connection timed out                                                                                                                             
ssh: connect to host 35.222.69.89 port 22: Connection timed out                                                                   
ssh: connect to host 35.222.69.89 port 22: Connection timed out                                                                                                                             
ssh: connect to host 35.222.69.89 port 22: Connection timed out                                                                                                                             ssh: connect to host 35.222.69.89 port 22: Connection timed out                                                                                                                             Exception in thread Thread-1:                                                                                                                                                               
Traceback (most recent call last):                                                                                                
  File "/opt/conda/lib/python3.9/threading.py", line 954, in _bootstrap_inner                                                                                                               
    self.run()                                                                                                                                                                                File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 134, in run                                                                                   self.do_update()                                                                                                                                                                        
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 299, in do_update               
    self.wait_ready(deadline)                                                                                                                                                               
  File "/home/ubuntu/.local/lib/python3.9/site-packages/ray/autoscaler/_private/updater.py", line 248, in wait_ready                                                                            raise Exception("wait_ready timeout exceeded.")                                                                                                                                         Exception: wait_ready timeout exceeded.                                                                                                                                                     
2022-05-16 22:22:04,680 PANIC commands.py:724 -- Failed to setup head node.                                                       
Error: Failed to setup head node.  

In fact Autoscaler already implemented similar logic here to detect vm states before each ssh attempts. However, the VM states they used to check is cached (code here) and can be out-to-date. They do not refresh VM's states by calling GCP's API.

I discussed with @Michaelvll about possible solutions. We think a simple way is to patch Ray's autoscaler by adding

self.provider.non_terminated_nodes({})

before https://github.com/ray-project/ray/blob/6d978ab10ec65da1018790f8605b5b8946e838e5/python/ray/autoscaler/_private/updater.py#L272.

non_terminated_nodes() will refresh the states of self.cached_nodes and then ray will be able to detect whether the VM is terminated during ssh retry.

infwinston avatar May 17 '22 20:05 infwinston