
[Bug][High Availability] 502 Errors while Head Node in Recovery

Open bewestphal opened this issue 2 years ago • 18 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

What you expected to happen

I expect that while the head node is terminated, there will be no drop in request availability when sending requests to the Kubernetes service for a RayService.

What Happened

Following the HA Guide for Ray Serve + KubeRay, I tested dropping the head node pod while issuing requests to the Kubernetes Service -> {cluster_name}-serve-svc.

Intermittently, I received 502 errors: about 1/5th of requests failed for a few seconds while the head pod recovered.

However, if I follow the guide and port-forward to a Worker pod, I do not receive any 502 errors.

Hypothesis

Since the Kubernetes service ({cluster_name}-serve-svc) points to worker pods only (no head pod), this leads me to believe the 502 errors happen during some transient state induced by a reaction in KubeRay or Ray Serve.

Reproduction script

Run a simple request loop while tearing down the head node pod.

import time

import requests

url = "http://127.0.0.1:8000/dummy"

while True:
    resp = requests.post(url=url, json={"test": "test"})
    print(resp.status_code)
    time.sleep(0.1)

Anything else

Using Ray v2.4.0 & KubeRay nightly @ bc6be0ee3b513648ea929961fed3288164c9fc46

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

bewestphal avatar Jun 09 '23 07:06 bewestphal

Assuming you're using the code and config from the HA guide, these failures may be caused by requests that were being processed on the head node when it was killed.

Could you retry this experiment, with two changes:

  1. Set numCpus for the SleepyPid replicas to 0.1.
  2. Set num-cpus on the head node to '0' instead of '2'.

This will block any replicas from running on the head node, so the failures should go away.

For context, these failures may come from the handful of requests that are (1) handled by the HTTP proxy on the head node and (2) also processed by replicas on the head node. When the head node crashes, the proxy and these replicas will crash, so some requests will fail without being retried by the proxy. You can avoid this by running deployment replicas only on worker nodes.
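
For reference, the numCpus setting in point 1 corresponds to ray_actor_options in the Serve application code; below is a minimal sketch, assuming the SleepyPid deployment from the HA guide (the replica count and body here are illustrative). Point 2 (num-cpus: '0') is set under the head group's rayStartParams in the RayService YAML rather than in Python.

import os

from ray import serve

# Pin each replica to 0.1 CPU so the Ray scheduler can place it on a worker
# node even when the head node advertises num-cpus: "0".
@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.1})
class SleepyPid:
    async def __call__(self, request) -> int:
        return os.getpid()

app = SleepyPid.bind()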

shrekris-anyscale avatar Jun 09 '23 16:06 shrekris-anyscale

Hey @shrekris-anyscale,

Unfortunately, I'm already using num-cpus: 0 on the head node. I'll try to post a minimal example.

In the Ray dashboard, for example, I see all the workloads on the worker node and none on the head node:

ray::HTTPProxyActor
ray::ServeReplica:Dummy
ray::ServeReplica:DAGDriver
ray::ServeController

In case it matters, the deployment numCpus settings are: Dummy (hello world) -> numCpus: 0.4 and DAGDriver -> numCpus: 0.1.

You're right that num-cpus: 0 really should be the recommended best practice for HA. We should update the sample YAML file and add a comment:

https://github.com/ray-project/ray/blob/93c05d1d4a19d423acfc8671251a95221e6e0980/doc/source/serve/doc_code/fault_tolerance/k8s_config.yaml#L87

bewestphal avatar Jun 09 '23 17:06 bewestphal

Thanks for following up! Roughly how long do you see 502 errors for, and how many errors do you see? Do they only stop when the head pod fully recovers, or do they stop before that?

I believe what may be happening is that after the head pod crashes, there's a brief period where the service continues to try sending requests to the head pod's HTTP Proxy, causing 502 errors. This should stop once the service recognizes that the head pod is dead and stops sending requests to it.

A small number of failures after a pod crashes is expected behavior, since the service needs a bit of time to recognize that the pod has crashed. Generally, we recommend adding a few client-side retries to handle this case.
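
For example, the reproduction loop from the issue description could be wrapped in a small retry helper; this is a sketch only, and the retry count, backoff, and timeout values are arbitrary.

import time

import requests

url = "http://127.0.0.1:8000/dummy"

def post_with_retries(payload, retries=3, backoff=0.5):
    # Retry 5xx responses and connection errors with a short, growing backoff
    # to ride out the window while the head pod is being replaced.
    last = None
    for attempt in range(retries + 1):
        try:
            last = requests.post(url, json=payload, timeout=2)
            if last.status_code < 500:
                return last
        except requests.RequestException:
            last = None
        time.sleep(backoff * (attempt + 1))
    return last

while True:
    resp = post_with_retries({"test": "test"})
    print(resp.status_code if resp is not None else "failed after retries")
    time.sleep(0.1)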

shrekris-anyscale avatar Jun 09 '23 18:06 shrekris-anyscale

how long do you see 502 errors for, and how many errors do you see? Do they only stop when the head pod fully recovers, or do they stop before that?

The errors happen for ~25% of requests during a short period of time, under 10 seconds.

The 502 errors stop while the pod is still in a Terminating state. The errors happen after the delete pod... command is executed.

there's a brief period where the service continues to try sending requests to the head pod's HTTP Proxy

Yes, I believe this is the case. Is there a reason requests go to the head node at all if num-cpus is zero?

needs a bit of time to recognize that the pod has crashed

Curious, which system is in charge of this mechanism? KubeRay, Ray Serve, or the Kubernetes service itself?

bewestphal avatar Jun 10 '23 02:06 bewestphal

Note too: if I follow the HA guide's suggestion to port-forward to a worker pod, no 502 errors are observed.

🤔 Is there a way to exclude the head node from the service to take it out of the request path?

bewestphal avatar Jun 10 '23 02:06 bewestphal

The 502 errors stop while the pod is still in a Terminating state.

Yeah, this is likely because of the update period where the service recognizes that the head pod is down and stops sending requests to it.

Is there a reason it requests to the head node at all if num-cpus is zero?

Yes, the head node still contains an HTTP Proxy, even if num-cpus is zero. Some requests get routed to that proxy and then forwarded to the worker nodes.

Is there a way to exclude the head node as part of the service to take it out of the request loop?

There's no API to do this in KubeRay yet, but it is possible to add one. However, you would see this failure pattern if any node (including worker nodes) fails since all nodes contain an HTTP Proxy. I'm not sure if you would see sizeable availability gains by removing the proxy from the head node. Either way, I'd recommend adding a client-side retry policy to handle transient failures like this.

Curious, which system is in charge of this mechanism? Kuberay, Ray Serve, or Kubernetes service itself?

It's a combination of KubeRay and the Kubernetes service itself. KubeRay uses a Label Selector to determine which pods to route requests to. On every reconciliation loop, the KubeRay RayService controller checks whether or not a pod's HTTP Proxy is reachable. If it is not, the controller changes the pod's label, and then Kubernetes' label selector logic updates the service, so it stops sending requests to the pod.
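
One way to observe this from outside is to poll the serve service's Endpoints object while deleting the head pod. Here is a rough sketch using the official kubernetes Python client; the service and namespace names are placeholders for your RayService.

import time

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

# Placeholder names: substitute your RayService's namespace and serve service.
name, namespace = "rayservice-sample-serve-svc", "default"

while True:
    endpoints = v1.read_namespaced_endpoints(name=name, namespace=namespace)
    for subset in endpoints.subsets or []:
        ready = [addr.ip for addr in (subset.addresses or [])]
        not_ready = [addr.ip for addr in (subset.not_ready_addresses or [])]
        print("ready:", ready, "not ready:", not_ready)
    time.sleep(1)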

shrekris-anyscale avatar Jun 12 '23 18:06 shrekris-anyscale

The errors happen for ~25% of requests during a short period of time, under 10 seconds.

10 seconds does feel a bit long though. I manually experimented with a 2-node KubeRay cluster, and when I killed a pod, the Kubernetes service's endpoints list updated almost immediately. I'd expect almost no failures from requests that get sent after the head pod crashes.

Perhaps some of the requests that failed were requests that had been queued at the head node's HTTP proxy when it crashed? That might explain why the failures happened over a period of ~10 seconds.
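
One way to check that hypothesis is to timestamp each request so failures can be lined up against the moment the head pod was deleted; here is a small variation of the reproduction script above (same placeholder URL).

import time

import requests

url = "http://127.0.0.1:8000/dummy"

while True:
    start = time.time()
    try:
        status = requests.post(url, json={"test": "test"}, timeout=5).status_code
    except requests.RequestException as exc:
        status = type(exc).__name__
    # Failed requests that started before the head pod was deleted were likely
    # queued at the head node's HTTP proxy; later failures point at the service
    # still routing to the dead proxy.
    print(f"start={start:.3f} duration={time.time() - start:.3f}s status={status}")
    time.sleep(0.1)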

shrekris-anyscale avatar Jun 12 '23 18:06 shrekris-anyscale

I can confirm that I am seeing around 10 seconds of 502s as well. I am using the SleepyPid example on EKS with an NLB + Nginx as the ingress (if that helps).

askulkarni2 avatar Jun 13 '23 15:06 askulkarni2

@askulkarni2 Are you seeing any failures from requests that were launched after the head node crashed? Or are all the failures from requests that started before the head node crashed?

shrekris-anyscale avatar Jun 13 '23 16:06 shrekris-anyscale

I run a simple while loop and bring down the head node pod. As soon as I kill it, I start seeing 502s. I see the head node pod recover, yet the 502s continue for a while (~10 seconds) before responses go back to 200s.

askulkarni2 avatar Jun 13 '23 16:06 askulkarni2

Do you mind retrying that experiment by first killing the head node pod and then starting the while loop? Do you still see ~10 seconds of 502s in that case?

shrekris-anyscale avatar Jun 13 '23 16:06 shrekris-anyscale

I synced with @edoakes offline. Another potential cause is a stale DNS cache. The client that's sending requests to the K8s service is likely caching the IP addresses that the service hostname resolves to. There's probably a period of time after the head node crashes before that cache is updated to remove the head node's IP address.
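
If so, one way to check is to log what the hostname resolves to over time and see whether the resolved addresses change after the head pod is killed. A rough sketch follows; the hostname is a placeholder for whatever the client actually targets (the serve service, or the ingress/load balancer in front of it).

import socket
import time

# Placeholder hostname: replace with the hostname your client actually uses.
host = "rayservice-sample-serve-svc.default.svc.cluster.local"

while True:
    try:
        ips = sorted({info[4][0] for info in socket.getaddrinfo(host, 8000)})
    except socket.gaierror as exc:
        ips = [f"resolution failed: {exc}"]
    print(time.strftime("%H:%M:%S"), ips)
    time.sleep(1)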

shrekris-anyscale avatar Jun 13 '23 17:06 shrekris-anyscale

Any update on this? We've been running into this too. Sometimes, when a head node goes down, a new head node spins up and connects to the existing workers and we don't see many failed requests (only the ongoing requests on the head node).

But sometimes KubeRay spins up an entirely new cluster, and as soon as the head node is up, it switches over traffic to the new cluster resulting in a bunch of 502s since applications aren't ready on the new cluster.

I haven't yet figured out when KubeRay spins up a new cluster vs. when it doesn't (it seems random to me at this point), but it is a big stability issue for us.

smit-kiri avatar Sep 14 '23 14:09 smit-kiri

Do you have external redis setup @smit-kiri ?

akshay-anyscale avatar Sep 21 '23 20:09 akshay-anyscale

Do you have external redis setup @smit-kiri ?

Yes!

smit-kiri avatar Sep 22 '23 14:09 smit-kiri

The issue is still happening with KubeRay v1.0.0 and Ray 2.9.3.

The interesting part is that when the head node was down, everything was fine. I saw 502s for a few seconds after the head node started running again.

I'm using the code mentioned in this issue as the Serve application code, along with a quick testing script:

import time

import requests
from collections import defaultdict

url = "http://my-ingress-url/test-app-2"
status = defaultdict(int)

while True:
    try:
        response = requests.post(url, json="hello", timeout=0.5)
        status[response.status_code] += 1
    except Exception:
        status["timeout"] += 1
    
    print(status, end="\r")
    time.sleep(0.5)

I drain the node that the head pod is running on to simulate a failure.

smit-kiri avatar Mar 27 '24 22:03 smit-kiri

@smit-kiri, would you mind trying https://github.com/ray-project/kuberay/pull/1986 to see what happens? @Yicheng-Lu-llll has already conducted some manual RayService HA tests for KubeRay v1.1.0.

kevin85421 avatar Mar 28 '24 00:03 kevin85421

@smit-kiri, would you mind trying #1986 to see what happens? @Yicheng-Lu-llll has already conducted some manual RayService HA tests for KubeRay v1.1.0.

I can try that in a couple of days. One thing to note is that I did not have Serve autoscaling set up.

smit-kiri avatar Mar 28 '24 13:03 smit-kiri