[Bug][High Availability] 502 Errors while Head Node in Recovery
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
What you expected to happen
I expect that while a head node is terminated there will be no drop in request availability when sending requests to the Kubernetes service for a RayService.
What Happened
Following the HA Guide for Ray Serve + KubeRay, I tested dropping the head node pod while issuing requests to the Kubernetes Service -> {cluster_name}-serve-svc.
Intermittently, I received 502 errors: roughly 1/5 of requests failed for a few seconds while the head pod recovered.
However, if I follow the guide and port-forward to a Worker pod, I do not receive any 502 errors.
Hypothesis
Since the Kubernetes service ({cluster_name}-serve-svc) points to worker pods (not the head pod), this leads me to believe the 502 errors happen during some transient state induced by a reaction in KubeRay or Ray Serve.
Reproduction script
Run a simple request loop while tearing down the head node pod.
```python
import time

import requests

url = "http://127.0.0.1:8000/dummy"

while True:
    resp = requests.post(url=url, json={"test": "test"})
    print(resp.status_code)
    time.sleep(0.1)
```
Anything else
Using Ray v2.4.0 & KubeRay nightly @ bc6be0ee3b513648ea929961fed3288164c9fc46
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Assuming you're using the code and config from the HA guide, these failures may be caused by requests that were being processed on the head node when it was killed.
Could you retry this experiment, with two changes:
- Set `numCpus` for the `SleepyPid` replicas to 0.1.
- Set `num-cpus` on the head node to '0' instead of '2'.
This will block any replicas from running on the head node, so the failures should go away.
For context, these failures may come from the handful of requests that are (1) handled by the HTTP proxy on the head node and (2) also processed by replicas on the head node. When the head node crashes, the proxy and these replicas will crash, so some requests will fail without being retried by the proxy. You can avoid this by running deployment replicas only on worker nodes.
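For illustration, here is a minimal sketch of the per-replica CPU request expressed directly in Serve application code; this is roughly what the numCpus field in the RayService serve config corresponds to, and the class body is a placeholder rather than the HA guide's actual SleepyPid implementation:

```python
from ray import serve

# Request a small, non-zero CPU amount per replica. Combined with
# num-cpus: "0" on the head node, this keeps replicas off the head node,
# since the head advertises no CPUs to schedule them on.
@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.1})
class SleepyPid:
    async def __call__(self, request) -> str:
        return "ok"  # placeholder body for this sketch

app = SleepyPid.bind()
# serve.run(app)  # deploy when connected to the Ray cluster
```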
Hey @shrekris-anyscale,
Unfortunately, I'm already using num-cpus: 0 on the head node. I'll try to post a minimal example.
In the Ray dashboard, for example, I see all the workloads on the worker node and none on the head node:
- ray::HTTPProxyActor
- ray::ServeReplica:Dummy
- ray::ServeReplica:DAGDriver
- ray::ServeController
The deployment `numCpus` settings, in case it matters:
- Dummy (hello world) -> numCpus: 0.4
- DAGDriver -> numCpus: 0.1
You're right that num-cpus: 0 really should be recommended as best practice for HA. We should update the sample yaml file and add a comment.
https://github.com/ray-project/ray/blob/93c05d1d4a19d423acfc8671251a95221e6e0980/doc/source/serve/doc_code/fault_tolerance/k8s_config.yaml#L87
Thanks for following up! Roughly how long do you see 502 errors for, and how many errors do you see? Do they only stop when the head pod fully recovers, or do they stop before that?
I believe what may be happening is that after the head pod crashes, there's a brief period where the service continues to try sending requests to the head pod's HTTP Proxy, causing 502 errors. This should stop once the service recognizes that the head pod is dead and stops sending requests to it.
A small number of failures after a pod crashes is expected behavior, since the service needs a bit of time to recognize that the pod has crashed. Generally, we recommend adding a few client-side retries to handle this case.
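As a sketch of what those client-side retries could look like on top of the reproduction loop above (the URL and the retry/backoff values are illustrative, not tuned recommendations):

```python
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient 5xx responses with a short backoff so the brief window
# while the service drops the dead head pod is absorbed client-side.
# allowed_methods requires urllib3 >= 1.26.
retries = Retry(
    total=3,
    backoff_factor=0.2,
    status_forcelist=[502, 503, 504],
    allowed_methods=["POST"],
)
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retries))

url = "http://127.0.0.1:8000/dummy"
while True:
    try:
        resp = session.post(url, json={"test": "test"})
        print(resp.status_code)
    except requests.exceptions.RetryError:
        print("gave up after retries")
    time.sleep(0.1)
```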
> how long do you see 502 errors for, and how many errors do you see? Do they only stop when the head pod fully recovers, or do they stop before that?
The errors happen for ~25% of requests during a short period of time, under 10 seconds.
The 502 errors stop while the pod is still in a `Terminating` state. The errors start right after a delete pod... command is executed.
> there's a brief period where the service continues to try sending requests to the head pod's HTTP Proxy
Yes, I believe this is the case. Is there a reason it requests to the head node at all if num-cpus is zero?
> needs a bit of time to recognize that the pod has crashed
Curious, which system is in charge of this mechanism? KubeRay, Ray Serve, or the Kubernetes service itself?
Note too: if I follow the HA guide's suggestion to port-forward to a worker pod, no 502 errors are observed.
🤔 Is there a way to exclude the head node as part of the service to take it out of the request loop?
> The 502 errors stop while the pod is still in a `Terminating` state.
Yeah, this is likely because of the update period where the service recognizes that the head pod is down and stops sending requests to it.
> Is there a reason it requests to the head node at all if `num-cpus` is zero?
Yes, the head node still contains an HTTP Proxy, even if num-cpus is zero. Some requests get routed to that proxy and then forwarded to the worker nodes.
> Is there a way to exclude the head node as part of the service to take it out of the request loop?
There's no API to do this in KubeRay yet, but it is possible to add one. However, you would see this failure pattern if any node (including worker nodes) fails since all nodes contain an HTTP Proxy. I'm not sure if you would see sizeable availability gains by removing the proxy from the head node. Either way, I'd recommend adding a client-side retry policy to handle transient failures like this.
> Curious, which system is in charge of this mechanism? KubeRay, Ray Serve, or the Kubernetes service itself?
It's a combination of KubeRay and the Kubernetes service itself. KubeRay uses a Label Selector to determine which pods to route requests to. On every reconciliation loop, the KubeRay RayService controller checks whether or not a pod's HTTP Proxy is reachable. If it is not, the controller changes the pod's label, and then Kubernetes' label selector logic updates the service, so it stops sending requests to the pod.
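Roughly, the pattern looks like the sketch below, written with the Kubernetes Python client; the label key, namespace, port, and health-check path here are assumptions for illustration, not KubeRay's exact internals:

```python
import requests
from kubernetes import client, config

SERVE_LABEL = "ray.io/serve"   # assumed label key on serve-capable pods
HEALTH_PATH = "/-/healthz"     # assumed proxy health-check path
PROXY_PORT = 8000
NAMESPACE = "default"          # assumed namespace

config.load_kube_config()
v1 = client.CoreV1Api()

# One "reconciliation" pass: probe each labeled pod's HTTP proxy and flip
# its label when unreachable, so the Service's label selector stops
# matching it and Kubernetes removes it from the endpoints list.
pods = v1.list_namespaced_pod(NAMESPACE, label_selector=f"{SERVE_LABEL}=true")
for pod in pods.items:
    try:
        healthy = (
            requests.get(
                f"http://{pod.status.pod_ip}:{PROXY_PORT}{HEALTH_PATH}", timeout=1
            ).status_code
            == 200
        )
    except requests.RequestException:
        healthy = False
    if not healthy:
        v1.patch_namespaced_pod(
            pod.metadata.name,
            NAMESPACE,
            {"metadata": {"labels": {SERVE_LABEL: "false"}}},
        )
```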
> The errors happen for ~25% of requests during a short period of time, under 10 seconds.
10 seconds does feel a bit long though. I manually experimented with a 2-node KubeRay cluster, and when I killed a pod, the Kubernetes service's endpoints list updated almost immediately. I'd expect almost no failures from requests that get sent after the head pod crashes.
Perhaps some of the requests that failed were requests that had been queued at the head node's HTTP proxy when it crashed? That might explain why the failures happened over a period of ~10 seconds.
I can confirm that I am seeing around 10 seconds of 502s as well. I am using the sleepypid example on EKS with NLB + Nginx as ingress (if that helps).
@askulkarni2 Are you seeing any failures from requests that were launched after the head node crashed? Or are all the failures from requests that started before the head node crashed?
I run a simple while loop and bring down the head node pod. As soon as I kill it, I start seeing 502s. I see the head node pod recover, yet the 502s continue for a while (~10 seconds) and then it goes back to 200s.
Do you mind retrying that experiment by first killing the head node pod and then starting the while loop? Do you still see ~10 seconds of 502's in that case?
I synced with @edoakes offline. Another potential reason that could cause this is a stale DNS cache. The client that's sending requests to the K8s service is likely caching IP addresses that the service hostname resolves to. There's probably a period of time after the head node crashes before that cache updates to remove the head node's IP address.
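One way to check whether client-side DNS caching is a factor is to resolve the target hostname explicitly on every iteration and watch what comes back; a quick sketch, where the hostname is a placeholder for the actual {cluster_name}-serve-svc (or ingress) name:

```python
import socket
import time

# Placeholder hostname; inside the cluster this would be the
# {cluster_name}-serve-svc service name the client is hitting.
HOSTNAME = "raycluster-serve-svc.default.svc.cluster.local"

while True:
    try:
        ips = sorted({info[4][0] for info in socket.getaddrinfo(HOSTNAME, 8000)})
    except socket.gaierror as exc:
        ips = [f"resolution failed: {exc}"]
    # If stale addresses keep showing up after the head pod is replaced (or
    # the client library never re-resolves at all), DNS caching could
    # explain part of the 502 window.
    print(ips)
    time.sleep(1)
```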
Any update on this? We've been running into this too. Sometimes, when a head node goes down, a new head node spins up and connects to the existing workers and we don't see many failed requests (only the ongoing requests on the head node).
But sometimes KubeRay spins up an entirely new cluster, and as soon as the head node is up, it switches traffic over to the new cluster, resulting in a bunch of 502s since the applications aren't ready on the new cluster yet.
I haven't yet figured out when KubeRay spins up a new cluster versus reusing the existing one (it seems random to me at this point), but it is a big stability issue for us.
Do you have external Redis set up, @smit-kiri?
> Do you have external Redis set up, @smit-kiri?
Yes!
The issue is still happening with KubeRay v1.0.0 and Ray 2.9.3
The interesting part is that when the head node was down, everything was fine. I saw 502s for a few seconds after the head node started running again.
I'm using the code mentioned in this issue as the Serve application code, and a quick testing script:
```python
import time
from collections import defaultdict

import requests

url = "http://my-ingress-url/test-app-2"
status = defaultdict(int)

while True:
    try:
        response = requests.post(url, json="hello", timeout=0.5)
        status[response.status_code] += 1
    except Exception:
        status["timeout"] += 1
    print(status, end="\r")
    time.sleep(0.5)
```
I drain the node that the head pod is running on to simulate a failure
@smit-kiri, would you mind trying https://github.com/ray-project/kuberay/pull/1986 to see what happens? @Yicheng-Lu-llll has already conducted some manual RayService HA tests for KubeRay v1.1.0.
I can try that in a couple of days. One thing to note is that I did not have Serve autoscaling set up.