[RayService] [GCS FT] Worker nodes don't serve traffic while head node is down
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
Others
What happened + What you expected to happen
I had a 2-node Kubernetes cluster on GKE running a RayService. It had 2 worker pods and 1 head pod, along with a single-node Redis deployment configured to support GCS fault tolerance:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ervice-sample-raycluster-thwmr-worker-small-group-6f2pk 1/1 Running 0 6m59s 10.68.2.64 gke-serve-demo-default-pool-ed597cce-nvm2 <none> <none>
ervice-sample-raycluster-thwmr-worker-small-group-bdv6q 1/1 Running 0 79m 10.68.2.62 gke-serve-demo-default-pool-ed597cce-nvm2 <none> <none>
rayservice-sample-raycluster-thwmr-head-28mdh 1/1 Running 1 (79m ago) 79m 10.68.0.45 gke-serve-demo-default-pool-ed597cce-pu2q <none> <none>
redis-75c8b8b65d-4qgfz 1/1 Running 0 79m 10.68.2.60 gke-serve-demo-default-pool-ed597cce-nvm2 <none> <none>
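For anyone reproducing this, one way to double-check that GCS fault tolerance is actually wired up (just a sketch; pod and resource names will differ per cluster) is to confirm the FT annotation on the RayService and the external Redis address on the head pod:
$ kubectl get rayservice rayservice-sample -o yaml | grep ft-enabled
$ kubectl exec rayservice-sample-raycluster-thwmr-head-28mdh -- env | grep RAY_REDIS_ADDRESS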
I started a port-forward to a worker pod and successfully got responses from my deployments:
$ kubectl port-forward ervice-sample-raycluster-thwmr-worker-small-group-bdv6q 8000
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
$ curl localhost:8000
418
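(Side note: the 418 above is presumably the PID returned by the SleepyPid deployment rather than an HTTP status code. Repeating the request, e.g. with the loop below, just cycles through replica PIDs.)
$ for i in 1 2 3; do curl -s localhost:8000; echo; done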
I then killed the head pod:
$ kubectl delete pod rayservice-sample-raycluster-thwmr-head-28mdh
pod "rayservice-sample-raycluster-thwmr-head-28mdh" deleted
Once the head pod was deleted, a replacement head pod started coming up:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
ervice-sample-raycluster-thwmr-worker-small-group-6f2pk 1/1 Running 0 24m
ervice-sample-raycluster-thwmr-worker-small-group-bdv6q 1/1 Running 0 96m
rayservice-sample-raycluster-thwmr-head-8xjpx 0/1 ContainerCreating 0 5s
redis-75c8b8b65d-4qgfz 1/1 Running 0 96m
My port-forward did not immediately die, and the worker pod was not immediately restarted, which makes me think that GCS fault tolerance was configured correctly. However, while the head pod was recovering, all of my curl requests hung with no response:
$ curl localhost:8000
(Note: my port-forward was eventually terminated, and the worker pods were restarted after the head pod came back up.)
Eventually, the head pod came back up, and the worker pods were restarted. After that, I could reconnect to the cluster and get successful responses from my deployments.
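(A rough way to measure the outage window, assuming the same port-forward on localhost:8000, is to poll the endpoint with a short curl timeout while the head pod restarts. Each line prints a timestamp and the HTTP status code; curl reports 000 while requests time out and 200 once serving resumes.)
$ while true; do echo "$(date +%T) $(curl -s -o /dev/null -w '%{http_code}' --max-time 2 localhost:8000)"; sleep 1; done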
I can't tell if I simply misconfigured GCS fault tolerance, or if this is how GCS fault tolerance is meant to behave.
Reproduction script
Serve application: https://github.com/ray-project/serve_config_examples/blob/42d10bab77741b40d11304ad66d39a4ec2345247/sleepy_pid.py
Kubernetes config file:
kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-config
  labels:
    app: redis
data:
  redis.conf: |-
    port 6379
    bind 0.0.0.0
    protected-mode no
    requirepass 5241590000000000
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:5.0.8
          command:
            - "sh"
            - "-c"
            - "redis-server /usr/local/etc/redis/redis.conf"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: config
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
      volumes:
        - name: config
          configMap:
            name: redis-config
---
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: "true"
spec:
  serveConfig:
    importPath: "sleepy_pid:app"
    runtimeEnv: |
      working_dir: "https://github.com/ray-project/serve_config_examples/archive/42d10bab77741b40d11304ad66d39a4ec2345247.zip"
    deployments:
      - name: SleepyPid
        numReplicas: 6
        rayActorOptions:
          numCpus: 0
  rayClusterConfig:
    rayVersion: '2.0.0'
    headGroupSpec:
      serviceType: ClusterIP
      replicas: 1
      rayStartParams:
        block: 'true'
        num-cpus: '2'
        object-store-memory: '100000000'
        dashboard-host: '0.0.0.0'
        node-ip-address: $MY_POD_IP # Auto-completed as the head pod IP
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.0.0
              imagePullPolicy: Always
              env:
                - name: MY_POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
                - name: RAY_REDIS_ADDRESS
                  value: redis:6379
              resources:
                limits:
                  cpu: 2
                  memory: 2Gi
                requests:
                  cpu: 2
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: redis
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: small-group
        rayStartParams:
          block: 'true'
          node-ip-address: $MY_POD_IP
        template:
          spec:
            initContainers:
              - name: init-myservice
                image: busybox:1.28
                command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
            containers:
              - name: machine-learning
                image: rayproject/ray:2.0.0
                imagePullPolicy: Always
                env:
                  - name: RAY_DISABLE_DOCKER_CPU_WARNING
                    value: "1"
                  - name: TYPE
                    value: "worker"
                  - name: CPU_REQUEST
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.cpu
                  - name: CPU_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.cpu
                  - name: MEMORY_LIMITS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: limits.memory
                  - name: MEMORY_REQUESTS
                    valueFrom:
                      resourceFieldRef:
                        containerName: machine-learning
                        resource: requests.memory
                  - name: MY_POD_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: MY_POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                ports:
                  - containerPort: 80
                    name: client
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
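To rule out a misconfigured RAY_REDIS_ADDRESS, it may also help to confirm that the GCS actually persists metadata to the external Redis. Something like the following (an illustrative check using the Redis pod name and requirepass value from the manifests above) should report a non-zero key count shortly after the head pod starts:
$ kubectl exec redis-75c8b8b65d-4qgfz -- redis-cli -a 5241590000000000 dbsize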
Anything else
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
cc @brucez-anyscale @wilsonwang371 @iycheng
@brucez-anyscale Bruce, I remember we have seen something similar to this before while using port forwarding, right?
I think @simon-mo and @iycheng have fixed this.
Can we close this issue? Thanks! @shrekris-anyscale @brucez-anyscale
Hi @kevin85421, this is still an issue, but I'm not sure if it's caused by Ray Serve itself or by KubeRay. It's somewhat mitigated by this Ray change, but I think we should leave this issue open for tracking. I've classified it as a P2.
@shrekris-anyscale what's the priority and impact of this issue now?
We've made more progress on this issue. #33384 will further reduce any downtime while the worker nodes are down. That change should ensure minimal downtime when this issue happens.
After merging that change, I'd be comfortable marking this issue as a P3, or closing it altogether.