[Bug] [raycluster-controller] Kuberay cannot recreate new raycluster header pod when it has been evicted by kubelet as disk pressure
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
I have deployed the kuberay-operator and a RayCluster in my k8s cluster, and it works fine most of the time. When I use the fallocate command to force the k8s node where the RayCluster head pod runs into disk-pressure status, the head pod is evicted by the kubelet. No new head pod is created on another healthy k8s node, even after a long time. And when I delete the large file created by fallocate and relieve the node's disk pressure, the head pod remains evicted and still no new head pod is created. So I have to manually delete the evicted head pod to make the cluster work again; obviously this makes our production environment unstable and the service is not highly available. My expected behavior is that the RayCluster controller works like a Deployment: when the head pod is evicted, it automatically recreates a new head pod.
Reproduction script
Basic version information
- Kubernetes: v1.20.15
- ray-operator: v1.0.0
- raycluster (ray): 2.9.0
Reproduction steps
- deploy the kuberay-operator and a RayCluster in the k8s cluster;
- find the k8s node where the RayCluster head pod runs, and use a fallocate command like
fallocate -l 1000G tmpfile
to fill the imagefs; the head pod will then be evicted by the kubelet due to disk pressure;
- wait for a while; no new head pod is created on another k8s node;
- release the imagefs by deleting the tmpfile and wait for a while; the head pod remains evicted and the RayCluster is still unavailable.
Anything else
This will happen every time when we make the raycluster head evicted.
We can find the relative log of the ray-operator pod, such as:
2024-05-08T07:55:30.304Z INFO controllers.RayCluster reconcilePods {"Found 1 head Pod": "ray-cluster-head-dqtxb", "Pod status": "Failed", "Pod restart policy": "Always", "Ray container terminated status": "nil"}
2024-05-08T07:55:30.304Z INFO controllers.RayCluster reconcilePods {"head Pod": "ray-cluster-head-dqtxb", "shouldDelete": false, "reason": "The status of the head Pod ray-cluster-head-dqtxb is Failed. However, KubeRay will not delete the Pod because its restartPolicy is set to 'Always' and it should be able to restart automatically."}
So I read the relevant code in the RayCluster controller and found that it relies on the Pod's restartPolicy to restart the Pod when its status is Failed. But in this case the pod that has been evicted by the kubelet will never restart, so the RayCluster does not recover after the eviction.
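For context, here is a simplified sketch of the guard described above, reconstructed from the operator log message; it is not the verbatim KubeRay source:

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldDeletePodSketch condenses the guard described above. A Pod in a
// terminal phase with restartPolicy Always is left alone, which is
// exactly what goes wrong after a kubelet eviction.
func shouldDeletePodSketch(pod corev1.Pod, nodeType string) (bool, string) {
	terminal := pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded
	if !terminal {
		return false, ""
	}
	if pod.Spec.RestartPolicy == corev1.RestartPolicyAlways {
		// Problematic branch: an evicted Pod is marked Failed and will
		// never restart in place, so the controller waits forever.
		return false, fmt.Sprintf(
			"The status of the %s Pod %s is %s. However, KubeRay will not delete the Pod because "+
				"its restartPolicy is set to 'Always' and it should be able to restart automatically.",
			nodeType, pod.Name, pod.Status.Phase)
	}
	return true, fmt.Sprintf("The %s Pod %s is %s; delete and recreate it.",
		nodeType, pod.Name, pod.Status.Phase)
}
```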
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Well, at least the log message clearly expresses the authors' misunderstanding of Kubernetes eviction behavior (which to be fair is extremely confusing).
@xjhust
the pod that has been evicted by the kubelet will never restart
Pods evicted from a K8s node are normally re-scheduled. If not, you need to check whether your pod has some kind of affinity set, or something else that prevents it from being re-scheduled.
@xjhust
the pod that has been evicted by the kubelet will never restart
Pods evicted from a K8s node are normally re-scheduled.
That is not true. A pod evicted due to disk pressure is marked failed and will never be rescheduled. If the pod is part of a replicaset, then a new pod will be created to replace it. This issue is asking for the KubeRay operator to do the same thing as the replicaset controller.
In the case where a Node is down for long enough and Kubernetes evicts the pod (i.e. deletes it), there should be some logic in the KubeRay operator to recreate the head pod. As Dmitri already said, in the case of disk pressure, the Pod object is never deleted, so the recreate logic never kicks in.
However, per the Stack Overflow discussion shared earlier, pods evicted due to disk pressure usually have a Failed status. And there is actually logic in the RayCluster controller to delete / recreate Pods in Failed status; see shouldDeletePod. But KubeRay's recreate logic does not apply if the head pod has a restartPolicy of Always; see the comments in shouldDeletePod for the reason.
@xjhust it seems like if you set the restart policy of the head pod to OnFailure or Never, then you might be able to trigger the head recreate logic on disk pressure eviction.
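For anyone trying that workaround, a minimal sketch using the Go API types (equivalent to setting spec.headGroupSpec.template.spec.restartPolicy in the RayCluster YAML; the helper name and import path are assumptions based on the ray-operator module layout):

```go
import (
	corev1 "k8s.io/api/core/v1"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// setHeadRestartPolicy is a hypothetical helper: it switches the head
// group's restartPolicy from the default Always to OnFailure, so that a
// disk-pressure eviction (which leaves the Pod in the Failed phase)
// triggers shouldDeletePod's delete-and-recreate path.
func setHeadRestartPolicy(cluster *rayv1.RayCluster) {
	cluster.Spec.HeadGroupSpec.Template.Spec.RestartPolicy = corev1.RestartPolicyOnFailure
}
```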
This issue is asking for the KubeRay operator to do the same thing as the replicaset controller.
I suspect that if we ran the head pod as a single-replica StatefulSet, we may be able to automatically recover from some of these scenarios. @kevin85421 have we ever considered this before?
This issue is asking for the KubeRay operator to do the same thing as the replicaset controller.
I suspect that if we ran the head pod as a single-replica StatefulSet, we may be able to automatically recover from some of these scenarios. @kevin85421 have we ever considered this before?
This might not be the right move, since some users will expect the head pod to exit on failure. (You wouldn't be able to support restart policy Never.)
My suggestion would be to modify the operator code to create a new pod if the restart policy is Always and the pod enters Failed state.
My suggestion would be to modify the operator code to create a new pod if the restart policy is Always and the pod enters Failed state.
The comments here imply that pods with restartPolicy Always never reach the Failed state, but maybe we never tested this with disk pressure eviction. Is someone able to verify this before we change this behavior?
My suggestion would be to modify the operator code to create a new pod if the restart policy is Always and the pod enters Failed state.
The comments here imply that pods with restartPolicy Always never reach the Failed state, but maybe we never tested this with disk pressure eviction. Is someone able to verify this before we change this behavior?
I'm pretty sure the comment is incorrect. I agree that someone should confirm -- try inducing a disk pressure eviction on a pod with restart policy Always, as described in the issue description.
@xjhust can you help confirm this?
since some users will expect the head pod to exit on failure.
@DmitriGekhtman: currently, if the head Pod's restartPolicy is Never and the Pod becomes Failed, KubeRay will delete the head Pod and create a new one. Is the recreate behavior OK for the "exit on failure" you mentioned above?
@xjhust can you help confirm this?
Note the issue description includes a suggestion for reproduction (run the fallocate command). If you have easy access to a Kubernetes cluster, it should be straightforward to reproduce.
I've observed this particular behavior (pod with restart policy always failing due to node pressure conditions) in production systems -- usually due to a large core dump after a segfault.
I would feel okay with removing the isRestartPolicyAlways check. It seems like in most normal conditions the Pod phase never flips to Failed, so the deletion would never happen, but in the uncommon cases like disk pressure it would trigger the delete & recreate, which can't be worse than the current state.
Try to summarize the current discussion:
- Currently, if a Pod's restartPolicy is Always, KubeRay will not delete the Pod. Instead, KubeRay waits for the Pod to restart by itself.
- However, if a Pod is evicted due to disk pressure, the Pod's status will transition to Failed and the Pod will not be relocated to another K8s node, even if its restartPolicy is Always.
- Our solution: if the Pod's status is Failed or Succeeded, KubeRay should delete the Pod and create a new one, no matter the value of restartPolicy.

Is my understanding correct? cc @DmitriGekhtman @andrewsykim
Yes, that's my understanding too, but I personally have not tested the pod status behavior during disk eviction; it'd be good to verify this behavior. Other than that, it seems like an improvement to handle the Failed status even if restartPolicy is Always, since some users will expect the head pod to exit on failure.
@DmitriGekhtman: currently, if the head Pod's restartPolicy is Never and the Pod becomes Failed, KubeRay will delete the head Pod and create a new one. Is the recreate behavior OK for the "exit on failure" you mentioned above?
Thanks for clarifying this -- I misunderstood the current behavior of the Never restart policy. It sounds good to keep the current behavior in the Never restart policy case.
Our solution: if the Pod's status is Failed or Succeeded, KubeRay should delete the Pod and create a new one, no matter the value of restartPolicy.
Is my understanding correct? cc @DmitriGekhtman @andrewsykim
That sounds like a good idea.
Context on why pods are marked failed in node pressure situations: the kubelet needs to protect the node from the faulty pod's resource usage.
The Kubernetes docs do not directly address the case of restart policy always. But the behavior can be inferred from this section: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#self-healing-behavior
"If the pods are managed by a workload management object (such as StatefulSet or Deployment) that replaces failed pods, the control plane (kube-controller-manager) creates new pods in place of the evicted pods."
You can infer that if a controller does not replace the failed pod, there will be no replacement.
If it's possible to use a 1-pod replicaset with a deterministic name for the Ray head, it would also resolve https://github.com/ray-project/kuberay/issues/715, for the Ray head.
1-pod replicaset
What's the reason for a 1 pod replicaset instead of a statefulset?
1-pod replicaset
What's the reason for a 1 pod replicaset instead of a statefulset?
Likely either would work -- we could discuss the relative merits. Main reason I'd lean towards a replicaset is that it is simpler. All we need is the ability to reliably keep exactly 1 pod running, and a replicaset is the simplest primitive that allows this. Basically, I think the capabilities and limitations of statefulsets are extraneous to KubeRay requirements; but I could be wrong!
Likely either would work -- we could discuss the relative merits. Main reason I'd lean towards a replicaset is that it is simpler. All we need is the ability to reliably keep exactly 1 pod running, and a replicaset is the simplest primitive that allows this. Basically, I think the capabilities and limitations of statefulsets are extraneous to KubeRay requirements; but I could be wrong!
I think a 1-pod ReplicaSet or Deployment for the head and a StatefulSet for the workers would be OK. But if we use them to reconcile the head node or worker nodes, the reconcilePods function will need to change a lot and will need testing.
Try to summarize the current discussion:
- Currently, if a Pod's restartPolicy is Always, KubeRay will not delete the Pod. Instead, KubeRay waits for the Pod to restart by itself.
- However, if a Pod is evicted due to disk pressure, the Pod's status will transition to Failed and the Pod will not be relocated to another K8s node, even if its restartPolicy is Always.
- Our solution: if the Pod's status is Failed or Succeeded, KubeRay should delete the Pod and create a new one, no matter the value of restartPolicy.

Is my understanding correct? cc @DmitriGekhtman @andrewsykim
If we ignore the container restart policy, I feel there will be a scenario where the container is already restarting, or has even recovered, but we delete the Pod based on its previous Failed state; this could cause a temporary disruption to the Ray head node.
Is it possible to make a simple judgment about the eviction scenario?
isPodEvicted := pod.Status.Reason == "Evicted"
if isRestartPolicyAlways && !isPodEvicted {
reason := fmt.Sprintf(
"The status of the %s Pod %s is %s. However, KubeRay will not delete the Pod because its restartPolicy is set to 'Always' "+
"and it should be able to restart automatically.", nodeType, pod.Name, pod.Status.Phase)
return false, reason
}
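For comparison, the simpler fix discussed above (dropping the isRestartPolicyAlways guard so that any terminal-phase Pod is deleted and recreated) would look roughly like this; a sketch of the proposal, not the merged patch:

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldDeleteTerminalPod sketches the alternative discussed above: any
// Pod in a terminal phase (Failed or Succeeded) is deleted and
// recreated, regardless of restartPolicy, because a terminal Pod can
// never restart in place.
func shouldDeleteTerminalPod(pod corev1.Pod, nodeType string) (bool, string) {
	if pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded {
		return true, fmt.Sprintf(
			"The %s Pod %s is in terminal phase %s and cannot restart in place; delete it and create a new Pod.",
			nodeType, pod.Name, pod.Status.Phase)
	}
	return false, ""
}
```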
a StatefulSet for the workers would be OK
StatefulSet for workers wouldn't work for the autoscaling use-case. When scaling down, we need to specify the specific worker that's being removed. So, we need to manage individual pods for workers.
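As a minimal illustration of that constraint (client-go, with hypothetical names): the autoscaler must be able to delete one chosen worker Pod by name, which a StatefulSet's ordinal-based scale-down cannot express.

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleDownWorker removes one specific worker Pod by name. A sketch
// with hypothetical names; it shows why KubeRay manages worker Pods
// directly instead of delegating to a StatefulSet, whose scale-down
// always removes the highest ordinal rather than a chosen Pod.
func scaleDownWorker(ctx context.Context, client kubernetes.Interface, namespace, workerPodName string) error {
	return client.CoreV1().Pods(namespace).Delete(ctx, workerPodName, metav1.DeleteOptions{})
}
```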
Is it possible to make a simple judgment about the eviction scenario?
Yeah, it'd be best for now to make the simplest fix to the code that's causing the issue :)
If we ignore the container restart policy, I feel there will be a scenario where the container is already restarting, or has even recovered
If a Pod is Succeeded or Failed, the Pod will not restart based on this doc. In addition, the head Pod of a RayCluster should always be running if the CR's suspend is not true.
If a Pod is Succeeded or Failed, the Pod will not restart based on this doc.
Pods that have an auto-restart policy set may undergo a transition between the Failed and Running states, but usually return to the Running state quickly
Got it. It makes sense. Thanks!
@kevin85421 Can I work on this issue? After some trial and error, I was finally able to reproduce the eviction behavior on my own computer (I let the head pod be evicted due to low memory on the node). Here are the steps:
# Create k3d cluster with 2 agent nodes, each with 3GB memory and hard eviction limit 1GiB
k3d cluster create \
--agents 2 \
--k3s-arg "--disable=traefik@server:0" \
--agents-memory 3g \
--k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:0" \
--k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:1"
kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY_MEMORY:.status.capacity.memory,ALLOCATABLE_MEMORY:.status.allocatable.memory
# Output:
#
# NAME                       CAPACITY_MEMORY   ALLOCATABLE_MEMORY
# k3d-k3s-default-agent-1    3221225Ki         2172649Ki
# k3d-k3s-default-agent-0    3221225Ki         2172649Ki
# k3d-k3s-default-server-0   32590664Ki        32590664Ki
# Taint agent 0 and 1 so that pods will only be deployed to server-0
kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule
# Install Kuberay operator
helm install kuberay-operator kuberay/kuberay-operator --namespace ray-system --version 1.1.1 --create-namespace
# Taint server 0 and untaint agent 0 and 1 so that pods will only be deployed to agent 0 or 1
kubectl taint nodes k3d-k3s-default-server-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule-
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule-
# Install RayCluster
# Note that the head and worker pods will be deployed to different nodes
# because the memory resource request for head pod is 2G
# and the memory resource request for worker pod is 1G
# and agents 0 and 1 each have only ~2Gi of allocatable memory
helm install raycluster kuberay/ray-cluster --version 1.1.1
# Copy statically linked "stress-ng" binary into head pod
kubectl cp ./stress-ng <head-pod>:/home/ray
# Open a shell on the head pod
kubectl exec -it <head-pod> -- bash
# Simulate memory stress
./stress-ng --vm 4 --vm-bytes 2G --vm-keep
Result:
- The head pod is evicted.
- No new head pods are created even if the memory on agent-0 is released and becomes available again. (See figure 2)
kubectl describe pod <head-pod>
Name: raycluster-kuberay-head-ldg9f
Namespace: default
Priority: 0
Service Account: default
Node: k3d-k3s-default-agent-0/192.168.128.4
Start Time: Tue, 02 Jul 2024 22:16:22 +0800
Labels: app.kubernetes.io/created-by=kuberay-operator
app.kubernetes.io/instance=raycluster
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=kuberay
helm.sh/chart=ray-cluster-1.1.1
ray.io/cluster=raycluster-kuberay
ray.io/group=headgroup
ray.io/identifier=raycluster-kuberay-head
ray.io/is-ray-node=yes
ray.io/node-type=head
Annotations: ray.io/ft-enabled: false
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Threshold quantity: 1Gi, available: 991045Ki.
IP: 10.42.0.2
IPs:
IP: 10.42.0.2
Controlled By: RayCluster/raycluster-kuberay
Containers:
ray-head:
Container ID:
Image: rayproject/ray:2.9.0
Image ID:
Port: 8080/TCP
Host Port: 0/TCP
Command:
/bin/bash
-lc
--
Args:
ulimit -n 65536; ray start --head --dashboard-host=0.0.0.0 --metrics-export-port=8080 --block --dashboard-agent-listen-port=52365 --num-cpus=1 --memory=2000000000
State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was terminated
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Last State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was deleted. The container used to be Running
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 1
Limits:
cpu: 1
memory: 2G
Requests:
cpu: 1
memory: 2G
Liveness: exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=30s timeout=1s period=5s #success=1 #failure=120
Readiness: exec [bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success] delay=10s timeout=1s period=5s #success=1 #failure=10
Environment:
RAY_CLUSTER_NAME: (v1:metadata.labels['ray.io/cluster'])
RAY_CLOUD_INSTANCE_ID: raycluster-kuberay-head-ldg9f (v1:metadata.name)
RAY_NODE_TYPE_NAME: (v1:metadata.labels['ray.io/group'])
KUBERAY_GEN_RAY_START_CMD: ray start --head --dashboard-host=0.0.0.0 --metrics-export-port=8080 --block --dashboard-agent-listen-port=52365 --num-cpus=1 --memory=2000000000
RAY_PORT: 6379
RAY_ADDRESS: 127.0.0.1:6379
RAY_USAGE_STATS_KUBERAY_IN_USE: 1
RAY_USAGE_STATS_EXTRA_TAGS: kuberay_version=v1.1.1;kuberay_crd=RayCluster
REDIS_PASSWORD:
RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE: 1
Mounts:
/dev/shm from shared-mem (rw)
/tmp/ray from log-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vhmqp (ro)
Conditions:
Type Status
DisruptionTarget True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
log-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
shared-mem:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 2G
kube-api-access-vhmqp:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m default-scheduler Successfully assigned default/raycluster-kuberay-head-ldg9f to k3d-k3s-default-agent-0
Normal Pulling 15m kubelet Pulling image "rayproject/ray:2.9.0"
Normal Pulled 14m kubelet Successfully pulled image "rayproject/ray:2.9.0" in 46.191933304s (46.1919395s including waiting)
Normal Created 14m kubelet Created container ray-head
Normal Started 14m kubelet Started container ray-head
Warning Evicted 9m22s kubelet The node was low on resource: memory. Threshold quantity: 1Gi, available: 991045Ki.
Normal Killing 9m22s kubelet Stopping container ray-head
Warning Unhealthy 9m20s kubelet Liveness probe failed: command "bash -c wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep success" timed out
Warning ExceededGracePeriod 9m12s kubelet Container runtime did not kill the pod within specified grace period.
I also tried kubectl drain, but the head pod was recreated on another node. This may be the API-initiated eviction from the Stack Overflow post that @DmitriGekhtman mentioned.
@MortalHappiness what's the restart policy in your head pod? (it wasn't in the describe output)
If it's Always and you validated the pod state as Failed during disk pressure eviction, then I think this comment is not true and it can be updated.
@andrewsykim I used the default helm chart without specifying any values, so the restartPolicy is Always.
Closed by #2217