[Bug] KubeRay operator periodically crashes retrieving resource lock
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
The KubeRay operator will periodically (i.e., once every couple days) crash and restart with the following error message:
E0715 05:43:03.622768 1 leaderelection.go:330] error retrieving resource lock default/ray-operator-leader: Get "https://10.96.0.1:443/api/v1/namespaces/default/configmaps/ray-operator-leader": context deadline exceeded
I0715 05:43:03.622848 1 leaderelection.go:283] failed to renew lease default/ray-operator-leader: timed out waiting for the condition
2023-07-15T05:43:03.622Z ERROR setup problem running manager {"error": "leader election lost"}
main.main
/workspace/main.go:159
runtime.main
/usr/local/go/src/runtime/proc.go:255
2023-07-15T05:43:03.623Z INFO Stopping and waiting for non leader election runnables
2023-07-15T05:43:03.623Z DEBUG events Normal {"object": {"kind":"ConfigMap","apiVersion":"v1"}, "reason": "LeaderElection", "message": "kuberay-operator-58d95f6df6-tcvpm_8d4717d9-9b6e-481e-9baf-5f67f1189090 stopped leading"}
2023-07-15T05:43:03.623Z DEBUG events Normal {"object": {"kind":"Lease","namespace":"default","name":"ray-operator-leader","uid":"3d393854-5ef6-4c5d-b8f3-decbeced7d35","apiVersion":"coordination.k8s.io/v1","resourceVersion":"30645234"}, "reason": "LeaderElection", "message": "kuberay-operator-58d95f6df6-tcvpm_8d4717d9-9b6e-481e-9baf-5f67f1189090 stopped leading"}
I am using KubeRay v0.5.0 (I currently run into https://github.com/ray-project/kuberay/issues/1238 with v0.5.2, so I cannot use that version). I have also tried increasing the resource limits and requests for the operator, but the issue persists.
Additionally, the operator restarts break the autoscalers of the deployed RayClusters, locking them at whatever scale they had at the time of the restart. I am not sure whether this should be filed as a separate issue.
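For context on the "failed to renew lease" message: the operator's leader election (client-go's leaderelection.go, driven by controller-runtime) has to renew its lock against the API server within a renew deadline, and the manager exits when that renewal times out. Below is a minimal sketch of where those timing knobs live in a controller-runtime manager; the values and identifiers are illustrative, not KubeRay's actual configuration.

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Illustrative values only; controller-runtime's defaults are
	// LeaseDuration=15s, RenewDeadline=10s, RetryPeriod=2s.
	leaseDuration := 30 * time.Second
	renewDeadline := 20 * time.Second
	retryPeriod := 5 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "ray-operator-leader", // name of the lock seen in the log above
		LeaseDuration:    &leaseDuration,        // how long the lock stays valid without renewal
		RenewDeadline:    &renewDeadline,        // the leader must renew within this window or step down
		RetryPeriod:      &retryPeriod,          // how often lock operations are retried
	})
	if err != nil {
		panic(err)
	}
	// Reconcilers would be registered here, then mgr.Start(ctrl.SetupSignalHandler()).
	_ = mgr
}

Loosening RenewDeadline/LeaseDuration would make transient API-server slowness less likely to cost leadership, at the price of slower failover.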
Reproduction script
I cannot reproduce this error reliably; it only shows up after deploying the KubeRay operator (along with RayClusters) and waiting for it to occur. The command I use to deploy the operator is:
helm install kuberay-operator kuberay/kuberay-operator \
--version 0.5.0 \
--set resources.limits.memory=2Gi \
--set resources.requests.memory=2Gi \
--set resources.limits.cpu=500m \
--set resources.requests.cpu=500m
Anything else
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
The following links may be useful.
- https://support.hashicorp.com/hc/en-us/articles/4404634420755-Why-am-I-seeing-context-deadline-exceeded-errors
- https://stackoverflow.com/questions/75148975/leaderelections-failing-lease-unable-to-be-renewed-automatically
- https://discuss.kubernetes.io/t/kubeadm-init-fails-kube-scheduler-fails-with-error-retrieving-resource-lock-kube-system-kube-scheduler-context-deadline-exceeded-client-timeout-exceeded-while-awaiting-headers/24389/2
Would you mind conducting two experiments:
- Experiment 1: Increase the memory limit/request for the KubeRay operator Pod.
- Experiment 2: Disable leader election via the enable-leader-election flag (see the sketch after this list).
  - https://github.com/ray-project/kuberay/blob/f880b7201660e07b55bdda665ff00777a4c788b4/ray-operator/main.go#L56
  - https://github.com/ray-project/kuberay/blob/f880b7201660e07b55bdda665ff00777a4c788b4/helm-chart/kuberay-operator/templates/deployment.yaml#L39-L55
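For reference, here is a minimal sketch of how an enable-leader-election flag typically toggles leader election in a controller-runtime manager. It illustrates what Experiment 2 disables; it is not a verbatim excerpt of the main.go linked above.

package main

import (
	"flag"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Hypothetical wiring; the real flag definition is in ray-operator/main.go (linked above).
	var enableLeaderElection bool
	flag.BoolVar(&enableLeaderElection, "enable-leader-election", true,
		"Enable leader election for the controller manager.")
	flag.Parse()

	// With LeaderElection set to false the manager never takes or renews a lock,
	// so a slow API server cannot trigger a "leader election lost" exit. The
	// trade-off is that two concurrently running operator Pods could both reconcile.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   enableLeaderElection,
		LeaderElectionID: "ray-operator-leader",
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}

With the Helm chart, the flag would presumably be passed through the container args in the deployment template linked above (e.g. adding --enable-leader-election=false).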
Additionally, would you mind sharing more details about your Kubernetes environment (e.g., EKS or GKE) and your KubeRay setup (for example, whether one operator manages all namespaces in the cluster, and how many RayClusters it controls)?