[Bug] KubeRay operator periodically crashes retrieving resource lock
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
The KubeRay operator will periodically (i.e., once every couple days) crash and restart with the following error message:
E0715 05:43:03.622768 1 leaderelection.go:330] error retrieving resource lock default/ray-operator-leader: Get "https://10.96.0.1:443/api/v1/namespaces/default/configmaps/ray-operator-leader": context deadline exceeded
I0715 05:43:03.622848 1 leaderelection.go:283] failed to renew lease default/ray-operator-leader: timed out waiting for the condition
2023-07-15T05:43:03.622Z ERROR setup problem running manager {"error": "leader election lost"}
main.main
/workspace/main.go:159
runtime.main
/usr/local/go/src/runtime/proc.go:255
2023-07-15T05:43:03.623Z INFO Stopping and waiting for non leader election runnables
2023-07-15T05:43:03.623Z DEBUG events Normal {"object": {"kind":"ConfigMap","apiVersion":"v1"}, "reason": "LeaderElection", "message": "kuberay-operator-58d95f6df6-tcvpm_8d4717d9-9b6e-481e-9baf-5f67f1189090 stopped leading"}
2023-07-15T05:43:03.623Z DEBUG events Normal {"object": {"kind":"Lease","namespace":"default","name":"ray-operator-leader","uid":"3d393854-5ef6-4c5d-b8f3-decbeced7d35","apiVersion":"coordination.k8s.io/v1","resourceVersion":"30645234"}, "reason": "LeaderElection", "message": "kuberay-operator-58d95f6df6-tcvpm_8d4717d9-9b6e-481e-9baf-5f67f1189090 stopped leading"}
I am using KubeRay v0.5.0 (I currently run into https://github.com/ray-project/kuberay/issues/1238 with v0.5.2, so I cannot use that version). I have also tried increasing the resource limits and requests for the operator, but the issue persists.
Additionally, the operator restarts break the autoscalers of the deployed RayClusters, locking them at whatever scale they had at the time of the restart. I am not sure whether this should be filed as a separate issue.
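For context on the "failed to renew lease" message: the operator's leader election (client-go's leaderelection.go, driven by controller-runtime) has to renew its lock against the API server within a renew deadline, and the manager exits when that renewal times out. Below is a minimal sketch of where those timing knobs live in a controller-runtime manager; the values and identifiers are illustrative, not KubeRay's actual configuration.

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Illustrative values only; controller-runtime's defaults are
	// LeaseDuration=15s, RenewDeadline=10s, RetryPeriod=2s.
	leaseDuration := 30 * time.Second
	renewDeadline := 20 * time.Second
	retryPeriod := 5 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "ray-operator-leader", // name of the lock seen in the log above
		LeaseDuration:    &leaseDuration,        // how long the lock stays valid without renewal
		RenewDeadline:    &renewDeadline,        // the leader must renew within this window or step down
		RetryPeriod:      &retryPeriod,          // how often lock operations are retried
	})
	if err != nil {
		panic(err)
	}
	// Reconcilers would be registered here, then mgr.Start(ctrl.SetupSignalHandler()).
	_ = mgr
}

Loosening RenewDeadline/LeaseDuration would make transient API-server slowness less likely to cost leadership, at the price of slower failover.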
Reproduction script
I cannot reproduce this error reliably; it only shows up after deploying the KubeRay operator (along with RayClusters) and waiting for it to occur. The command I use to deploy the operator is:
helm install kuberay-operator kuberay/kuberay-operator \
--version 0.5.0 \
--set resources.limits.memory=2Gi \
--set resources.requests.memory=2Gi \
--set resources.limits.cpu=500m \
--set resources.requests.cpu=500m
Anything else
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
The following links may be useful.
- https://support.hashicorp.com/hc/en-us/articles/4404634420755-Why-am-I-seeing-context-deadline-exceeded-errors
- https://stackoverflow.com/questions/75148975/leaderelections-failing-lease-unable-to-be-renewed-automatically
- https://discuss.kubernetes.io/t/kubeadm-init-fails-kube-scheduler-fails-with-error-retrieving-resource-lock-kube-system-kube-scheduler-context-deadline-exceeded-client-timeout-exceeded-while-awaiting-headers/24389/2
Would you mind conducting two experiments:
- Experiment 1: Increase the memory limit/request for the KubeRay operator Pod.
- Experiment 2: Disable leader election via the enable-leader-election flag (see the sketch after this list).
  - https://github.com/ray-project/kuberay/blob/f880b7201660e07b55bdda665ff00777a4c788b4/ray-operator/main.go#L56
  - https://github.com/ray-project/kuberay/blob/f880b7201660e07b55bdda665ff00777a4c788b4/helm-chart/kuberay-operator/templates/deployment.yaml#L39-L55
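For reference, here is a minimal sketch of how an enable-leader-election flag typically toggles leader election in a controller-runtime manager. It illustrates what Experiment 2 disables; it is not a verbatim excerpt of the main.go linked above.

package main

import (
	"flag"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Hypothetical wiring; the real flag definition is in ray-operator/main.go (linked above).
	var enableLeaderElection bool
	flag.BoolVar(&enableLeaderElection, "enable-leader-election", true,
		"Enable leader election for the controller manager.")
	flag.Parse()

	// With LeaderElection set to false the manager never takes or renews a lock,
	// so a slow API server cannot trigger a "leader election lost" exit. The
	// trade-off is that two concurrently running operator Pods could both reconcile.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   enableLeaderElection,
		LeaderElectionID: "ray-operator-leader",
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}

With the Helm chart, the flag would presumably be passed through the container args in the deployment template linked above (e.g. adding --enable-leader-election=false).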
Additionally, would you mind sharing more details about your Kubernetes environment (e.g., EKS or GKE) and your KubeRay setup (for example, whether one operator manages all namespaces in the cluster, and how many RayClusters it controls)?