
[Bug] Multiple head breaks operator

Open Jeffwan opened this issue 3 years ago • 2 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I manually deleted a head pod:

kubectl get pods -o wide -n ray-system
NAME                                 READY   STATUS    RESTARTS   AGE     IP           NODE             NOMINATED NODE   READINESS GATES
kuberay-apiserver-676ddd6dfb-ts8h2   1/1     Running   5          20d     10.1.15.83   docker-desktop   <none>           <none>
kuberay-operator-5d58699bd6-mnpjl    1/1     Running   6          16d     10.1.15.85   docker-desktop   <none>           <none>
ray-operator-75dbbf8587-xhtdx        1/1     Running   0          6d23h   10.1.15.92   docker-desktop   <none>           <none>
raycluster-ingress-head-ps7mv        1/1     Running   2          16d     10.1.15.84   docker-desktop   <none>           <none>
kubectl delete pod raycluster-ingress-head-ps7mv -n ray-system

The operator crashes, and I notice that it creates two head pods, which looks like a bug.

kubectl get pods -o wide -n ray-system
NAME                                 READY   STATUS    RESTARTS   AGE     IP           NODE             NOMINATED NODE   READINESS GATES
kuberay-apiserver-676ddd6dfb-ts8h2   1/1     Running   5          20d     10.1.15.83   docker-desktop   <none>           <none>
kuberay-operator-5d58699bd6-mnpjl    0/1     Error     11         16d     10.1.15.85   docker-desktop   <none>           <none>
ray-operator-75dbbf8587-xhtdx        0/1     Error     5          6d23h   10.1.15.92   docker-desktop   <none>           <none>
raycluster-ingress-head-ftpmd        1/1     Running   0          3m13s   10.1.15.93   docker-desktop   <none>           <none>
raycluster-ingress-head-zt59c        1/1     Running   0          3m13s   10.1.15.94   docker-desktop   <none>           <none>

Reproduction script

Manually delete a head pod (see the kubectl delete command above), then list the pods again.

Anything else

2021-11-25T03:15:16.652Z	INFO	raycluster-controller	reconcilePods 	{"more than 1 head pod found for cluster": "raycluster-ingress"}
E1125 03:15:16.652657       1 runtime.go:78] Observed a panic: runtime.boundsError{x:1, y:1, signed:true, code:0x0} (runtime error: index out of range [1] with length 1)
goroutine 224 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x157c860, 0xc0000461c8)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x86
panic(0x157c860, 0xc0000461c8)
	/usr/local/opt/[email protected]/libexec/src/runtime/panic.go:965 +0x1b9
github.com/ray-project/kuberay/ray-operator/controllers.(*RayClusterReconciler).reconcilePods(0xc000379c00, 0xc000390000, 0x0, 0x0)
	/Users/jiaxin/go/src/github.com/ray-project/kuberay/ray-operator/controllers/raycluster_controller.go:204 +0x2ab5
github.com/ray-project/kuberay/ray-operator/controllers.(*RayClusterReconciler).Reconcile(0xc000379c00, 0x18338f8, 0xc00050e270, 0xc0002ff190, 0xa, 0xc0000d3e90, 0x12, 0xc00050e270, 0xc000030000, 0x1539d60, ...)
	/Users/jiaxin/go/src/github.com/ray-project/kuberay/ray-operator/controllers/raycluster_controller.go:96 +0x416
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000607400, 0x1833850, 0xc0003aebc0, 0x1509480, 0xc00065a160)
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000607400, 0x1833850, 0xc0003aebc0, 0x2000000000000)
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1833850, 0xc0003aebc0)
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000560750)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000443f50, 0x1806dc0, 0xc00050e180, 0xc0003aeb01, 0xc0004e8f00)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000560750, 0x3b9aca00, 0x0, 0xc00029ac01, 0xc0004e8f00)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1833850, 0xc0003aebc0, 0xc00059e210, 0x3b9aca00, 0x0, 0x1)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1833850, 0xc0003aebc0, 0xc00059e210, 0x3b9aca00)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:195 +0x497
panic: runtime error: index out of range [1] with length 1 [recovered]
	panic: runtime error: index out of range [1] with length 1

goroutine 224 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0x109
panic(0x157c860, 0xc0000461c8)
	/usr/local/opt/[email protected]/libexec/src/runtime/panic.go:965 +0x1b9
github.com/ray-project/kuberay/ray-operator/controllers.(*RayClusterReconciler).reconcilePods(0xc000379c00, 0xc000390000, 0x0, 0x0)
	/Users/jiaxin/go/src/github.com/ray-project/kuberay/ray-operator/controllers/raycluster_controller.go:204 +0x2ab5
github.com/ray-project/kuberay/ray-operator/controllers.(*RayClusterReconciler).Reconcile(0xc000379c00, 0x18338f8, 0xc00050e270, 0xc0002ff190, 0xa, 0xc0000d3e90, 0x12, 0xc00050e270, 0xc000030000, 0x1539d60, ...)
	/Users/jiaxin/go/src/github.com/ray-project/kuberay/ray-operator/controllers/raycluster_controller.go:96 +0x416
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000607400, 0x1833850, 0xc0003aebc0, 0x1509480, 0xc00065a160)
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000607400, 0x1833850, 0xc0003aebc0, 0x2000000000000)
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1833850, 0xc0003aebc0)
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000560750)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000443f50, 0x1806dc0, 0xc00050e180, 0xc0003aeb01, 0xc0004e8f00)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000560750, 0x3b9aca00, 0x0, 0xc00029ac01, 0xc0004e8f00)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1833850, 0xc0003aebc0, 0xc00059e210, 0x3b9aca00, 0x0, 0x1)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1833850, 0xc0003aebc0, 0xc00059e210, 0x3b9aca00)
	/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:195 +0x497

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

Jeffwan avatar Nov 25 '21 03:11 Jeffwan

I think the current head pod reconcile logic is correct. We should have a test confirming that this issue doesn't occur and that the head pod is recreated correctly after a crash. Perhaps the GCS HA tests should cover head pod deletion. cc @iycheng @brucez-anyscale
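As a rough illustration of the invariant such a test could assert (exactly one head pod after a deletion, and no extra head after repeated reconciles), here is a minimal in-memory sketch. `fakeCluster`, `reconcileHead`, and `deleteHead` are hypothetical stand-ins, not KubeRay APIs; a real test would run against the operator and a test cluster or fake client.

```go
package main

import "fmt"

// fakeCluster is a hypothetical in-memory model of the head pods of one RayCluster.
type fakeCluster struct {
	heads   []string
	created int
}

// reconcileHead is an assumed, simplified reconcile step: create a head pod
// when none exists, and trim any extras down to one.
func (c *fakeCluster) reconcileHead() {
	switch {
	case len(c.heads) == 0:
		c.created++
		c.heads = append(c.heads, fmt.Sprintf("head-%d", c.created))
	case len(c.heads) > 1:
		c.heads = c.heads[:1]
	}
}

// deleteHead simulates a manual deletion of the named head pod.
func (c *fakeCluster) deleteHead(name string) {
	kept := c.heads[:0]
	for _, h := range c.heads {
		if h != name {
			kept = append(kept, h)
		}
	}
	c.heads = kept
}

func main() {
	c := &fakeCluster{}
	c.reconcileHead() // initial head pod
	c.deleteHead(c.heads[0])

	// Reconcile twice: the second pass must not create another head.
	c.reconcileHead()
	c.reconcileHead()

	fmt.Println("head pods after reconcile:", len(c.heads), c.heads)
}
```

The double reconcile matters because the reported symptom was two heads appearing after a single deletion, which suggests checking idempotency and not just recreation.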

DmitriGekhtman avatar Jul 13 '22 05:07 DmitriGekhtman

@wilsonwang371

brucez-anyscale avatar Jul 14 '22 05:07 brucez-anyscale

This is stale.

DmitriGekhtman avatar Dec 09 '22 16:12 DmitriGekhtman