[Bug] Multiple head pods break the operator
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
I manually deleted a head pod. Pods before the deletion:
kubectl get pods -o wide -n ray-system
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kuberay-apiserver-676ddd6dfb-ts8h2 1/1 Running 5 20d 10.1.15.83 docker-desktop <none> <none>
kuberay-operator-5d58699bd6-mnpjl 1/1 Running 6 16d 10.1.15.85 docker-desktop <none> <none>
ray-operator-75dbbf8587-xhtdx 1/1 Running 0 6d23h 10.1.15.92 docker-desktop <none> <none>
raycluster-ingress-head-ps7mv 1/1 Running 2 16d 10.1.15.84 docker-desktop <none> <none>
kubectl delete pod raycluster-ingress-head-ps7mv -n ray-system
The operator crashes, and I notice it has created two head pods, which looks like a bug:
kubectl get pods -o wide -n ray-system
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kuberay-apiserver-676ddd6dfb-ts8h2 1/1 Running 5 20d 10.1.15.83 docker-desktop <none> <none>
kuberay-operator-5d58699bd6-mnpjl 0/1 Error 11 16d 10.1.15.85 docker-desktop <none> <none>
ray-operator-75dbbf8587-xhtdx 0/1 Error 5 6d23h 10.1.15.92 docker-desktop <none> <none>
raycluster-ingress-head-ftpmd 1/1 Running 0 3m13s 10.1.15.93 docker-desktop <none> <none>
raycluster-ingress-head-zt59c 1/1 Running 0 3m13s 10.1.15.94 docker-desktop <none> <none>
Reproduction script
Manually delete the head pod of a running RayCluster, as in the commands above.
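In case it helps to script the repro, here is a minimal client-go equivalent of the kubectl delete above. The pod name and namespace are taken from the listing in this issue, and it assumes the default local kubeconfig; this is just a sketch, not part of KubeRay.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	// Equivalent of: kubectl delete pod raycluster-ingress-head-ps7mv -n ray-system
	err = clientset.CoreV1().Pods("ray-system").Delete(context.TODO(),
		"raycluster-ingress-head-ps7mv", metav1.DeleteOptions{})
	if err != nil {
		log.Fatal(err)
	}
}
```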
Anything else
2021-11-25T03:15:16.652Z INFO raycluster-controller reconcilePods {"more than 1 head pod found for cluster": "raycluster-ingress"}
E1125 03:15:16.652657 1 runtime.go:78] Observed a panic: runtime.boundsError{x:1, y:1, signed:true, code:0x0} (runtime error: index out of range [1] with length 1)
goroutine 224 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x157c860, 0xc0000461c8)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x86
panic(0x157c860, 0xc0000461c8)
/usr/local/opt/[email protected]/libexec/src/runtime/panic.go:965 +0x1b9
github.com/ray-project/kuberay/ray-operator/controllers.(*RayClusterReconciler).reconcilePods(0xc000379c00, 0xc000390000, 0x0, 0x0)
/Users/jiaxin/go/src/github.com/ray-project/kuberay/ray-operator/controllers/raycluster_controller.go:204 +0x2ab5
github.com/ray-project/kuberay/ray-operator/controllers.(*RayClusterReconciler).Reconcile(0xc000379c00, 0x18338f8, 0xc00050e270, 0xc0002ff190, 0xa, 0xc0000d3e90, 0x12, 0xc00050e270, 0xc000030000, 0x1539d60, ...)
/Users/jiaxin/go/src/github.com/ray-project/kuberay/ray-operator/controllers/raycluster_controller.go:96 +0x416
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000607400, 0x1833850, 0xc0003aebc0, 0x1509480, 0xc00065a160)
/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000607400, 0x1833850, 0xc0003aebc0, 0x2000000000000)
/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1833850, 0xc0003aebc0)
/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000560750)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000443f50, 0x1806dc0, 0xc00050e180, 0xc0003aeb01, 0xc0004e8f00)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000560750, 0x3b9aca00, 0x0, 0xc00029ac01, 0xc0004e8f00)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1833850, 0xc0003aebc0, 0xc00059e210, 0x3b9aca00, 0x0, 0x1)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1833850, 0xc0003aebc0, 0xc00059e210, 0x3b9aca00)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:195 +0x497
panic: runtime error: index out of range [1] with length 1 [recovered]
panic: runtime error: index out of range [1] with length 1
goroutine 224 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0x109
panic(0x157c860, 0xc0000461c8)
/usr/local/opt/[email protected]/libexec/src/runtime/panic.go:965 +0x1b9
github.com/ray-project/kuberay/ray-operator/controllers.(*RayClusterReconciler).reconcilePods(0xc000379c00, 0xc000390000, 0x0, 0x0)
/Users/jiaxin/go/src/github.com/ray-project/kuberay/ray-operator/controllers/raycluster_controller.go:204 +0x2ab5
github.com/ray-project/kuberay/ray-operator/controllers.(*RayClusterReconciler).Reconcile(0xc000379c00, 0x18338f8, 0xc00050e270, 0xc0002ff190, 0xa, 0xc0000d3e90, 0x12, 0xc00050e270, 0xc000030000, 0x1539d60, ...)
/Users/jiaxin/go/src/github.com/ray-project/kuberay/ray-operator/controllers/raycluster_controller.go:96 +0x416
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000607400, 0x1833850, 0xc0003aebc0, 0x1509480, 0xc00065a160)
/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000607400, 0x1833850, 0xc0003aebc0, 0x2000000000000)
/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1833850, 0xc0003aebc0)
/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000560750)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000443f50, 0x1806dc0, 0xc00050e180, 0xc0003aeb01, 0xc0004e8f00)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000560750, 0x3b9aca00, 0x0, 0xc00029ac01, 0xc0004e8f00)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1833850, 0xc0003aebc0, 0xc00059e210, 0x3b9aca00, 0x0, 0x1)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1833850, 0xc0003aebc0, 0xc00059e210, 0x3b9aca00)
/Users/jiaxin/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/Users/jiaxin/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:195 +0x497
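From the trace, the panic happens in reconcilePods (raycluster_controller.go:204) immediately after the "more than 1 head pod found" log, so the reconciler apparently indexes the filtered head-pod list at a position that does not exist. Below is a minimal sketch of a bounds-safe way to handle surplus head pods; the function name, the PodList argument, and the controller-runtime client usage are my assumptions, not the actual KubeRay code.

```go
package controllers

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteSurplusHeadPods keeps the first head pod and deletes the rest.
// Hypothetical sketch, not the actual reconcilePods logic; it only
// illustrates bounds-safe handling of the "more than 1 head pod" case
// that panics with "index out of range" in the trace above.
func deleteSurplusHeadPods(ctx context.Context, c client.Client, headPods *corev1.PodList) error {
	if headPods == nil || len(headPods.Items) <= 1 {
		return nil // zero or one head pod: nothing to clean up
	}
	fmt.Printf("found %d head pods, deleting %d extras\n",
		len(headPods.Items), len(headPods.Items)-1)
	// Iterate by index and check bounds on every access instead of
	// assuming a fixed second element exists.
	for i := 1; i < len(headPods.Items); i++ {
		pod := headPods.Items[i]
		if err := c.Delete(ctx, &pod); err != nil {
			return err
		}
	}
	return nil
}
```

Whatever the real fix turns out to be, guarding every index against len(headPods.Items) and letting the next reconcile pick up the single remaining head pod would avoid the crash loop shown above.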
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
I think the current head pod reconcile logic is correct. We should have a test verifying that this issue does not recur and that the head pod is correctly reconciled back into existence after a crash. Perhaps the GCS HA tests should cover head pod deletion. cc @iycheng @brucez-anyscale
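To make that concrete, a regression test could delete the head pod and then assert that the operator converges back to exactly one. The sketch below is hypothetical and not an existing KubeRay test; the namespace, the label selector (ray.io/cluster, ray.io/node-type=head), the helper name, and the timeouts are all assumptions to adjust to whatever the operator actually sets.

```go
package e2e

import (
	"context"
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// testHeadPodRecreatedOnce deletes the head pod and waits for the
// operator to reconcile back to exactly one head pod.
func testHeadPodRecreatedOnce(t *testing.T, clientset kubernetes.Interface) {
	ctx := context.TODO()
	ns := "ray-system"
	selector := "ray.io/cluster=raycluster-ingress,ray.io/node-type=head" // assumed labels

	pods, err := clientset.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil || len(pods.Items) != 1 {
		t.Fatalf("expected exactly one head pod before deletion, got %d (err=%v)", len(pods.Items), err)
	}

	// Delete the head pod, as in the reproduction above.
	if err := clientset.CoreV1().Pods(ns).Delete(ctx, pods.Items[0].Name, metav1.DeleteOptions{}); err != nil {
		t.Fatalf("delete head pod: %v", err)
	}

	// Poll until exactly one running head pod exists again.
	err = wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		pods, err := clientset.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		return len(pods.Items) == 1 && pods.Items[0].DeletionTimestamp == nil, nil
	})
	if err != nil {
		t.Fatalf("head pod was not reconciled back to exactly one: %v", err)
	}
}
```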
@wilsonwang371
This is stale.