If the head node dies, the cluster is never restored [Bug]

Open blublinsky opened this issue 2 years ago • 3 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

After a Ray cluster is created and started, manually kill the head node. As expected, the operator restores the head node. The issue is that the worker nodes never reconnect to the restored head node, so you end up with a single-node cluster that contains only the head node.

As far as I can see, the only ways to fix this are to restart all the worker nodes after the head node has been restarted, or to restart the cluster itself.
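The first workaround (restarting all worker nodes) can be sketched as a small helper that deletes the worker pods so the operator recreates them. This is only a sketch under assumptions: it assumes `kubectl` is configured for the cluster, that pods carry the `ray.io/cluster` and `ray.io/node-type` labels, and the cluster name `raycluster-sample` is a placeholder. By default it only prints the command; pass `--run` to execute it.

```python
import subprocess
import sys


def restart_workers_cmd(cluster: str, namespace: str = "default") -> list:
    """Build a kubectl command that deletes every worker pod of one Ray cluster.

    Assumption: worker pods are labeled ray.io/cluster=<name> and
    ray.io/node-type=worker. Adjust the selector if your labels differ.
    """
    selector = f"ray.io/cluster={cluster},ray.io/node-type=worker"
    return ["kubectl", "delete", "pod", "-n", namespace, "-l", selector]


if __name__ == "__main__":
    # "raycluster-sample" is a hypothetical cluster name used for illustration.
    cmd = restart_workers_cmd("raycluster-sample")
    if "--run" in sys.argv:
        # Deleting the pods lets the operator recreate them, so the new
        # workers start against the restored head node.
        subprocess.run(cmd, check=True)
    else:
        # Dry run: show the command without touching the cluster.
        print(" ".join(cmd))
```

Deleting by label rather than by pod name avoids having to look up the generated worker pod names first.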

Reproduction script

Manually kill the head node pod.
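A minimal sketch of that reproduction step, for anyone who wants to script it. It is not part of the original report: it assumes `kubectl` access, that the head pod carries the `ray.io/cluster` and `ray.io/node-type=head` labels, and it uses a placeholder cluster name `raycluster-sample`. Without `--run` it only prints the commands.

```python
import subprocess
import sys


def kill_head_cmd(cluster: str, namespace: str = "default") -> list:
    """Build a kubectl command that deletes the head pod of one Ray cluster.

    Assumption: the head pod is labeled ray.io/cluster=<name> and
    ray.io/node-type=head.
    """
    selector = f"ray.io/cluster={cluster},ray.io/node-type=head"
    return ["kubectl", "delete", "pod", "-n", namespace, "-l", selector]


def list_pods_cmd(cluster: str, namespace: str = "default") -> list:
    """Build a kubectl command that lists all pods of the cluster."""
    return ["kubectl", "get", "pods", "-n", namespace, "-l", f"ray.io/cluster={cluster}"]


if __name__ == "__main__":
    # "raycluster-sample" is a hypothetical cluster name used for illustration.
    steps = [kill_head_cmd("raycluster-sample"), list_pods_cmd("raycluster-sample")]
    for cmd in steps:
        if "--run" in sys.argv:
            subprocess.run(cmd, check=True)
        else:
            # Dry run: print the commands instead of executing them.
            print(" ".join(cmd))
```

Listing the pods only shows that the head pod was recreated. Whether the workers rejoined has to be checked from Ray itself, for example with `ray status` on the new head pod.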

Anything else

This happens every time the head node pod is deleted.

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

blublinsky, Jun 05 '23 14:06