kuberay
If the head node dies, the cluster is never restored [Bug]
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
After a Ray cluster is created and running, manually kill the head node. As expected, the operator restores the head node. The issue is that the worker nodes never reconnect to the new head node, so you end up with a single-node cluster that contains only the head node.
As far as I can see, the only ways to fix this are to restart all the worker nodes after the head node has restarted, or to restart the whole cluster.
Reproduction script
Just manually kill a head node pod
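A minimal sketch of that step in Python, for anyone who wants to script it. It assumes `kubectl` is on the PATH and that the head pod carries the labels `ray.io/cluster=<cluster name>` and `ray.io/node-type=head`; check the labels on your own pods first. The function names and the cluster name in the usage note are hypothetical.

```python
import subprocess


def head_pod_delete_cmd(cluster_name, namespace="default"):
    """Build the kubectl command that deletes the head pod of one Ray cluster."""
    # Assumption: the head pod is labeled with ray.io/cluster and ray.io/node-type.
    selector = f"ray.io/cluster={cluster_name},ray.io/node-type=head"
    return ["kubectl", "delete", "pod", "-n", namespace, "-l", selector]


def kill_head_pod(cluster_name, namespace="default"):
    """Run the delete command; raises CalledProcessError if kubectl fails."""
    subprocess.run(head_pod_delete_cmd(cluster_name, namespace), check=True)
```

Calling `kill_head_pod("raycluster-sample")` deletes the head pod; after the operator recreates it, check whether the workers rejoin the cluster.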
Anything else
This happens every time the head node pod is deleted.
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Hi @blublinsky, which version of Ray did you use? Thanks!
Ray 2.1 and 2.4, but the version should not matter. Workers reconnect only if the new head node gets the same IP address, which can't be guaranteed.
This sounds like a Ray bug. Typically, a worker will kill itself if it cannot connect to the GCS for 60 seconds (by default). You can refer to https://github.com/ray-project/kuberay/pull/1036 for more details.
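To make the expected behavior concrete, here is a rough Python model of the timeout described above. It is an illustration only, not Ray's actual implementation: `run_worker`, `check_gcs`, and the polling interval are hypothetical, and only the 60-second default comes from this thread.

```python
import time

GCS_RECONNECT_TIMEOUT_S = 60  # default mentioned above


def run_worker(check_gcs, clock=time.monotonic, sleep=time.sleep,
               timeout_s=GCS_RECONNECT_TIMEOUT_S, poll_s=1.0):
    """Poll the GCS; return "exit" once it has been unreachable for timeout_s."""
    last_ok = clock()  # time of the last successful GCS contact
    while True:
        if check_gcs():
            last_ok = clock()
        elif clock() - last_ok >= timeout_s:
            # The worker gives up and terminates itself.
            return "exit"
        sleep(poll_s)
```

With a `check_gcs` that always fails, the loop returns after `timeout_s` seconds. The behavior reported in this issue corresponds to workers that neither reconnect nor reach that exit.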
cc some Ray core folks @iycheng @scv119