If the head node dies, the cluster is never restored [Bug]

Open blublinsky opened this issue 2 years ago • 3 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

After a Ray cluster is created and started, manually kill the head node. As expected, the operator restores the head node. The problem is that the worker nodes never reconnect to the new head node, so you end up with a single-node cluster that consists of the head node only.

As far as I can see, the only way to fix this is either to restart all the worker nodes after the head node restarts or to restart the whole cluster.

Reproduction script

Just manually kill the head node pod.
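For anyone who wants to script the step above, here is a minimal sketch that builds the `kubectl` commands. It assumes the pods carry the `ray.io/cluster` and `ray.io/node-type` labels that KubeRay normally sets; the helper names and the cluster name `raycluster-sample` are made up for illustration.

```python
import subprocess
from typing import List


def head_pod_delete_cmd(cluster_name: str, namespace: str = "default") -> List[str]:
    """Build a kubectl command that deletes the head pod of one Ray cluster.

    Assumption: pods are labeled ray.io/cluster=<name> and ray.io/node-type=head.
    """
    selector = f"ray.io/cluster={cluster_name},ray.io/node-type=head"
    return ["kubectl", "delete", "pod", "-n", namespace, "-l", selector]


def list_pods_cmd(cluster_name: str, namespace: str = "default") -> List[str]:
    """Build a kubectl command that lists all pods of the cluster with their IPs."""
    selector = f"ray.io/cluster={cluster_name}"
    return ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "wide"]


def reproduce(cluster_name: str, namespace: str = "default") -> None:
    """Show the pods, delete the head pod, then show the pods again."""
    subprocess.run(list_pods_cmd(cluster_name, namespace), check=True)
    subprocess.run(head_pod_delete_cmd(cluster_name, namespace), check=True)
    subprocess.run(list_pods_cmd(cluster_name, namespace), check=True)
```

Calling `reproduce("raycluster-sample")` against a test cluster deletes the head pod. Comparing the head pod IP in the two listings (once the replacement pod is up) shows whether it changed, and `ray status` on the new head shows whether the workers rejoined.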

Anything else

This happens every time the head node pod is deleted.

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

blublinsky, Jun 05 '23 14:06

Hi @blublinsky, which version of Ray did you use? Thanks!

kevin85421, Jun 05 '23 18:06

2.1 and 2.4, but it should not matter. Reconnection only works if the new head node gets the same IP, which can't be guaranteed.

blublinsky, Jun 05 '23 18:06

This sounds like a Ray bug. Typically, a worker will kill itself if it cannot connect to the GCS for 60 seconds (by default). You can refer to https://github.com/ray-project/kuberay/pull/1036 for more details.
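To make the expected behavior concrete, here is a toy model of that timeout. It is only an illustration of the rule described above, not Ray's actual implementation; the class and method names are hypothetical, and only the 60-second default comes from this comment.

```python
from dataclasses import dataclass


@dataclass
class GcsWatchdog:
    """Hypothetical model: a worker exits after too long without GCS contact."""

    timeout_s: float = 60.0  # default mentioned in this thread
    last_ok: float = 0.0     # timestamp of the last successful GCS contact

    def record_success(self, now: float) -> None:
        """Remember the time of the latest successful connection to the GCS."""
        self.last_ok = now

    def should_exit(self, now: float) -> bool:
        """Return True once the GCS has been unreachable for timeout_s seconds."""
        return now - self.last_ok >= self.timeout_s
```

Under this model, a worker that last reached the GCS at t=10 would terminate at t=70, and the operator would then create a replacement worker pod that connects to the new head. The report above suggests that this termination is not happening.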

cc some Ray core folks @iycheng @scv119

kevin85421, Jun 05 '23 19:06