
[Bug] worker group cannot be removed from RayCluster

Open architkulkarni opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

Adding a new worker group to a RayCluster works fine: you can see the worker pod come up. But when you remove that worker group, the pod never gets terminated, and the autoscaler container in the head pod goes into a crash loop.

I'm not 100% sure whether this behavior is supported, but if it is, this is a bug.

UPDATE: We will not support this, see the discussion below. We should still reject such an update with a clear error message.

Reproduction script


# Modify the below YAML file to have `replicas: 1` so that we can watch a worker pod come up.  Apply the YAML.
kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.yaml

watch kubectl get pod # wait until head and worker pods come up

# Now add a new entry to `workerGroupSpecs`, copying the first entry but changing the name to `small-group-2`.

kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.yaml

watch kubectl get pod # wait for the second worker pod to come up

# Now remove the entry from `workerGroupSpecs` that we just added.

kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.yaml

watch kubectl get pod

Now the second worker pod never terminates, and the autoscaler container in the head pod crash-loops with the following traceback:

Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2234, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 54, in run_kuberay_autoscaler
    Monitor(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/monitor.py", line 586, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/monitor.py", line 391, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 383, in update
    raise e
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 376, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 464, in _update
    self.provider.post_process()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/batching_node_provider.py", line 150, in post_process
    self.submit_scale_request(self.scale_request)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 282, in submit_scale_request
    patch_payload = self._scale_request_to_patch_payload(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 363, in _scale_request_to_patch_payload
    group_index = _worker_group_index(raycluster, node_type)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 185, in _worker_group_index
    return group_names.index(group_name)
ValueError: 'small-group-2' is not in list
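For context, here is a minimal Python sketch (not the actual Ray source) of the lookup at the bottom of the traceback: the autoscaler resolves each node type to its index in the RayCluster's `workerGroupSpecs` by group name, so once `small-group-2` is removed from the CR while its pod (and the autoscaler's record of it) still exists, the lookup raises.

```python
# A RayCluster spec after `small-group-2` has been removed, while the
# autoscaler still tracks a node of that type.
raycluster = {
    "spec": {
        "workerGroupSpecs": [
            {"groupName": "small-group"},
            # "small-group-2" was deleted from the CR, but its pod remains.
        ]
    }
}


def worker_group_index(raycluster, group_name):
    """Return the index of the named worker group in workerGroupSpecs."""
    group_names = [
        spec["groupName"] for spec in raycluster["spec"]["workerGroupSpecs"]
    ]
    # list.index raises ValueError when the name is absent -- this is the
    # failure mode in the traceback above.
    return group_names.index(group_name)


try:
    worker_group_index(raycluster, "small-group-2")
except ValueError as e:
    print(e)  # 'small-group-2' is not in list
```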

You can get the log with

kubectl logs raycluster-autoscaler-head-XXXXX -c autoscaler --previous

Anything else

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

architkulkarni avatar Dec 12 '23 16:12 architkulkarni

After discussing this offline, we decided not to support the removal of a worker group from RayCluster for the following reasons:

  • The implementation is complex.
  • There have been no user requests for this feature.
  • We lack a method to drain the Ray nodes before deleting the Pods, which makes it highly likely that Ray Pods with running tasks or actors will be deleted.

kevin85421 avatar Dec 14 '23 22:12 kevin85421

Yup. We should still reject such an update with a clear error message. I'll leave the issue up to track this.

architkulkarni avatar Dec 14 '23 23:12 architkulkarni