kuberay
[Bug] worker group cannot be removed from RayCluster
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
Adding a new worker group to a RayCluster works fine: you can watch the new worker pod come up. But when you remove that worker group, its pod is never terminated, and the autoscaler container in the head pod goes into a crash loop.
I'm not 100% sure this behavior is supported, but if it is, this is a bug.
UPDATE: We will not support this; see the discussion below. We should still reject such an update with a clear error message.
Reproduction script
# Modify the YAML file below to use `replicas: 1` so that we can watch a worker pod come up, then apply it.
kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.yaml
watch kubectl get pod # wait until head and worker pods come up
# Now add a second entry under `workerGroupSpecs`, copying the first entry but changing the name to `small-group-2`.
kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.yaml
watch kubectl get pod # wait for the second worker pod to come up
# Now remove the `small-group-2` entry from `workerGroupSpecs`.
kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.yaml
watch kubectl get pod
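For concreteness, step 2 amounts to duplicating the first group entry with a new `groupName`. An abridged sketch of the relevant part of the manifest (field names follow the RayCluster CRD; all other fields are omitted here):

```yaml
spec:
  workerGroupSpecs:
  - groupName: small-group
    replicas: 1
    # ... rest of the group spec unchanged ...
  - groupName: small-group-2   # copy of small-group; added in step 2, removed in step 3
    replicas: 1
    # ... rest of the group spec unchanged ...
```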
Now the second worker pod never terminates, and the autoscaler container in the head pod crash-loops with:
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2234, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 54, in run_kuberay_autoscaler
    Monitor(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/monitor.py", line 586, in run
    self._run()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/monitor.py", line 391, in _run
    self.autoscaler.update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 383, in update
    raise e
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 376, in update
    self._update()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/autoscaler.py", line 464, in _update
    self.provider.post_process()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/batching_node_provider.py", line 150, in post_process
    self.submit_scale_request(self.scale_request)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 282, in submit_scale_request
    patch_payload = self._scale_request_to_patch_payload(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 363, in _scale_request_to_patch_payload
    group_index = _worker_group_index(raycluster, node_type)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 185, in _worker_group_index
    return group_names.index(group_name)
ValueError: 'small-group-2' is not in list
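The failure mode is visible in the last frame: the autoscaler looks up a worker group's position in the RayCluster spec by name, and `list.index` raises once the group has been removed from the CR while the autoscaler still tracks nodes of that type. A simplified Python sketch of that lookup (the dict shape mirrors the RayCluster custom resource; this is an illustration, not the exact implementation of `_worker_group_index`):

```python
def worker_group_index(raycluster: dict, group_name: str) -> int:
    """Return the position of group_name in the CR's workerGroupSpecs.

    Sketch of the lookup the autoscaler performs: it assumes every
    group it still tracks exists in the spec, so a removed group
    surfaces as an unhandled ValueError.
    """
    group_names = [
        spec["groupName"] for spec in raycluster["spec"]["workerGroupSpecs"]
    ]
    # Raises ValueError once the group has been removed from the CR
    # but the autoscaler still has nodes of that type to scale.
    return group_names.index(group_name)


raycluster = {"spec": {"workerGroupSpecs": [{"groupName": "small-group"}]}}
print(worker_group_index(raycluster, "small-group"))  # 0
try:
    worker_group_index(raycluster, "small-group-2")
except ValueError as e:
    print(e)  # 'small-group-2' is not in list
```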
You can get the log with
kubectl logs raycluster-autoscaler-head-XXXXX -c autoscaler --previous
Anything else
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
After discussing this offline, we decided not to support the removal of a worker group from RayCluster for the following reasons:
- The implementation is complex.
- There have been no user requests for this feature.
- We lack a method to drain the Ray nodes before deleting the Pods, which makes it highly likely that Ray Pods with running tasks or actors will be deleted.
Yup. We should still reject such an update with a clear error message. I'll leave the issue up to track this.
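One way to reject such an update is to diff the group names between the old and new specs and fail validation if any name disappears. A hypothetical Python sketch of that check (the real implementation would live in the operator's Go validation code; function names here are illustrative):

```python
def removed_worker_groups(old_specs: list, new_specs: list) -> list:
    """Return group names present in old_specs but missing from new_specs."""
    old_names = {spec["groupName"] for spec in old_specs}
    new_names = {spec["groupName"] for spec in new_specs}
    # Sort so the error message is deterministic.
    return sorted(old_names - new_names)


def validate_update(old_specs: list, new_specs: list) -> None:
    """Reject a RayCluster update that removes a worker group."""
    removed = removed_worker_groups(old_specs, new_specs)
    if removed:
        raise ValueError(
            f"removing worker groups is not supported: {removed}"
        )
```

Scaling a group's `replicas` to 0 still passes this check, so users keep a supported way to stop a group's pods without deleting the group entry.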