kuberay
kuberay copied to clipboard
fix: change RayCluster .status.state to unhealthy if not all Pods running
Right now if a RayCluster's .status.state
is ready, and later any of its Pods
change to not running, the .status.state
is still ready. This change
makes the RayCluster controller change .status.state
to unhealthy.
example of a RayCluster that's ready but Pods aren't running
❯ kubectl get rayclusters
NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE
ready-zenonesque 1 1 ready 17d
❯ kubectl get pods
NAME READY STATUS RESTARTS AGE
ready-zenonesque-head-6bbc2 3/4 ImagePullBackOff 0 2d20h
ready-zenonesque-worker-worker-fpbmb 1/2 CrashLoopBackOff 547 (4m14s ago) 17d
Checks
- [x] I've made sure the tests are passing.
- Testing Strategy
- [x] Unit tests
- [x] Manual tests
- [ ] This PR is not tested :(
cc @kevin85421
I will review this PR because we are approaching the release of KubeRay v1.1.0, and State
is an important part for RayJob. We can revisit this PR after the release to avoid introducing any potential stability issues to the upcoming release.
This PR can remove Ready from the ClusterState, and then (1) add a field in HeadInfo to indicate whether the head Pod is ready or not. (2) add a new field (ReadyWorkerReplicas int32) in RayClusterStatus which is similar to ReadyReplicas in ReplicaSetStatus.
After rethinking this, we can narrow down the scope of this PR a bit to accelerate the merging process. We can add only ReadyWorkerReplicas int32
in this PR, and remove Ready
and add a new field in HeadInfo
in the follow-up PRs.
Narrow down the scope of this PR to add ReadyWorkerReplicas int32
. I have already started working on some items mentioned in https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-1978124073 (ex: #2068).
We can add only ReadyWorkerReplicas int32 in this PR, and remove Ready and add a new field in HeadInfo in the follow-up PRs.
Is it safe to remove the Ready state? I imagine a lot of people depend on this status already and it would break compatibility
Is it safe to remove the Ready state? I imagine a lot of people depend on this status already and it would break compatibility
Good point. I suspect that the RayCluster status is rarely used by users because the ClusterState is not well-defined, yet almost no users have complained about it.
I suspect that the RayCluster status is rarely used by users because the ClusterState is not well-defined, yet almost no users have complained about it.
I agree that it's not well defined, but I don't think that means users are not using it.
The reality is that many users will not open issues or complain about something even if they use it. @davidxia opening this PR to change the behavior is one signal that users depend on this field already. If I had to guess, there's probably a lot of (not public) automation that creates a RayCluster and then waits for the .status.state
to be ready
. I personally think it would be quite disruptive to users to remove this state.
The reality is that many users will not open issues or complain about something even if they use it. @davidxia opening this PR to change the behavior is one signal that users depend on this field already. If I had to guess, there's probably a lot of (not public) automation that creates a RayCluster and then waits for the .status.state to be ready. I personally think it would be quite disruptive to users to remove this state.
Your point makes sense to me. I am currently considering the tradeoff:
- Option 1: Change the definition of
ready
(e.g. this PR)- This PR changes the definition of
ready
, which is also backward incompatible. Some automation processes may pass, but others may fail and be hard to debug because this change could represent an "unknown unknown" for users.
- This PR changes the definition of
- Option 2: Remove
ready
and adopt the solution mentioned in https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-1978124073 instead.- All related automation processes will fail, but it is easy for users and us to troubleshoot.
We can also wait for one or two months to see if users report any issues with the breaking changes in RayJob status. In KubeRay v1.1.0, almost all RayJob statuses have been redefined. If only a few users raise concerns about this, we may be able to infer that using CR status is not common among users.
This PR changes the definition of ready, which is also backward incompatible. Some automation processes may pass, but others may fail and be hard to debug because this change could represent an "unknown unknown" for users.
Good point that this is also considered a breaking change.
I think the technically correct change would be to leave the meaning of .status.state = ready
unchanged and add new fields in HeadInfo
that reflect the new state we want to reflect. However, I consider this PR to be fixing a major bug in the ready
definition. A RayCluster that previously had all pods running but now no longer have any pods should definitely not be considered ready
. Most users that depend on the ready
state probably expected it to cover this scenario anyways, so I think changing the current definition of .status.state = ready
is actually okay to do in this particular case. I think it's unlikely to break automation significantly and may catch issues that previously were not captured in the old behavior
Most users that depend on the ready state probably expected it to cover this scenario anyways, so I think changing the current definition of .status.state = ready is actually okay to do in this particular case.
I would say this is a breaking change. The RayJob before KubeRay v1.1.0 will have a lot of issues with this PR's definition of ready
. That's why I don't want to merge this PR before the release. The definition of ready
is confusing to some KubeRay developers, who should have a deeper understanding for the benefit of the users. KubeRay has accumulated a lot of technical debt in its early stages. The status of RayCluster is one issue that we must fix.
- We can update
HeadInfo
at first. - Wait for one month to gather user feedback on RayJob v1.1.0. This will give us a better understanding of whether the status is important to users.
- I will remove
unhealthy
.
@kevin85421 I updated this PR to only add RayCluster.status.readyWorkerReplicas
. Lmk if it looks good.
@MortalHappiness will take https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-2044141001.