kuberay icon indicating copy to clipboard operation
kuberay copied to clipboard

fix: change RayCluster .status.state to unhealthy if not all Pods running

Open davidxia opened this issue 1 year ago • 2 comments

Right now if a RayCluster's .status.state is ready, and later any of its Pods change to not running, the .status.state is still ready. This change makes the RayCluster controller change .status.state to unhealthy.

example of a RayCluster that's ready but Pods aren't running

❯ kubectl get rayclusters
NAME               DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
ready-zenonesque   1                 1                   ready    17d

❯ kubectl get pods
NAME                                   READY   STATUS             RESTARTS          AGE
ready-zenonesque-head-6bbc2            3/4     ImagePullBackOff   0                 2d20h
ready-zenonesque-worker-worker-fpbmb   1/2     CrashLoopBackOff   547 (4m14s ago)   17d

Checks

  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [x] Unit tests
    • [x] Manual tests
    • [ ] This PR is not tested :(

davidxia avatar Feb 19 '24 17:02 davidxia

cc @kevin85421

davidxia avatar Feb 19 '24 19:02 davidxia

I will review this PR because we are approaching the release of KubeRay v1.1.0, and State is an important part for RayJob. We can revisit this PR after the release to avoid introducing any potential stability issues to the upcoming release.

kevin85421 avatar Feb 22 '24 17:02 kevin85421

This PR can remove Ready from the ClusterState, and then (1) add a field in HeadInfo to indicate whether the head Pod is ready or not. (2) add a new field (ReadyWorkerReplicas int32) in RayClusterStatus which is similar to ReadyReplicas in ReplicaSetStatus.

After rethinking this, we can narrow down the scope of this PR a bit to accelerate the merging process. We can add only ReadyWorkerReplicas int32 in this PR, and remove Ready and add a new field in HeadInfo in the follow-up PRs.

kevin85421 avatar Apr 03 '24 23:04 kevin85421

Narrow down the scope of this PR to add ReadyWorkerReplicas int32. I have already started working on some items mentioned in https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-1978124073 (ex: #2068).

kevin85421 avatar Apr 05 '24 09:04 kevin85421

We can add only ReadyWorkerReplicas int32 in this PR, and remove Ready and add a new field in HeadInfo in the follow-up PRs.

Is it safe to remove the Ready state? I imagine a lot of people depend on this status already and it would break compatibility

andrewsykim avatar Apr 05 '24 13:04 andrewsykim

Is it safe to remove the Ready state? I imagine a lot of people depend on this status already and it would break compatibility

Good point. I suspect that the RayCluster status is rarely used by users because the ClusterState is not well-defined, yet almost no users have complained about it.

kevin85421 avatar Apr 05 '24 20:04 kevin85421

I suspect that the RayCluster status is rarely used by users because the ClusterState is not well-defined, yet almost no users have complained about it.

I agree that it's not well defined, but I don't think that means users are not using it.

The reality is that many users will not open issues or complain about something even if they use it. @davidxia opening this PR to change the behavior is one signal that users depend on this field already. If I had to guess, there's probably a lot of (not public) automation that creates a RayCluster and then waits for the .status.state to be ready. I personally think it would be quite disruptive to users to remove this state.

andrewsykim avatar Apr 07 '24 01:04 andrewsykim

The reality is that many users will not open issues or complain about something even if they use it. @davidxia opening this PR to change the behavior is one signal that users depend on this field already. If I had to guess, there's probably a lot of (not public) automation that creates a RayCluster and then waits for the .status.state to be ready. I personally think it would be quite disruptive to users to remove this state.

Your point makes sense to me. I am currently considering the tradeoff:

  • Option 1: Change the definition of ready (e.g. this PR)
    • This PR changes the definition of ready, which is also backward incompatible. Some automation processes may pass, but others may fail and be hard to debug because this change could represent an "unknown unknown" for users.
  • Option 2: Remove ready and adopt the solution mentioned in https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-1978124073 instead.
    • All related automation processes will fail, but it is easy for users and us to troubleshoot.

We can also wait for one or two months to see if users report any issues with the breaking changes in RayJob status. In KubeRay v1.1.0, almost all RayJob statuses have been redefined. If only a few users raise concerns about this, we may be able to infer that using CR status is not common among users.

kevin85421 avatar Apr 08 '24 17:04 kevin85421

This PR changes the definition of ready, which is also backward incompatible. Some automation processes may pass, but others may fail and be hard to debug because this change could represent an "unknown unknown" for users.

Good point that this is also considered a breaking change.

I think the technically correct change would be to leave the meaning of .status.state = ready unchanged and add new fields in HeadInfo that reflect the new state we want to reflect. However, I consider this PR to be fixing a major bug in the ready definition. A RayCluster that previously had all pods running but now no longer have any pods should definitely not be considered ready. Most users that depend on the ready state probably expected it to cover this scenario anyways, so I think changing the current definition of .status.state = ready is actually okay to do in this particular case. I think it's unlikely to break automation significantly and may catch issues that previously were not captured in the old behavior

andrewsykim avatar Apr 08 '24 23:04 andrewsykim

Most users that depend on the ready state probably expected it to cover this scenario anyways, so I think changing the current definition of .status.state = ready is actually okay to do in this particular case.

I would say this is a breaking change. The RayJob before KubeRay v1.1.0 will have a lot of issues with this PR's definition of ready. That's why I don't want to merge this PR before the release. The definition of ready is confusing to some KubeRay developers, who should have a deeper understanding for the benefit of the users. KubeRay has accumulated a lot of technical debt in its early stages. The status of RayCluster is one issue that we must fix.

  • We can update HeadInfo at first.
  • Wait for one month to gather user feedback on RayJob v1.1.0. This will give us a better understanding of whether the status is important to users.
  • I will remove unhealthy.

kevin85421 avatar Apr 09 '24 05:04 kevin85421

@kevin85421 I updated this PR to only add RayCluster.status.readyWorkerReplicas. Lmk if it looks good.

davidxia avatar Apr 29 '24 16:04 davidxia

@MortalHappiness will take https://github.com/ray-project/kuberay/pull/1930#pullrequestreview-2044141001.

kevin85421 avatar May 07 '24 22:05 kevin85421