
[RayCluster][Fix] Add expectations of RayCluster

Eikykun opened this pull request 1 year ago • 13 comments

Why are these changes needed?

This PR attempts to address issues #715 and #1936 by adding an expectations mechanism that ensures Pods are in the expected state during the next Reconcile after a Pod creation or deletion.

Similar solutions can be found at:

  • https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/controller_utils.go#L146
  • https://github.com/openkruise/kruise/tree/master/pkg/util/expectations
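
For readers new to this pattern, the sketch below shows roughly where expectations hook into a reconcile loop. The interface and names loosely follow the upstream controller_utils.go linked above; they are illustrative, not this PR's exact code.

```go
package expectations

import "context"

// expectationsStore is a sketch interface loosely modeled on the upstream
// ControllerExpectations API referenced above; not the exact type in this PR.
type expectationsStore interface {
	SatisfiedExpectations(controllerKey string) bool
	ExpectCreations(controllerKey string, adds int)
	CreationObserved(controllerKey string)
}

type reconciler struct {
	expectations expectationsStore
	createPod    func(ctx context.Context) error // stand-in for the real Pod creation
}

// reconcilePods shows where expectations plug into the reconcile loop.
func (r *reconciler) reconcilePods(ctx context.Context, clusterKey string, toCreate int) error {
	// If earlier creations/deletions have not yet shown up in the informer
	// cache, skip this sync; the corresponding Pod events will trigger
	// another reconcile once the cache catches up.
	if !r.expectations.SatisfiedExpectations(clusterKey) {
		return nil
	}

	// Record the intended creations before issuing them, so Pod events that
	// arrive in the meantime are counted against the expectation.
	r.expectations.ExpectCreations(clusterKey, toCreate)
	for i := 0; i < toCreate; i++ {
		if err := r.createPod(ctx); err != nil {
			// Decrement for the Pod that was never created, otherwise the
			// controller would wait for an event that never comes.
			r.expectations.CreationObserved(clusterKey)
			return err
		}
	}
	return nil
}
```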

Related issue number

  • #715
  • #1936

Checks

  • [ ] I've made sure the tests are passing.
  • Testing Strategy
    • [x] Unit tests
    • [ ] Manual tests
    • [ ] This PR is not tested :(

Eikykun avatar May 16 '24 13:05 Eikykun

Hi @Eikykun, thank you for the PR! I will review it next week. Are you on Ray Slack? We can iterate more quickly there since this is a large PR. My Slack handle is "Kai-Hsun Chen (ray team)". Thanks!

kevin85421 avatar May 18 '24 23:05 kevin85421

I will review this PR tomorrow.

kevin85421 avatar May 28 '24 05:05 kevin85421

cc @rueian Would you mind giving this PR a review? I think I don't have enough time to review it today. Thanks!

kevin85421 avatar May 29 '24 03:05 kevin85421

Just wondering: if client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

rueian avatar May 30 '24 15:05 rueian

Just wondering: if client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

Apologies, I'm not quite clear about what "related informer cache" refers to.

Eikykun avatar Jun 03 '24 04:06 Eikykun

Just wondering: if client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

Apologies, I'm not quite clear about what "related informer cache" refers to.

According to https://github.com/ray-project/kuberay/issues/715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example doing a manual Resync somehow.

rueian avatar Jun 08 '24 06:06 rueian

I am reviewing this PR now. I will try to do a review iteration every 1 or 2 days.

kevin85421 avatar Jun 11 '24 06:06 kevin85421

Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

kevin85421 avatar Jun 12 '24 07:06 kevin85421

According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example doing a manual Resync somehow.

Got it. From a problem-solving standpoint, if we didn't rely on an informer in the controller and instead queried the ApiServer directly for pods, the cache consistency issue with etcd wouldn't occur. However, that approach would increase network traffic and hurt reconciliation efficiency. As far as I understand, the Resync() method in DeltaFIFO is not intended to ensure cache consistency with etcd, but rather to prevent event loss through periodic re-delivery of cached objects.
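
As a concrete illustration of that trade-off, controller-runtime exposes an uncached reader via mgr.GetAPIReader() that queries the API server directly. A minimal sketch, assuming the standard controller-runtime client API (the label key is an assumption for illustration):

```go
package expectations

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listPodsUncached lists a RayCluster's Pods through the uncached reader
// returned by mgr.GetAPIReader(). This sidesteps stale informer data, but
// every call is an extra round trip to the API server, which is exactly the
// network/efficiency cost mentioned above.
func listPodsUncached(ctx context.Context, apiReader client.Reader, namespace, clusterName string) (*corev1.PodList, error) {
	var pods corev1.PodList
	err := apiReader.List(ctx, &pods,
		client.InNamespace(namespace),
		// The cluster label key below is assumed for illustration.
		client.MatchingLabels{"ray.io/cluster": clusterName},
	)
	return &pods, err
}
```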

Eikykun avatar Jun 12 '24 09:06 Eikykun

Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

Thanks for your review. I will look into the PR issues and resolve the conflicts later.

Eikykun avatar Jun 12 '24 09:06 Eikykun

@Eikykun would you mind installing pre-commit https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md and fixing the linter issues? Thanks!

kevin85421 avatar Jun 12 '24 17:06 kevin85421

At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In the ReplicaSet controller's source code, it seems to track only the number of Pods expected to be created or deleted per ReplicaSet.

Follow up for ^

kevin85421 avatar Jun 17 '24 04:06 kevin85421

At a quick glance, it seems that we create an ActiveExpectationItem for each Pod's creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In the ReplicaSet controller's source code, it seems to track only the number of Pods expected to be created or deleted per ReplicaSet.

Sorry, I didn't have time to reply a few days ago.

An ActiveExpectationItem is removed once its expectation is fulfilled, so the memory usage depends on how many created or deleted pods have not yet been synchronized to the cache. It might not actually consume much memory. Also, ControllerExpectations caches each pod's UID: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/controller_utils.go#L364 So I'm not quite sure which one is lighter, ActiveExpectationItem or ControllerExpectations.

I actually started with ControllerExpectations in RayCluster, but I'm a bit unsure why I switched to ActiveExpectationItem; perhaps it was more complicated: ControllerExpectations requires a PodEventHandler to handle the Observed logic, so RayCluster would need to implement that PodEventHandler logic separately.
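
For readers following the memory comparison: the ReplicaSet-style approach keeps only a pair of counters per controller key rather than one entry per in-flight Pod. A rough sketch of that counter variant (not the actual upstream types):

```go
package expectations

import "sync"

// counterExpectations is a rough sketch of the ReplicaSet-style approach
// discussed above: a single small record per RayCluster instead of one
// entry per in-flight Pod.
type scaleCounters struct {
	add int64 // creations issued but not yet observed in the cache
	del int64 // deletions issued but not yet observed in the cache
}

type counterExpectations struct {
	mu    sync.Mutex
	items map[string]*scaleCounters // keyed by namespace/name of the RayCluster
}

func newCounterExpectations() *counterExpectations {
	return &counterExpectations{items: map[string]*scaleCounters{}}
}

func (e *counterExpectations) ExpectCreations(key string, n int64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	c, ok := e.items[key]
	if !ok {
		c = &scaleCounters{}
		e.items[key] = c
	}
	c.add += n
}

func (e *counterExpectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if c, ok := e.items[key]; ok && c.add > 0 {
		c.add--
	}
}

// Satisfied reports whether every recorded creation and deletion for key has
// been observed; once true, the entry can be dropped to free memory.
// (ExpectDeletions/DeletionObserved would mirror the two methods above.)
func (e *counterExpectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	c, ok := e.items[key]
	if !ok {
		return true
	}
	return c.add <= 0 && c.del <= 0
}
```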

Eikykun avatar Jun 18 '24 04:06 Eikykun

Could you help approve the workflow? cc @kevin85421 If there are any questions, concerns, or additional changes needed on my part, please let me know. I am happy to assist in any way possible to expedite the process :)

Eikykun avatar Oct 29 '24 04:10 Eikykun

@Eikykun, thank you for following up! Sorry for the late review. I had concerns about merging such a large change before Ray Summit. Now, I have enough time to verify the correctness and stability of this PR. This is also one of the most important stability improvements in the post-summit roadmap.

https://docs.google.com/document/d/1YYuAQkHKz2UTFMnTDJLg4qnW2OAlYQqjvetP_nvt0nI/edit?tab=t.0

I will resume reviewing this PR this week.

kevin85421 avatar Oct 29 '24 04:10 kevin85421

cc @MortalHappiness Can you also give this PR a review pass?

kevin85421 avatar Nov 05 '24 06:11 kevin85421

A few questions I'd like to ask:

  • Is ExpectedResourceType = "RayCluster" actually used? So far, I've only seen "Pod" being used.
  • Your implementation and the code in the description differ quite a bit, so I'm curious:
    • If only "Pod" is being used, why separate it into ActiveExpectation and RayClusterExpectation?
    • Why have implementations for both, with RayClusterExpectation composing ActiveExpectation? From what I can see, it seems like there only needs to be one. It feels more similar to ScaleExpectations, which mainly expects two operations: "Create" and "Delete."

MortalHappiness avatar Nov 05 '24 16:11 MortalHappiness

A few questions I'd like to ask:

  • Is ExpectedResourceType = "RayCluster" actually used? So far, I've only seen "Pod" being used.

  • Your implementation and the code in the description differ quite a bit, so I'm curious:

    • If only "Pod" is being used, why separate it into ActiveExpectation and RayClusterExpectation?
    • Why have implementations for both, with RayClusterExpectation composing ActiveExpectation? From what I can see, it seems like there only needs to be one. It feels more similar to ScaleExpectations, which mainly expects two operations: "Create" and "Delete."

This is likely leftover from the last round of simplification. Initially, I added many types like RayCluster, Service, etc., anticipating that resources other than scaled Pods might also need expectations. If we only consider the scaling logic for each group, the code can be simplified significantly. In fact, I recently streamlined it and reduced the scale-expectation code to around 100 lines. You can find it in the latest commit.
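
For context, a scale-only expectations API in that spirit might look roughly like the hypothetical sketch below; the names and signatures in the actual commit may differ.

```go
package expectations

// ScaleAction is limited to the two operations discussed above.
type ScaleAction string

const (
	Create ScaleAction = "Create"
	Delete ScaleAction = "Delete"
)

// ScaleExpectations is a hypothetical sketch of a scale-only expectations
// interface, not necessarily what the latest commit uses.
type ScaleExpectations interface {
	// ExpectScale records that a Pod is about to be created or deleted for
	// the RayCluster identified by key (namespace/name).
	ExpectScale(key string, action ScaleAction, podName string)
	// ObserveScale marks a previously recorded expectation as fulfilled,
	// e.g. once the Pod event is seen or the cache reflects the change.
	ObserveScale(key string, action ScaleAction, podName string)
	// SatisfiedExpectations reports whether all recorded creations and
	// deletions for key have been observed in the informer cache.
	SatisfiedExpectations(key string) bool
}
```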

Eikykun avatar Nov 06 '24 10:11 Eikykun

By the way, maybe you missed this because this comment is folded. https://github.com/ray-project/kuberay/pull/2150#discussion_r1843097987

Could you either:

  • Change ExpectScalePod to return an error, just like Kubernetes does.
  • Add comments explaining that errors are ignored because the call panics instead of returning an error (rough sketches of both options follow below).
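
Illustrative signatures for the two options (the real ExpectScalePod signature in this PR may differ):

```go
package expectations

type ScaleAction string

// Option 1: propagate the error, mirroring the upstream Kubernetes
// ControllerExpectations API (illustrative signature only).
type scaleExpectationsWithError interface {
	ExpectScalePod(namespace, podName, rayClusterName string, action ScaleAction) error
}

// Option 2: keep an error-free signature and document why that is safe,
// e.g. because the underlying store panics on a malformed key rather than
// returning an error, so there is nothing for the caller to handle.
type scaleExpectationsPanicOnError interface {
	ExpectScalePod(namespace, podName, rayClusterName string, action ScaleAction)
}
```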

MortalHappiness avatar Nov 18 '24 10:11 MortalHappiness

By the way, maybe you missed this because this comment is folded. #2150 (comment)

Could you either:

  • Change ExpectScalePod to return an error, just like Kubernetes does.
  • Add comments explaining that errors are ignored because the call panics instead of returning an error.

Thank you for your patient review. I have added some comments.

Eikykun avatar Nov 18 '24 11:11 Eikykun

@kevin85421 Can you merge the PR?

Eikykun avatar Nov 21 '24 11:11 Eikykun

Merged. Thank you for the contribution!

kevin85421 avatar Nov 25 '24 05:11 kevin85421