[Bug] RayCluster controller operator does not handle stale informer cache
Search before asking
- [x] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
When reconciling pod quantities, the RayCluster controller uses the pod informer cache as the sole source of truth for current pod state. The number of pods to create or delete is computed as the difference between the RayCluster spec's replica count and the number of pods listed from the cache.
The cache is frequently stale and there's currently no accounting mechanism to try to maintain read-after-write consistency. As a result, situations arise where more pods are launched than necessary and then a random pod is deleted to readjust.
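For illustration, here is a minimal sketch (not the actual KubeRay code; the function signature, newPod helper, and label keys are assumptions) of the reconcile pattern described above, where the create/delete diff is computed solely from the cache-backed client:

```go
package sketch // illustrative only

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileWorkerPods sketches the problematic pattern: the desired/actual diff
// is computed purely from the informer cache.
func reconcileWorkerPods(ctx context.Context, c client.Client, namespace, clusterName string, replicas int, newPod func() *corev1.Pod) error {
	var pods corev1.PodList
	// In controller-runtime, the default client's List reads from the informer cache.
	if err := c.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName}); err != nil {
		return err
	}

	// If the cache has not yet observed Pods created by an earlier reconcile,
	// diff is overestimated, extra Pods are created, and a later reconcile
	// deletes a (random) Pod to converge back to the desired replica count.
	diff := replicas - len(pods.Items)
	for i := 0; i < diff; i++ {
		if err := c.Create(ctx, newPod()); err != nil {
			return err
		}
	}
	return nil
}
```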
See https://github.com/ray-project/kuberay/issues/704 for logs from an example.
This is especially problematic for Ray since Ray worker pods can store application-critical state.
Reproduction script
Create several Ray clusters and check the operator logs for spurious pod creation followed by deletion.
Anything else
There's a bit of discussion of how the ReplicaSet controller handles this here: https://medium.com/@timebertt/kubernetes-controllers-at-scale-clients-caches-conflicts-patches-explained-aa0f7a8b4332
ReplicaSet controller code: https://github.com/kubernetes/kubernetes/blob/29ddedae1d0a07dfc7a89693daf3dbf4be766be3/pkg/controller/replicaset/replica_set.go#L367-L368 (note that "expectations" are tracked when a pod is created).
Kruise CloneSet pod reconciliation logic is here: https://github.com/openkruise/kruise/blob/5839c5ba1909b463242a3603a7d22d45aed7fa6e/pkg/controller/cloneset/cloneset_controller.go#L225-L280
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
cc @akanso @Jeffwan
Someone should probably take a closer look at the replica set code -- similar methods can be applied to KubeRay pod reconciliation.
Thank you for opening a new thread to highlight this issue! I plan to dive deep into it.
ReplicaSet Controller
- https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go
The ReplicaSet controller maintains expectations (a TTLCache of Pod creates/deletes each ReplicaSet expects to see). If expectations.add or expectations.del is larger than 0, the reconciler skips synchronizing that ReplicaSet because it is still waiting for some Pod add or delete events.
Take #704 as an example of how expectations avoid double creation of Pods: in the first iteration of reconcilePods, diff is equal to 1 (the CR specifies 1 worker Pod, but no Pod is running), so we set expectations.add to 1 because we expect 1 Pod add event.
https://github.com/ray-project/kuberay/blob/0579c90d8e807cfa866b175e27beb9afc5526f89/ray-operator/controllers/ray/raycluster_controller.go#L479
In the second iteration of reconcilePods, even though the worker Pod does not yet appear in the local informer cache (an inconsistency between the Kubernetes API server and the cache), synchronization is skipped because expectations.add is larger than 0.
When the informer cache receives the Pod add event from the Kubernetes API server, the Pod's resource event handler decreases expectations.add by 1. After that, synchronization is no longer skipped, and diff is 0 (no more Pods need to be created).
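For illustration, here is a minimal from-scratch sketch of the expectations idea (the real implementation in k8s.io/kubernetes/pkg/controller also tracks deletions and expires entries with a TTL; all names here are illustrative):

```go
package sketch // illustrative only

import "sync"

// expectations maps a controller key (e.g. "namespace/rayClusterName/groupName")
// to the number of Pod add events the controller still expects to observe.
type expectations struct {
	mu   sync.Mutex
	adds map[string]int
}

func newExpectations() *expectations {
	return &expectations{adds: map[string]int{}}
}

// ExpectCreations is called right before the reconciler issues Create calls.
func (e *expectations) ExpectCreations(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.adds[key] += n
}

// CreationObserved is called from the Pod add event handler.
func (e *expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.adds[key] > 0 {
		e.adds[key]--
	}
}

// Satisfied reports whether the informer cache has caught up with the
// controller's own writes; while it is false, reconcilePods skips creating Pods.
func (e *expectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.adds[key] == 0
}
```

In the #704 scenario, the first reconcile would call ExpectCreations(key, 1) just before creating the worker Pod; the second reconcile sees Satisfied(key) == false and skips creation until the Pod add event handler calls CreationObserved(key).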
Summary
TLDR: We can solve this issue by maintaining expectations.
- Pros: We have more flexibility to support our requirements.
- Cons:
  - Increases system complexity (contributors need to be familiar with how expectations work) => more bugs.
  - We need to implement scale-up/down, update, and deletion ourselves.
Instead of the replicaset solution with non-deterministic pod names and a cache of the controller's actions, it could be simpler to take the statefulset solution and use deterministic pod names -- then double-creation of pods due to cache staleness would manifest as a transient 409 "already exists" conflict.
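For illustration, a rough sketch of what that could look like (the naming scheme and helper are assumptions, not KubeRay's actual code):

```go
package sketch // illustrative only

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createWorkerPod uses a deterministic, StatefulSet-style name, so a
// double-create caused by a stale cache surfaces as a 409 "already exists"
// error that can simply be ignored (the next reconcile converges normally).
func createWorkerPod(ctx context.Context, c client.Client, namespace, groupName string, idx int, spec corev1.PodSpec) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-%d", groupName, idx), // deterministic name
			Namespace: namespace,
		},
		Spec: spec,
	}
	err := c.Create(ctx, pod)
	if apierrors.IsAlreadyExists(err) {
		// The Pod was already created by an earlier reconcile; the cache was stale.
		return nil
	}
	return err
}
```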
> Instead of the replicaset solution with non-deterministic pod names and a cache of the controller's actions, it could be simpler to take the statefulset solution and use deterministic pod names -- then double-creation of pods due to cache staleness would manifest as a transient 409 "already exists" conflict.
Good point. Maintaining the names of Pods may be challenging, especially with Ray's autoscaling feature.
However, it might bring more trouble when it comes to failover. For instance, if we name each Pod "${workerGroupsName}-${idx}" and a Pod hits an OOM issue, the controller will restart it with the same Pod name and hostname, which might confuse the GCS about whether this is the old worker node reconnecting or a new one.
I don't think the GCS currently knows the name of the pod. But, conceivably, you could increment indices such that you never recreate a pod with the same name.
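A rough sketch of that idea (the naming scheme is an assumption; a production version would also persist the counter, e.g. in the CR status, so an index is never reused even after the highest-numbered Pod is deleted):

```go
package sketch // illustrative only

import (
	"fmt"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// nextWorkerName scans existing Pod names of the form "<group>-<idx>" and picks
// max(idx)+1, so a replacement Pod never comes back under a name the GCS has
// already seen.
func nextWorkerName(group string, existing []corev1.Pod) string {
	next := 0
	for _, p := range existing {
		suffix := strings.TrimPrefix(p.Name, group+"-")
		if idx, err := strconv.Atoi(suffix); err == nil && idx+1 > next {
			next = idx + 1
		}
	}
	return fmt.Sprintf("%s-%d", group, next)
}
```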
Hmm, introducing expectations definitely increases the complexity a lot. We used a similar strategy in Kubeflow training operators in the past. The tricky thing is that the expectations package is not in the controller tools, so we have to sync the code from Kubernetes core, and it keeps being updated. I feel the easiest fix at this moment is to query the API server directly instead, though that adds some burden for sure.
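For reference, a sketch of the "query the API server directly" option using controller-runtime's APIReader, which bypasses the informer cache at the cost of extra load on the API server (label keys are illustrative):

```go
package sketch // illustrative only

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// listWorkerPodsUncached lists Pods straight from the API server instead of the
// informer cache, so the result reflects the controller's own recent writes.
func listWorkerPodsUncached(ctx context.Context, mgr manager.Manager, namespace, clusterName string) (*corev1.PodList, error) {
	reader := mgr.GetAPIReader() // client.Reader backed by the API server, not the cache

	var pods corev1.PodList
	if err := reader.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName}); err != nil {
		return nil, err
	}
	return &pods, nil
}
```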
The test added by #1741 is flaky due to the stale informer cache. See #1746 for more details.
> Hmm, introducing expectations definitely increases the complexity a lot. We used a similar strategy in Kubeflow training operators in the past. The tricky thing is that the expectations package is not in the controller tools, so we have to sync the code from Kubernetes core, and it keeps being updated. I feel the easiest fix at this moment is to query the API server directly instead, though that adds some burden for sure.
This pretty much makes sense.
> The tricky thing is that the expectations package is not in the controller tools, so we have to sync the code from Kubernetes core, and it keeps being updated.
In principle, writing it from scratch is an option :)
Closed by #2150