[Bug] RayCluster controller operator does not handle stale informer cache
Search before asking
- [x] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
When reconciling pod quantities, the RayCluster controller uses the pod informer cache as the sole source of truth for current pod state. The number of pods to create or delete is computed as the difference between the RayCluster spec's replica count and the number of pods listed from the cache.
The cache is frequently stale and there's currently no accounting mechanism to try to maintain read-after-write consistency. As a result, situations arise where more pods are launched than necessary and then a random pod is deleted to readjust.
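For illustration, here is a minimal sketch (not the actual KubeRay code; the function signature, newPod helper, and label keys are assumptions) of the reconcile pattern described above, where the create/delete diff is computed solely from the cache-backed client:

```go
package sketch // illustrative only

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileWorkerPods sketches the problematic pattern: the desired/actual diff
// is computed purely from the informer cache.
func reconcileWorkerPods(ctx context.Context, c client.Client, namespace, clusterName string, replicas int, newPod func() *corev1.Pod) error {
	var pods corev1.PodList
	// In controller-runtime, the default client's List reads from the informer cache.
	if err := c.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName}); err != nil {
		return err
	}

	// If the cache has not yet observed Pods created by an earlier reconcile,
	// diff is overestimated, extra Pods are created, and a later reconcile
	// deletes a (random) Pod to converge back to the desired replica count.
	diff := replicas - len(pods.Items)
	for i := 0; i < diff; i++ {
		if err := c.Create(ctx, newPod()); err != nil {
			return err
		}
	}
	return nil
}
```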
See https://github.com/ray-project/kuberay/issues/704 for logs from an example.
This is especially problematic for Ray since Ray worker pods can store application-critical state.
Reproduction script
Create several Ray clusters and check the operator logs for spurious pod creation followed by deletion.
Anything else
There's a bit of discussion of how the ReplicaSet controller handles this here: https://medium.com/@timebertt/kubernetes-controllers-at-scale-clients-caches-conflicts-patches-explained-aa0f7a8b4332
ReplicaSet controller code: https://github.com/kubernetes/kubernetes/blob/29ddedae1d0a07dfc7a89693daf3dbf4be766be3/pkg/controller/replicaset/replica_set.go#L367-L368 (note that "expectations" are tracked when a pod is created).
Kruise CloneSet pod reconciliation logic is here: https://github.com/openkruise/kruise/blob/5839c5ba1909b463242a3603a7d22d45aed7fa6e/pkg/controller/cloneset/cloneset_controller.go#L225-L280
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
cc @akanso @Jeffwan
Someone should probably take a closer look at the replica set code -- similar methods can be applied to KubeRay pod reconciliation.
Thank you for opening a new thread to highlight this issue! I plan to dive deep into it.
ReplicaSet Controller
- https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/replicaset/replica_set.go
The ReplicaSet controller maintains expectations (a TTLCache of Pod creates/deletes each ReplicaSet expects to see). If expectations.add or expectations.del is larger than 0, the reconciler skips synchronizing that ReplicaSet because it is still waiting for some Pod add or delete events.
Take #704 as an example of how expectations avoid double creation of Pods: in the first iteration of reconcilePods, diff is equal to 1 (the CR specifies 1 worker Pod, but no Pod is running), so we set expectations.add to 1 because we expect 1 Pod add event.
https://github.com/ray-project/kuberay/blob/0579c90d8e807cfa866b175e27beb9afc5526f89/ray-operator/controllers/ray/raycluster_controller.go#L479
In the second iteration of reconcilePods, even though the worker Pod does not yet appear in the local informer cache (an inconsistency between the Kubernetes API server and the cache), synchronization is skipped because expectations.add is larger than 0.
When the informer cache receives the Pod add event from the Kubernetes API server, the Pod's resource event handler decreases expectations.add by 1. After that, synchronization is no longer skipped, and diff is 0 (no more Pods need to be created).
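For illustration, here is a minimal from-scratch sketch of the expectations idea (the real implementation in k8s.io/kubernetes/pkg/controller also tracks deletions and expires entries with a TTL; all names here are illustrative):

```go
package sketch // illustrative only

import "sync"

// expectations maps a controller key (e.g. "namespace/rayClusterName/groupName")
// to the number of Pod add events the controller still expects to observe.
type expectations struct {
	mu   sync.Mutex
	adds map[string]int
}

func newExpectations() *expectations {
	return &expectations{adds: map[string]int{}}
}

// ExpectCreations is called right before the reconciler issues Create calls.
func (e *expectations) ExpectCreations(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.adds[key] += n
}

// CreationObserved is called from the Pod add event handler.
func (e *expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.adds[key] > 0 {
		e.adds[key]--
	}
}

// Satisfied reports whether the informer cache has caught up with the
// controller's own writes; while it is false, reconcilePods skips creating Pods.
func (e *expectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.adds[key] == 0
}
```

In the #704 scenario, the first reconcile would call ExpectCreations(key, 1) just before creating the worker Pod; the second reconcile sees Satisfied(key) == false and skips creation until the Pod add event handler calls CreationObserved(key).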
Summary
TLDR: We can solve this issue by maintaining expectations.
- Pros: We have more flexibility to support our requirements.
- Cons:
  - Increases system complexity (contributors need to be familiar with how expectations work) => more bugs.
  - We need to implement scale-up/down, update, and deletion ourselves.
Instead of the replicaset solution with non-deterministic pod names and a cache of the controller's actions, it could be simpler to take the statefulset solution and use deterministic pod names -- then double-creation of pods due to cache staleness would manifest as a transient 409 "already exists" conflict.
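For illustration, a rough sketch of what that could look like (the naming scheme and helper are assumptions, not KubeRay's actual code):

```go
package sketch // illustrative only

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createWorkerPod uses a deterministic, StatefulSet-style name, so a
// double-create caused by a stale cache surfaces as a 409 "already exists"
// error that can simply be ignored (the next reconcile converges normally).
func createWorkerPod(ctx context.Context, c client.Client, namespace, groupName string, idx int, spec corev1.PodSpec) error {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-%d", groupName, idx), // deterministic name
			Namespace: namespace,
		},
		Spec: spec,
	}
	err := c.Create(ctx, pod)
	if apierrors.IsAlreadyExists(err) {
		// The Pod was already created by an earlier reconcile; the cache was stale.
		return nil
	}
	return err
}
```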
> Instead of the replicaset solution with non-deterministic pod names and a cache of the controller's actions, it could be simpler to take the statefulset solution and use deterministic pod names -- then double-creation of pods due to cache staleness would manifest as a transient 409 "already exists" conflict.
Good point. Maintaining the names of Pods may be challenging, especially with Ray's autoscaling feature.
However, it might bring more trouble when it comes to failover. For instance, if we name each Pod "${workerGroupsName}-${idx}" and a Pod hits an OOM issue, the controller will restart it with the same Pod name and hostname, which might confuse the GCS about whether this is the old worker node reconnecting or a new one.
I don't think the GCS currently knows the name of the pod. But, conceivably, you could increment indices such that you never recreate a pod with the same name.
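A rough sketch of that idea (the naming scheme is an assumption; a production version would also persist the counter, e.g. in the CR status, so an index is never reused even after the highest-numbered Pod is deleted):

```go
package sketch // illustrative only

import (
	"fmt"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// nextWorkerName scans existing Pod names of the form "<group>-<idx>" and picks
// max(idx)+1, so a replacement Pod never comes back under a name the GCS has
// already seen.
func nextWorkerName(group string, existing []corev1.Pod) string {
	next := 0
	for _, p := range existing {
		suffix := strings.TrimPrefix(p.Name, group+"-")
		if idx, err := strconv.Atoi(suffix); err == nil && idx+1 > next {
			next = idx + 1
		}
	}
	return fmt.Sprintf("%s-%d", group, next)
}
```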
Hmm, introducing expectations definitely increases the complexity a lot. We used a similar strategy in Kubeflow training operators in the past. The tricky thing is that the expectations package is not in the controller tools, so we have to sync the code from Kubernetes core, and it keeps being updated. I feel the easiest fix at this moment is to query the API server directly instead, though that adds some burden for sure.
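For reference, a sketch of the "query the API server directly" option using controller-runtime's APIReader, which bypasses the informer cache at the cost of extra load on the API server (label keys are illustrative):

```go
package sketch // illustrative only

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// listWorkerPodsUncached lists Pods straight from the API server instead of the
// informer cache, so the result reflects the controller's own recent writes.
func listWorkerPodsUncached(ctx context.Context, mgr manager.Manager, namespace, clusterName string) (*corev1.PodList, error) {
	reader := mgr.GetAPIReader() // client.Reader backed by the API server, not the cache

	var pods corev1.PodList
	if err := reader.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/cluster": clusterName}); err != nil {
		return nil, err
	}
	return &pods, nil
}
```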
The test added by #1741 is flaky due to the stale informer cache. See #1746 for more details.
> Hmm, introducing expectations definitely increases the complexity a lot. We used a similar strategy in Kubeflow training operators in the past. The tricky thing is that the expectations package is not in the controller tools, so we have to sync the code from Kubernetes core, and it keeps being updated. I feel the easiest fix at this moment is to query the API server directly instead, though that adds some burden for sure.
This pretty much makes sense.
> The tricky thing is that the expectations package is not in the controller tools, so we have to sync the code from Kubernetes core, and it keeps being updated.
In principle, writing it from scratch is an option :)
Closed by #2150