[SURE-3711] Long time for changes to propagate from GitOps to cluster state
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
We are experiencing very slow Bundle deployments in one of our largest clusters. Most of the time is spent in WaitApplied, and we see up to 10-15 minutes for a Bundle to transition back to Ready after a modification.
When measuring, we exclude any changes that require fetching container images; e.g. changing `replicaCount: 0` takes 10-15 minutes to propagate from the PR merge in our GitOps repo until it is actually applied in the cluster.
It is not clear at the moment where the time is spent, but we suspect the fleet-agent, given the scale (number of Bundles) we are running at.
We have several smaller clusters that do not experience the same problem.
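For reference, the kind of change we time end-to-end is a one-line value tweak in a bundle's fleet.yaml; a minimal sketch (the chart path, namespace and values below are placeholders, not our real config):

```yaml
# fleet.yaml in the GitOps repo (illustrative; chart path and namespace are placeholders)
defaultNamespace: example-app
helm:
  chart: ./chart          # local chart in the repo, placeholder path
  values:
    replicaCount: 0       # flipping this value is the change we measure end-to-end
```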
Expected Behavior
We would expect changes to propagate in less than 60 seconds, assuming they do not require fetching new images. Is that a reasonable expectation?
This is measured from the PR merge in the GitOps repo to the cluster state being updated and the Bundle being Ready again.
Steps To Reproduce
Not clear. This is most likely related to our environment and scale; see the next section for details.
Environment
- Architecture: amd64
- Fleet Version: v0.3.11 2958e9b
- Cluster:
- Provider: Rancher + k3s
- Options:
- 450 nodes
- 1260 Bundles
- 4400 pods
- 5697 resources
- Kubernetes Version: v1.22.12+k3s1
Logs
As we have a very large number of Bundles, I will hold off on pasting output that might not be relevant.
Happy to provide specific logs if needed to debug this issue further.
Anything else?
Based on our observations, we do not seem to be limited by CPU/RAM/network bandwidth on our "master" nodes where fleet-agent is running; these are really monster machines :)
We would appreciate some feedback from the Fleet team on this issue.
Interesting. Are you using webhooks or do you set the pollingInterval for git repos?
And do you use a "rolloutStrategy" with partitions?
The status of the affected bundle would be interesting. Bundle resources are created by the gitjob. Then the fleet-controller creates a bundledeployment and a content resource for each bundle/cluster combination. It would be interesting to see if those are created correctly for deployments that are stuck in WaitApplied. The bundledeployment references a content resource; the reference should be to the latest one.
Only then the agent would be able to deploy the bundle on a downstream cluster.
The calculation of WaitApplied is very involved and called from several places: https://github.com/rancher/fleet/blob/master/pkg/summary/summary.go#L16 Maybe a tuned rollout strategy can help? Are there any failed/modified bundles?
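For reference, polling is configured on the GitRepo, while the rollout strategy lives in the bundle's fleet.yaml; a rough sketch of both (the repo URL, interval, labels and values below are placeholders/examples, not recommendations):

```yaml
# GitRepo on the management cluster (placeholder name, URL and interval)
kind: GitRepo
apiVersion: fleet.cattle.io/v1alpha1
metadata:
  name: example-repo
  namespace: fleet-default
spec:
  repo: https://example.com/org/gitops-repo
  pollingInterval: 15s            # only relevant when no webhook is configured
---
# fleet.yaml inside the repo - a partitioned rollout strategy (example values only)
rolloutStrategy:
  maxUnavailable: 10%             # limit how much changes at once
  maxUnavailablePartitions: 1
  partitions:
    - name: canary                # placeholder partition
      clusterSelector:
        matchLabels:
          env: canary             # placeholder label
```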
Are you using webhooks or do you set the pollingInterval for git repos?
We are using pollingInterval, every 5 seconds I believe.
And do you use a "rolloutStrategy" with partitions?
No, so defaults should apply.
The status of the affected bundle would be interesting. Bundle resources are created by the gitjob.
The mentioned time (10-15 minutes) is when monitoring the Bundle resource.
Then the fleet-controller creates a bundledeployment and a content resource for each bundle/cluster combination. It would be interesting to see if those are created correctly for deployments that are stuck in WaitApplied.
Will have a closer look at the BundleDeployment and come back to you.
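For anyone following along, the reference to check looks roughly like this; the field names are taken from the Fleet CRDs as I understand them, and all names and hashes below are placeholders:

```yaml
# BundleDeployment in the downstream cluster's namespace on the management cluster
# (all names and hashes below are placeholders)
kind: BundleDeployment
apiVersion: fleet.cattle.io/v1alpha1
metadata:
  name: example-bundle
  namespace: cluster-fleet-default-cluster2-abc123
spec:
  # should reference the Content resource built from the latest commit
  deploymentID: "s-<content-sha>:<options-hash>"
status:
  # set by the agent once that deploymentID has actually been applied
  appliedDeploymentID: "s-<content-sha>:<options-hash>"
  ready: true
```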
Maybe a tuned rollout strategy can help?
Do you mean cluster partitioning?
Let me describe our setup in a bit more detail.
We have one Rancher/Fleet instance managing 2 cluster groups with a total of 10 clusters.
- eks - AWS-managed clusters - we have not had issues here and fairly low volume, so let's not focus on this one
  - 4 clusters
- k3s - self-hosted k3s clusters
  - 6 clusters
We have 6 GitRepo objects (one for each cluster).
We also have a "generic" GitRepo that applies changes to all clusters.
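To make that split concrete, the per-cluster repos target a single cluster while the generic one targets everything; roughly like this (names, URLs and selectors are placeholders, not our actual configuration):

```yaml
# per-cluster GitRepo - targets one downstream cluster (placeholder names)
kind: GitRepo
apiVersion: fleet.cattle.io/v1alpha1
metadata:
  name: cluster2-apps
  namespace: fleet-default
spec:
  repo: https://example.com/org/cluster2-gitops
  targets:
    - clusterName: cluster-2
---
# "generic" GitRepo - an empty selector matches every cluster
kind: GitRepo
apiVersion: fleet.cattle.io/v1alpha1
metadata:
  name: generic-apps
  namespace: fleet-default
spec:
  repo: https://example.com/org/generic-gitops
  targets:
    - clusterSelector: {}
```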
The k3s clusters:
| Name | Resources | Nodes | Deployments |
|---|---|---|---|
| Cluster 1 | 425 | 7 | 27 |
| Cluster 2 | 3771 | 436 | 1073 |
| Cluster 3 | 1561 | 83 | 342 |
| Cluster 4 | 1108 | 171 | 220 |
| Cluster 5 | 2183 | 231 | 1096 |
| Cluster 6 | 348 | 8 | 29 |
This means that the same fleet-controller manages all these clusters.
We are only experiencing problems with Cluster 2, which is why I suspect it is a downstream problem, but we have not been able to pinpoint it.
Are there any failed/modified bundles?
Yes, there typically are, as the cluster is in a "fluid state". We have attempted to remove any failing/modified bundles to make the cluster "green", but we did not see any impact on deployment times.
SURE-3711
Let's install a cluster with about a hundred nodes and try to replicate this.
Yes, there typically are, as the cluster is in a "fluid state". We have attempted to remove any failing/modified bundles to make the cluster "green", but we did not see any impact on deployment times.
That's interesting. I would have expected that to reduce the overall number of events. Probably not enough.
In Fleet < 0.10 the agent reconciler, which installs the bundles on clusters, only has 5 workers to process events. When a gitrepo creates, say, 50 bundles for that agent, it will only work on 5 in parallel. (Resources change more than once during an installation, so in practice it's a bit worse.)
Fleet 0.10.3/0.10.4 will increase this number. We plan to make it configurable in the future.
@mirzak In case you're still around for this old issue, can you retry once 2.9.3 (Fleet >=0.10.3) is released?
I no longer work on that particular project but @tmartensson might be interested.
Cleaning up the backlog, might be fixed in current versions.