Bundle status too big for etcd
Is there an existing issue for this?
- [x] I have searched the existing issues
Current Behavior
I have updated a fleet controller managing ~72 clusters to Fleet 0.11.5 and noticed a lot of reconciler errors in the fleet-controller pod logs that look like this:
{
  "level": "error",
  "ts": "2025-03-27T10:43:27Z",
  "msg": "Reconciler error",
  "controller": "bundle",
  "controllerGroup": "fleet.cattle.io",
  "controllerKind": "Bundle",
  "Bundle": {
    "name": "xyz-kube-prometheus-stack",
    "namespace": "xyz"
  },
  "namespace": "xyz",
  "name": "xyz-kube-prometheus-stack",
  "reconcileID": "9eacefc5-fb1d-45cb-b5fe-2fc7de6bdb00",
  "error": "etcdserver: request is too large",
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224"
}
I then enabled debug logging and looked at the request body that Fleet was trying to send to etcd. The new bundle status YAML has more than 30k lines because .status.resourceKey now contains an entry for every single resource on every target cluster, which in my case means roughly 72 clusters * 100 resources = 7200 entries. This status alone has a size of about 1 MB; combined with the rest of the resource, it exceeds etcd's request size limit. See this attachment: status.resourceKey.yaml.txt
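For a sense of scale: each .status.resourceKey entry only holds four short fields, but one entry is written per deployed resource per target cluster, so the list grows multiplicatively. A sketch of a single entry with the back-of-envelope arithmetic (the name and namespace below are illustrative, not taken from the attachment):

# One .status.resourceKey entry, roughly 100-150 bytes once serialized.
- apiVersion: apps/v1
  kind: Deployment
  name: kube-prometheus-stack-operator   # illustrative
  namespace: monitoring                  # illustrative
# 72 clusters x ~100 chart resources = ~7200 entries
# ~7200 entries x ~140 bytes each ≈ 1 MB of status alone, measured against
# etcd's default request size limit of ~1.5 MiB for the whole object.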
These are pretty rough numbers, and I know I may even have missed quite a few lines of the log while trying to extract the request.
Previously, we were running a 0.9.x version which only had 100 resources in the status.
I am not sure if this is intended behaviour, but I suspect it becomes an issue for larger deployments.
Expected Behavior
A bundle status should not cause the entire resource to become bigger than etcd's limits.
Steps To Reproduce
- Install fleet to some cluster
- Make it manage a large number of downstream clusters (in this case 72)
- Make it deploy https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack on all of them (a minimal GitRepo sketch follows after this list)
- Observe reconciler errors in fleet-controller pod
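For reference, a minimal sketch of how such a deployment could look with Fleet; the repository URL, path, and namespace below are placeholders, not taken from the report:

# GitRepo pointing Fleet at a Git repository that contains a fleet.yaml for the chart.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: monitoring
  namespace: fleet-default
spec:
  repo: https://github.com/example/fleet-config   # placeholder repository
  branch: main
  paths:
    - kube-prometheus-stack
  targets:
    - clusterSelector: {}   # an empty selector targets all downstream clusters
---
# fleet.yaml inside the kube-prometheus-stack/ directory of that repository.
defaultNamespace: monitoring
helm:
  repo: https://prometheus-community.github.io/helm-charts
  chart: kube-prometheus-stack
  # version: pin the desired chart version here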
Environment
- Architecture: amd64
- Fleet Version: 0.11.5
- Cluster:
- Provider: GKE
- Options:
- Kubernetes Version: 1.32.2
Logs
Anything else?
No response
https://github.com/rancher/fleet/issues/3379 Same kind of issue?
found this: https://github.com/rancher/fleet/issues/1101 https://github.com/rancher/fleet/issues/2115
optionally disable resourceKey generation for fleet-standalone, as it's only used in the UI --> would be nice!
Thanks for the report, we're painfully aware of the limitations when using Fleet with etcd.
To resolve this, you can either
- update the etcd configuration to allow larger requests (a configuration sketch follows below)
- switch to OCI Storage instead of etcd (needs latest Fleet)
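For the first option, on a self-managed control plane (not possible on managed offerings such as GKE, where etcd is not user-configurable), this typically means raising etcd's request size limit. A minimal sketch for a kubeadm-style static pod manifest, assuming the default ~1.5 MiB limit is the one being hit; other components in the write path may impose their own limits:

# Excerpt of /etc/kubernetes/manifests/etcd.yaml -- only the added flag is shown.
# --max-request-bytes raises the maximum size of a single etcd request; keep it
# well below --quota-backend-bytes, and note that very large objects still put
# pressure on etcd and the apiserver watch cache.
spec:
  containers:
    - name: etcd
      command:
        - etcd
        - --max-request-bytes=4194304   # 4 MiB instead of the ~1.5 MiB default
        # ...existing flags unchanged...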
Thanks for the suggestions, @kkaempf - I think OCI storage might be an option in the future, but it won't help when the status subresource itself becomes too big.
I think optionally disabling resourceKey generation would be an interesting middle ground here. If you are open to this option, we will gladly create a PR for it :)
Not sure if such a PR would be acceptable as the issue is very specific and limited.
Flagging as feature, to be reviewed in one of the upcoming planning meetings.
To be honest, I'm not sure "specific and limited" is the case here: kube-prometheus-stack is a chart that is commonly used in the industry, and 70 managed clusters is not that many.
Nevertheless I'm looking forward to the result of your planning :)
Can this be re-tested with https://github.com/rancher/fleet/releases/tag/v0.12.0 (see "Resources in Status Fields:")?
@aruiz14 worked on a number of PRs that address the resources list in the status.
Regarding OCI storage, it would help with the Bundle.Spec.Resources list, which contains every resource to be deployed. But that list is stored only once, not per cluster, so while it is bad, its size only depends on the input chart. If this list is big, changing the bundle often in a short time (<5min) will indeed lead to etcd DB size issues.
The Bundle.Status.ResourceKey is an estimation of what could be deployed. It is misleading and we need to check if we can deprecate it. So, disabling it for standalone with an installation option (env var) is very interesting to me.
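Purely to illustrate the shape such an opt-out could take (none of the names below exist in Fleet today; this is a hypothetical sketch, not an implemented option):

# HYPOTHETICAL: a chart value that the fleet Helm chart would render as an env var
# on the fleet-controller Deployment; the bundle status reconciler would then skip
# populating .status.resourceKey when it is set.
# values.yaml (fleet chart):
disableResourceKeyStatus: true   # hypothetical value name
# rendered onto the controller Deployment as:
#   env:
#     - name: FLEET_DISABLE_RESOURCEKEY_STATUS   # hypothetical env var name
#       value: "true"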
> Can this be re-tested with https://github.com/rancher/fleet/releases/tag/v0.12.0 (see "Resources in Status Fields:")?
> @aruiz14 worked on a number of PRs that address the resources list in the status.
I will try 0.12 asap. My last attempt at upgrading unfortunately failed because Fleet then had issues deserializing GitRepos. I will try the upgrade once more - if it fails again I might need to open another issue.
@webD97 https://github.com/rancher/fleet/issues/3501#issuecomment-2780648657 likely this one.
I upgraded a cluster managing two downstream clusters to Fleet 0.12, but .status.resourceKey is still there. My bundle for kube-prometheus-stack still lists ~200 entries in there (2 clusters * ~100 resources). So 0.12 does not seem to improve on this specific issue.
We believe we might be able to remove bundle.status.resourceKey and switch the UI to bundledeployment.status.resources instead. This would also make the information in the UI more accurate. A sketch of the bundledeployment.status.resources shape follows the links below.
- https://github.com/rancher/fleet/blob/main/internal/cmd/controller/reconciler/bundle_status.go#L36-L56
- https://github.com/rancher/dashboard/blob/release-2.11/shell/utils/fleet.ts#L106
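For reference, the per-cluster list the UI would switch to lives on each BundleDeployment in the downstream cluster's Fleet namespace. A sketch of the expected shape, assuming the entries carry the same fields as the bundle's resourceKey entries (names are illustrative; check the linked reconciler code for the exact fields):

# Excerpt of a BundleDeployment (one per bundle per cluster); field names assumed,
# values illustrative.
status:
  resources:
    - apiVersion: apps/v1
      kind: Deployment
      name: kube-prometheus-stack-operator
      namespace: monitoring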
Additional QA
Problem
Bundle resources may become too large to be stored in etcd, leading to reconciler errors.
Solution
Resource keys have been removed from bundles, in an attempt to reduce bundle sizes.
Testing
This has already been validated from a UI point of view; see this comment.
Engineering Testing
Manual Testing
N/A
Automated Testing
QA Testing Considerations
We need to validate that Bundle resources no longer contain resource keys in their status. The resourceKey field should still exist, as it has merely been deprecated, but it should stay empty and no longer be populated.
Regressions Considerations
N/A
System Information
| Rancher Version | Fleet Version |
|---|---|
| v2.12.0-alpha9 | 107.0.0+up0.13.0-alpha.5 |
Steps followed
- Created a GitRepo which creates any resource.
- Waited for the resources to be ready.
- Navigated to Bundles.
- Created the same GitRepo on the 2.11-head cluster.
- Waited for the resources to be ready.
- Navigated to Bundles.
Note: In order to check whether resourceKey is present, I used the same GitRepo on both clusters.
✅ Verified that resourceKey is not present in 2.12-head.
Bundle YAML file containing resourceKey (from 2.11-head)
apiVersion: fleet.cattle.io/v1alpha1
kind: Bundle
metadata:
  creationTimestamp: '2025-06-16T13:37:11Z'
  finalizers:
    - fleet.cattle.io/bundle-finalizer
  generation: 1
  labels:
    fleet.cattle.io/commit: acb76ea6478e95e2c4e85c767e9288e40bd97ff2
    fleet.cattle.io/created-by-display-name: admin
    fleet.cattle.io/created-by-user-id: user-5c54b
    fleet.cattle.io/repo-name: test-gitrepo
  name: test-gitrepo-nginx
  namespace: fleet-default
  resourceVersion: '7759'
  uid: fe8c3069-0cc3-42c9-a0c8-2236c43ae133
spec:
  defaultNamespace: nginx
  ignore: {}
  resources:
    - content: |
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx-keep-3
          labels:
            app: nginx
        spec:
          replicas: 3
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
                - name: nginx
                  image: nginx:1.14.2
                  ports:
                    - containerPort: 80
      name: nginx.yaml
  targetRestrictions:
    - clusterSelector:
        matchExpressions:
          - key: provider.cattle.io
            operator: NotIn
            values:
              - harvester
  targets:
    - clusterSelector:
        matchExpressions:
          - key: provider.cattle.io
            operator: NotIn
            values:
              - harvester
      ignore: {}
status:
  conditions:
    - lastUpdateTime: '2025-06-16T13:37:20Z'
      status: 'True'
      type: Ready
  display:
    readyClusters: 3/3
  maxNew: 50
  maxUnavailable: 3
  observedGeneration: 1
  partitions:
    - count: 3
      maxUnavailable: 3
      name: All
      summary:
        desiredReady: 3
        ready: 3
  resourceKey:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx-keep-3
      namespace: nginx
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx-keep-3
      namespace: nginx
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx-keep-3
      namespace: nginx
  resourcesSha256Sum: 9db00bfee6b1e6fbb6778e3e1d4884196ce27e284ae2ed52d1542ca4a6deeabd
  summary:
    desiredReady: 3
    ready: 3
    unavailable: 0
Bundle YAML file without resourceKey (from 2.12-head)
apiVersion: fleet.cattle.io/v1alpha1
kind: Bundle
metadata:
  creationTimestamp: '2025-06-16T13:46:07Z'
  finalizers:
    - fleet.cattle.io/bundle-finalizer
  generation: 1
  labels:
    fleet.cattle.io/commit: acb76ea6478e95e2c4e85c767e9288e40bd97ff2
    fleet.cattle.io/created-by-display-name: admin
    fleet.cattle.io/created-by-user-id: user-vrfk7
    fleet.cattle.io/repo-name: test-girepo-212
  name: test-girepo-212-nginx
  namespace: fleet-default
  resourceVersion: '65439'
  uid: 0750dbbb-9828-477b-972b-910e00c2fd3b
spec:
  defaultNamespace: nginx
  ignore: {}
  resources:
    - content: |
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx-keep-3
          labels:
            app: nginx
        spec:
          replicas: 3
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
                - name: nginx
                  image: nginx:1.14.2
                  ports:
                    - containerPort: 80
      name: nginx.yaml
  targetRestrictions:
    - clusterSelector:
        matchExpressions:
          - key: provider.cattle.io
            operator: NotIn
            values:
              - harvester
  targets:
    - clusterSelector:
        matchExpressions:
          - key: provider.cattle.io
            operator: NotIn
            values:
              - harvester
      ignore: {}
status:
  conditions:
    - lastUpdateTime: '2025-06-16T13:46:12Z'
      status: 'True'
      type: Ready
  display:
    readyClusters: 3/3
  maxNew: 50
  maxUnavailable: 3
  observedGeneration: 1
  partitions:
    - count: 3
      maxUnavailable: 3
      name: All
      summary:
        desiredReady: 3
        ready: 3
  resourcesSha256Sum: 9db00bfee6b1e6fbb6778e3e1d4884196ce27e284ae2ed52d1542ca4a6deeabd
  summary:
    desiredReady: 3
    ready: 3
    unavailable: 0