
Bundle status too big for etcd

webD97 opened this issue 9 months ago • 12 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

I have updated a fleet controller managing ~ 72 clusters to Fleet 0.11.5 and noticed a lot of reconciler errors in the fleet-controller pod logs that look like this:

{
    "level": "error",
    "ts": "2025-03-27T10:43:27Z",
    "msg": "Reconciler error",
    "controller": "bundle",
    "controllerGroup": "fleet.cattle.io",
    "controllerKind": "Bundle",
    "Bundle": {
        "name": "xyz-kube-prometheus-stack",
        "namespace": "xyz"
    },
    "namespace": "xyz",
    "name": "xyz-kube-prometheus-stack",
    "reconcileID": "9eacefc5-fb1d-45cb-b5fe-2fc7de6bdb00",
    "error": "etcdserver: request is too large",
    "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224"
}
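
For context, this "request is too large" error is etcd rejecting any single write above its request size limit, which defaults to 1.5 MiB (etcd's --max-request-bytes flag). On a self-managed control plane that limit could in principle be raised, though this is generally discouraged; on GKE the control plane is managed, so the flag is not tunable anyway. Purely for reference, a sketch of where the flag lives in a kubeadm-style etcd static pod manifest:

# Reference only: etcd's request size limit in a kubeadm-managed static pod manifest.
# The default is 1572864 bytes (1.5 MiB). Not applicable to GKE's managed control plane.
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
    - name: etcd
      command:
        - etcd
        - --max-request-bytes=1572864   # default shown; raising it is discouraged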

I then enabled debug logging and looked at the request body that Fleet was trying to send to etcd. The new bundle status YAML has more than 30k lines, because .status.resourceKey now contains an entry for every single resource being deployed, which in my case means 72 clusters * ~100 resources = 7200 entries. This status alone has a size of about 1 MB; combined with the rest of the resource, this exceeds etcd's limits. See this attachment: status.resourceKey.yaml.txt
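
To make the numbers concrete, each resourceKey entry is a small four-line object, so 72 clusters * ~100 resources ≈ 7200 entries already account for roughly 30k lines on their own. A sketch of the shape (the resource names below are made up):

# Illustrative excerpt of .status.resourceKey; names are invented.
resourceKey:
  - apiVersion: apps/v1
    kind: Deployment
    name: xyz-kube-prometheus-stack-operator
    namespace: monitoring
  - apiVersion: v1
    kind: Service
    name: xyz-kube-prometheus-stack-prometheus
    namespace: monitoring
  # ...one entry per resource, per targeted cluster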

These are pretty rough numbers, and I might well have missed quite a few lines from the log while trying to extract the request.

Previously, we were running a 0.9.x version which only had 100 resources in the status.

I am not sure if this is intended behaviour, but I guess for larger deployments this is an issue.

Expected Behavior

A bundle status should not cause the entire resource to become bigger than etcd's limits.

Steps To Reproduce

  1. Install Fleet on some cluster
  2. Make it manage a large number of downstream clusters (72 in this case)
  3. Make it deploy https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack on all of them (see the sketch after this list)
  4. Observe reconciler errors in the fleet-controller pod
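
A minimal sketch for step 3, assuming a Git repository that contains a fleet.yaml for the chart (the repository URL, path, and chart version below are placeholders):

# GitRepo targeting all downstream clusters (placeholder repo URL and path).
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: monitoring
  namespace: fleet-default
spec:
  repo: https://github.com/example/fleet-monitoring
  branch: main
  paths:
    - kube-prometheus-stack
  targets:
    - clusterSelector: {}   # an empty selector matches every registered cluster

# kube-prometheus-stack/fleet.yaml in that repository (version is only an example).
defaultNamespace: monitoring
helm:
  repo: https://prometheus-community.github.io/helm-charts
  chart: kube-prometheus-stack
  version: "58.0.0"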

Environment

- Architecture: amd64
- Fleet Version: 0.11.5
- Cluster:
  - Provider: GKE
  - Options:
  - Kubernetes Version: 1.32.2

Logs


Anything else?

No response

webD97 · Mar 27 '25

https://github.com/rancher/fleet/issues/3379 Same kind of issue?

johnjcool · Mar 27 '25

Found these: https://github.com/rancher/fleet/issues/1101 and https://github.com/rancher/fleet/issues/2115

Optionally disabling resourceKey generation for fleet-standalone, as it's only used in the UI, would be nice!

johnjcool · Mar 27 '25

Thanks for the report; we're painfully aware of the limitations when using Fleet with etcd.

To resolve this, you can either

kkaempf · Mar 31 '25

Thanks for the suggestions, @kkaempf - I think OCI storage might be an option in the future, but I don't think it will help when the status subresource becomes too big.

I think optionally disabling resourceKey generation would be an interesting middle ground here. If you are open to this option, we will gladly create a PR for it :)

webD97 · Mar 31 '25

Not sure if such a PR would be acceptable as the issue is very specific and limited.

Flagging as feature, to be reviewed in one of the upcoming planning meetings.

kkaempf · Mar 31 '25

To be honest, I'm not sure if "specific and limited" is the case here. kube-prometheus-stack is a chart that is commonly used in the industry and 70 managed clusters is not that much.

Nevertheless I'm looking forward to the result of your planning :)

webD97 · Mar 31 '25

Can this be re-tested with https://github.com/rancher/fleet/releases/tag/v0.12.0 (see "Resources in Status Fields:")?

@aruiz14 worked on a number of PRs that address the resources list in the status.

manno · Apr 09 '25

Regarding OCI storage, it would help with the Bundle.Spec.Resources list, which contains every resource to be deployed. But that list exists only once, not per cluster, so while it is bad, its size depends only on the input chart. If this list is big, changing the bundle often within a short time (<5 min) will indeed lead to etcd db size issues.

The Bundle.Status.ResourceKey is an estimation of what could be deployed. It is misleading and we need to check if we can deprecate it. So, disabling it for standalone with an installation option (env var) is very interesting to me.

manno · Apr 09 '25

> Can this be re-tested with https://github.com/rancher/fleet/releases/tag/v0.12.0 (see "Resources in Status Fields:")?
>
> @aruiz14 worked on a number of PRs that address the resources list in the status.

I will try 0.12 asap. My last attempt at upgrading unfortunately failed because Fleet then had issues deserializing GitRepos. I will try the upgrade once more - if it fails again I might need to open another issue.

webD97 · Apr 10 '25

@webD97 https://github.com/rancher/fleet/issues/3501#issuecomment-2780648657 likely this one.

manno · Apr 10 '25

I upgraded a cluster managing two downstream clusters to Fleet 0.12, but .status.resourceKey is still there. My bundle for kube-prometheus-stack still lists ~200 entries there (2 clusters * ~100 resources). So 0.12 does not seem to improve on this specific issue.

webD97 · Apr 11 '25

We believe we might be able to remove bundle.status.resourceKey and switch the UI to bundledeployment.status.resources instead. This would also make the information in the UI more accurate (a rough sketch follows the links below).

  • https://github.com/rancher/fleet/blob/main/internal/cmd/controller/reconciler/bundle_status.go#L36-L56
  • https://github.com/rancher/dashboard/blob/release-2.11/shell/utils/fleet.ts#L106
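
For comparison, a rough sketch of the per-cluster list on a BundleDeployment; since there is one BundleDeployment per target cluster, its size depends only on that cluster's resources, not on the number of clusters. The entry fields are assumed to mirror the resourceKey entries shown earlier in this issue; names and namespace are illustrative:

# Sketch only: BundleDeployment status with its per-cluster resource list.
apiVersion: fleet.cattle.io/v1alpha1
kind: BundleDeployment
metadata:
  name: xyz-kube-prometheus-stack
  namespace: cluster-fleet-default-example   # per-cluster namespace, illustrative
status:
  resources:
    - apiVersion: apps/v1
      kind: Deployment
      namespace: monitoring
      name: xyz-kube-prometheus-stack-operator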

manno · Apr 11 '25

Additional QA

Problem

Bundle resources may become too large to be stored in etcd, leading to reconciler errors.

Solution

Resource keys have been removed from bundles, in an attempt to reduce bundle sizes.

Testing

This has already been validated from a UI point of view; see this comment.

Engineering Testing

Manual Testing

N/A

Automated Testing

QA Testing Considerations

We need to validate that Bundle resources no longer contain resource keys. The resourceKey field should still exist, as it has merely been deprecated, but it should stay empty and no longer be populated.

Regressions Considerations

N/A

weyfonk · Jun 11 '25

System Information

Rancher Version: v2.12.0-alpha9
Fleet Version: 107.0.0+up0.13.0-alpha.5

Steps followed

  • Created a GitRepo which creates any resource.
  • Waited for resources to be ready.
  • Navigated to Bundles.
  • Created the same GitRepo on a 2.11-head cluster.
  • Waited for resources to be ready.
  • Navigated to Bundles.

Note: In order to check whether resourceKey is present, I used the same GitRepo on both clusters.

✅ Verified that resourceKey is not present in 2.12-head.


Bundle YAML files

Bundle YAML file containing resourceKey, from 2.11-head:
apiVersion: fleet.cattle.io/v1alpha1
kind: Bundle
metadata:
  creationTimestamp: '2025-06-16T13:37:11Z'
  finalizers:
    - fleet.cattle.io/bundle-finalizer
  generation: 1
  labels:
    fleet.cattle.io/commit: acb76ea6478e95e2c4e85c767e9288e40bd97ff2
    fleet.cattle.io/created-by-display-name: admin
    fleet.cattle.io/created-by-user-id: user-5c54b
    fleet.cattle.io/repo-name: test-gitrepo
  name: test-gitrepo-nginx
  namespace: fleet-default
  resourceVersion: '7759'
  uid: fe8c3069-0cc3-42c9-a0c8-2236c43ae133
spec:
  defaultNamespace: nginx
  ignore: {}
  resources:
    - content: |
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx-keep-3
          labels:
            app: nginx
        spec:
          replicas: 3
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
              - name: nginx
                image: nginx:1.14.2
                ports:
                - containerPort: 80
      name: nginx.yaml
  targetRestrictions:
    - clusterSelector:
        matchExpressions:
          - key: provider.cattle.io
            operator: NotIn
            values:
              - harvester
  targets:
    - clusterSelector:
        matchExpressions:
          - key: provider.cattle.io
            operator: NotIn
            values:
              - harvester
      ignore: {}
status:
  conditions:
    - lastUpdateTime: '2025-06-16T13:37:20Z'
      status: 'True'
      type: Ready
  display:
    readyClusters: 3/3
  maxNew: 50
  maxUnavailable: 3
  observedGeneration: 1
  partitions:
    - count: 3
      maxUnavailable: 3
      name: All
      summary:
        desiredReady: 3
        ready: 3
  resourceKey:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx-keep-3
      namespace: nginx
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx-keep-3
      namespace: nginx
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx-keep-3
      namespace: nginx
  resourcesSha256Sum: 9db00bfee6b1e6fbb6778e3e1d4884196ce27e284ae2ed52d1542ca4a6deeabd
  summary:
    desiredReady: 3
    ready: 3
  unavailable: 0


Bundle YAML file without resourceKey, from 2.12-head:
apiVersion: fleet.cattle.io/v1alpha1
kind: Bundle
metadata:
  creationTimestamp: '2025-06-16T13:46:07Z'
  finalizers:
    - fleet.cattle.io/bundle-finalizer
  generation: 1
  labels:
    fleet.cattle.io/commit: acb76ea6478e95e2c4e85c767e9288e40bd97ff2
    fleet.cattle.io/created-by-display-name: admin
    fleet.cattle.io/created-by-user-id: user-vrfk7
    fleet.cattle.io/repo-name: test-girepo-212
  name: test-girepo-212-nginx
  namespace: fleet-default
  resourceVersion: '65439'
  uid: 0750dbbb-9828-477b-972b-910e00c2fd3b
spec:
  defaultNamespace: nginx
  ignore: {}
  resources:
    - content: |
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nginx-keep-3
          labels:
            app: nginx
        spec:
          replicas: 3
          selector:
            matchLabels:
              app: nginx
          template:
            metadata:
              labels:
                app: nginx
            spec:
              containers:
              - name: nginx
                image: nginx:1.14.2
                ports:
                - containerPort: 80
      name: nginx.yaml
  targetRestrictions:
    - clusterSelector:
        matchExpressions:
          - key: provider.cattle.io
            operator: NotIn
            values:
              - harvester
  targets:
    - clusterSelector:
        matchExpressions:
          - key: provider.cattle.io
            operator: NotIn
            values:
              - harvester
      ignore: {}
status:
  conditions:
    - lastUpdateTime: '2025-06-16T13:46:12Z'
      status: 'True'
      type: Ready
  display:
    readyClusters: 3/3
  maxNew: 50
  maxUnavailable: 3
  observedGeneration: 1
  partitions:
    - count: 3
      maxUnavailable: 3
      name: All
      summary:
        desiredReady: 3
        ready: 3
  resourcesSha256Sum: 9db00bfee6b1e6fbb6778e3e1d4884196ce27e284ae2ed52d1542ca4a6deeabd
  summary:
    desiredReady: 3
    ready: 3
  unavailable: 0

sbulage · Jun 16 '25