
BundleDeployment has not been updated 3 hours after the Bundle was updated.

aDisplayName opened this issue 2 months ago • 2 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

In our setup, we have observed that 3 hours after the content in the Git repository was updated, the change had still not been deployed to our target cluster.

The fleet.yaml in our Git repository is updated fairly frequently by the CI/CD pipeline, about a dozen times per hour.

On the Rancher local server, we checked the bundle secret in the fleet-default namespace, and it is always the latest:

[screenshot omitted]

But when we spot-checked a few bundle deployment secrets, they were at least 3 hours behind:

[screenshots omitted]

This matches the content deployed to the downstream clusters.
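
For reference, this is roughly how we spot-check it from the Rancher local cluster. The fleet.cattle.io labels are the ones Fleet sets on the objects, and <gitrepo-name> is a placeholder for our GitRepo name:

# Commit the GitRepo has last picked up from Git.
kubectl get gitrepo <gitrepo-name> -n fleet-default -o jsonpath='{.status.commit}{"\n"}'

# Commit each BundleDeployment was last built from, one row per downstream cluster namespace.
kubectl get bundledeployments -A -l fleet.cattle.io/repo-name=<gitrepo-name> \
  -L fleet.cattle.io/commit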

Expected Behavior

We would expect the corresponding BundleDeployment for each target to always be up to date.

Steps To Reproduce

No response

Environment

- Rancher: v2.12.1
  - Provider: GKE
  - Kubernetes Version: v1.32.9-gke.1108000
- Fleet Version: 0.13.1
- Downstream Cluster:
  - Provider: k3s
  - Options: 
  - Kubernetes Version: 1.28.1

Logs


Anything else?

The other Fleet GitRepos on the same Rancher server don't seem to have the same deployment problem; changes are always deployed to downstream clusters within a reasonable time.

We are using gitjob to trigger the GitRepo updates; the source Git repositories are hosted in Azure DevOps.

Can you tell us where to get the required logs?

aDisplayName avatar Oct 23 '25 20:10 aDisplayName

We recreated the GitRepo with a different GitRepo object name, and the delay now seems reduced to a few minutes (targeting 100+ clusters).

The Rancher server was upgraded from 2.10.1 directly to 2.12.2, then we saw the fleet controller was not upgraded correctly (still stuck on 0.11.x), so we downgraded the Rancher server to 2.11.x and then upgraded again to 2.12.1.

aDisplayName avatar Oct 29 '25 20:10 aDisplayName

The Rancher server was upgraded from 2.10.1 directly to 2.12.2, then we saw the fleet controller was not upgraded correctly (still stuck on 0.11.x), so we downgraded the Rancher server to 2.11.x and then upgraded again to 2.12.1.

Skipping minor versions on upgrades is not tested and not supported.

kkaempf avatar Oct 30 '25 13:10 kkaempf

Closing this, as the issue seems resolved. If not, please provide more information.

weyfonk avatar Nov 17 '25 10:11 weyfonk

We had another Rancher server instance starting at 2.10.4, and we performed the upgrade using the following path: 2.10.4 → 2.11.3 → 2.12.3. The same behavior showed up again.

aDisplayName avatar Dec 10 '25 23:12 aDisplayName

I am indeed on 2.11.3 and I am still experiencing this. I have cases where even after 15 hours the BundleDeployment is not updated, and the only way forward is to perform a force update.
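
For reference, the force update I mean can also be done from kubectl. As far as I can tell (this is an assumption, not something I've verified in the Fleet code), the UI button just bumps spec.forceSyncGeneration on the GitRepo; <gitrepo-name> is a placeholder:

# Read the current value, then patch it to any higher number to make Fleet redeploy
# the GitRepo's bundles (replace 2 with the current value + 1).
kubectl get gitrepo <gitrepo-name> -n fleet-default -o jsonpath='{.spec.forceSyncGeneration}{"\n"}'
kubectl patch gitrepo <gitrepo-name> -n fleet-default --type merge \
  -p '{"spec":{"forceSyncGeneration": 2}}'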

matteotumiati avatar Dec 11 '25 08:12 matteotumiati

Hi,

here are my findings:

If at some point there's a mismatch between the YAML published on Git and Rancher (e.g. the YAML has a different resourceVersion, a field of a different type, or a wrong format), then the bundle stops trying to update.

The deploy error happens, then gitjob receives new commits after the error (probably containing fixes):

{"level":"info","ts":"2025-12-11T18:54:57Z","logger":"gitjob","msg":"New commit from repository","controller":"gitrepo","controllerGroup":"fleet.cattle.io","controllerKind":"GitRepo","GitRepo":{"name":"my-scripts-local","namespace":"fleet-local"},"namespace":"fleet-local","name":"my-scripts-local","reconcileID":"07f6489c-583a-428d-a631-0fd385cd9de4","generation":3,"commit":"d8973ad86c2154a5c7e83706f2fe53c192dfef56","conditions":[{"type":"Ready","status":"False","lastUpdateTime":"2025-12-11T17:20:44Z","message":"OutOfSync(1) [Cluster fleet-local/local]"},{"type":"GitPolling","status":"True","lastUpdateTime":"2025-12-11T14:39:08Z"},{"type":"Reconciling","status":"False","lastUpdateTime":"2025-12-11T14:39:08Z"},{"type":"Stalled","status":"False","lastUpdateTime":"2025-12-11T14:39:08Z"},{"type":"Accepted","status":"True","lastUpdateTime":"2025-12-11T14:39:08Z"}],"newCommit":"1f53b786bad18ca9bf0e7dacfe6f1da0bff795da"}
{"level":"info","ts":"2025-12-11T18:55:04Z","logger":"gitjob","msg":"job deletion triggered because job succeeded","controller":"gitrepo","controllerGroup":"fleet.cattle.io","controllerKind":"GitRepo","GitRepo":{"name":"my-scripts-local","namespace":"fleet-local"},"namespace":"fleet-local","name":"my-scripts-local","reconcileID":"328174c3-2d07-48fc-941b-0d3a95ec3d8c","generation":3,"commit":"1f53b786bad18ca9bf0e7dacfe6f1da0bff795da","conditions":[{"type":"Ready","status":"False","lastUpdateTime":"2025-12-11T17:20:44Z","message":"OutOfSync(1) [Cluster fleet-local/local]"},{"type":"GitPolling","status":"True","lastUpdateTime":"2025-12-11T14:39:08Z"},{"type":"Reconciling","status":"False","lastUpdateTime":"2025-12-11T14:39:08Z"},{"type":"Stalled","status":"False","lastUpdateTime":"2025-12-11T14:39:08Z"},{"type":"Accepted","status":"True","lastUpdateTime":"2025-12-11T14:39:08Z"}]}

But then the controller does not see that there are changes in the new commit:

{"level":"info","ts":"2025-12-11T18:55:40Z","logger":"bundle","msg":"Unchanged bundledeployment","controller":"bundle","controllerGroup":"fleet.cattle.io","controllerKind":"Bundle","Bundle":{"name":"my-scripts-new-dir","namespace":"fleet-default"},"namespace":"fleet-default","name":"my-scripts-new-dir","reconcileID":"88de00cd-3002-4ef6-9e5a-182f7f55e28c","gitrepo":"my-scripts","commit":"1f53b786bad18ca9bf0e7dacfe6f1da0bff795da","manifestID":"s-79c37bd6d1567ce8bddf740c62fa783a62a7b5acc70529634a2eca721f229","bundledeployment":{"metadata":{"name":"my-scripts-new-dir","namespace":"cluster-fleet-default-load-cluster-01-bce450bbbeab","creationTimestamp":null,"labels":{"fleet.cattle.io/bundle-name":"my-scripts-new-dir","fleet.cattle.io/bundle-namespace":"fleet-default","fleet.cattle.io/cluster":"load-cluster-01","fleet.cattle.io/cluster-namespace":"fleet-default","fleet.cattle.io/commit":"1f53b786bad18ca9bf0e7dacfe6f1da0bff795da","fleet.cattle.io/created-by-display-name":"admin","fleet.cattle.io/created-by-user-id":"user-jqj7s","fleet.cattle.io/managed":"true","fleet.cattle.io/repo-name":"my-scripts"},"finalizers":["fleet.cattle.io/bundle-deployment-finalizer"]},"spec":{"stagedOptions":{"helm":{},"forceSyncGeneration":5,"ignore":{}},"stagedDeploymentID":"s-79c37bd6d1567ce8bddf740c62fa783a62a7b5acc70529634a2eca721f229:cf0755d3a6989987818f83f4e2442479fb2715fc8d68557d52605d86c7841dbe","options":{"helm":{},"forceSyncGeneration":5,"ignore":{}},"deploymentID":"s-79c37bd6d1567ce8bddf740c62fa783a62a7b5acc70529634a2eca721f229:cf0755d3a6989987818f83f4e2442479fb2715fc8d68557d52605d86c7841dbe"},"status":{"display":{},"resourceCounts":{"ready":0,"desiredReady":0,"waitApplied":0,"modified":0,"orphaned":0,"missing":0,"unknown":0,"notReady":0}}},"deploymentID":"s-79c37bd6d1567ce8bddf740c62fa783a62a7b5acc70529634a2eca721f229:cf0755d3a6989987818f83f4e2442479fb2715fc8d68557d52605d86c7841dbe","operation":"unchanged"}

The logs I shared here are from my tests.

Thanks.

marytlf avatar Dec 11 '25 21:12 marytlf

@marytlf which version of Rancher (resp. Fleet) are you testing?

kkaempf avatar Dec 12 '25 10:12 kkaempf

@kkaempf Rancher: v2.11.3, Fleet: v0.12.4

marytlf avatar Dec 12 '25 13:12 marytlf

@marytlf hmm, that's already pretty old. It would be great if you could re-test with a more recent version (like Rancher 2.13.0, or 2.13.1 in about a week).

kkaempf avatar Dec 12 '25 13:12 kkaempf

@kkaempf

With Rancher v2.12.3 I got a similar error (plus something else).

The update takes longer than the configured pollingInterval (I set it to 150s and it only updated much later than that). See the screenshot:

[screenshot omitted]
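
For context, this is how the 150s interval was set; spec.pollingInterval is the field name I'm assuming here, and my-scripts-cl2 is the GitRepo from the logs below:

# Set the polling interval on the GitRepo, then watch when it actually records a new commit.
kubectl patch gitrepo my-scripts-cl2 -n fleet-default --type merge \
  -p '{"spec":{"pollingInterval":"150s"}}'
kubectl get gitrepo my-scripts-cl2 -n fleet-default -w \
  -o custom-columns='NAME:.metadata.name,COMMIT:.status.commit'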

If there's an error on the resource being deployed, the latest commit is not applied and the status remains in error.

This error appears in the fleet-controller log:

`{"level":"error","ts":"2025-12-12T15:12:22Z","msg":"Reconciler error","controller":"bundle","controllerGroup":"fleet.cattle.io","controllerKind":"Bundle","Bundle":{"name":"my-scripts-cl2-new-dir-2","namespace":"fleet-default"},"namespace":"fleet-default","name":"my-scripts-cl2-new-dir-2","reconcileID":"cf3ab6f1-b17a-4d2b-9c1c-3b31f2ec3f4d","error":"failed to create bundle deployment: failed to get content resource: Content.fleet.cattle.io \"s-0b1bb5471d748dda0f2b9b7e8ea5f7fbfb647d23f9ad9b16257d249c8ac34\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:353\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202"}`

Then if I force the update in the UI, I get the same error in the logs:

{"level":"info","ts":"2025-12-12T15:21:09Z","logger":"bundle","msg":"Updated bundledeployment","controller":"bundle","controllerGroup":"fleet.cattle.io","controllerKind":"Bundle","Bundle":{"name":"my-scripts-cl2-new-dir-2","namespace":"fleet-default"},"namespace":"fleet-default","name":"my-scripts-cl2-new-dir-2","reconcileID":"05d9c32b-bcbe-42b2-9435-f1af763c0e34","gitrepo":"my-scripts-cl2","commit":"c7f33dc3d395f988af499ed4b482f54413b0eaa5","manifestID":"s-0b1bb5471d748dda0f2b9b7e8ea5f7fbfb647d23f9ad9b16257d249c8ac34","bundledeployment":{"metadata":{"name":"my-scripts-cl2-new-dir-2","namespace":"cluster-fleet-default-load-cluster-01-bce450bbbeab","creationTimestamp":null,"labels":{"fleet.cattle.io/bundle-name":"my-scripts-cl2-new-dir-2","fleet.cattle.io/bundle-namespace":"fleet-default","fleet.cattle.io/cluster":"load-cluster-01","fleet.cattle.io/cluster-namespace":"fleet-default","fleet.cattle.io/commit":"c7f33dc3d395f988af499ed4b482f54413b0eaa5","fleet.cattle.io/created-by-user-id":"user-lkxfx","fleet.cattle.io/managed":"true","fleet.cattle.io/repo-name":"my-scripts-cl2"},"finalizers":["fleet.cattle.io/bundle-deployment-finalizer"]},"spec":{"stagedOptions":{"helm":{},"forceSyncGeneration":3,"keepResources":true},"stagedDeploymentID":"s-0b1bb5471d748dda0f2b9b7e8ea5f7fbfb647d23f9ad9b16257d249c8ac34:decad3bb6b62e0b255715bddc279cdd1d8a3ab3f66c763c5cdb8984b5be990d3","options":{"helm":{},"forceSyncGeneration":3,"keepResources":true},"deploymentID":"s-0b1bb5471d748dda0f2b9b7e8ea5f7fbfb647d23f9ad9b16257d249c8ac34:decad3bb6b62e0b255715bddc279cdd1d8a3ab3f66c763c5cdb8984b5be990d3"},"status":{"display":{},"resourceCounts":{"ready":0,"desiredReady":0,"waitApplied":0,"modified":0,"orphaned":0,"missing":0,"unknown":0,"notReady":0}}},"deploymentID":"s-0b1bb5471d748dda0f2b9b7e8ea5f7fbfb647d23f9ad9b16257d249c8ac34:decad3bb6b62e0b255715bddc279cdd1d8a3ab3f66c763c5cdb8984b5be990d3","operation":"updated"}
{"level":"error","ts":"2025-12-12T15:21:09Z","msg":"Reconciler error","controller":"bundle","controllerGroup":"fleet.cattle.io","controllerKind":"Bundle","Bundle":{"name":"my-scripts-cl2-new-dir-2","namespace":"fleet-default"},"namespace":"fleet-default","name":"my-scripts-cl2-new-dir-2","reconcileID":"d8883d2c-8576-4a1e-8832-e0a865ffd6ea","error":"failed to create bundle deployment: failed to get content resource: Content.fleet.cattle.io \"s-0b1bb5471d748dda0f2b9b7e8ea5f7fbfb647d23f9ad9b16257d249c8ac34\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:353\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202"}

The error that the UI shows after trying to update is:

NotReady(1) [Cluster fleet-default/load-cluster-01: not installed: Unable to continue with install: Deployment "cattle-cluster-agent" in namespace "cattle-system" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "my-scripts-cl2-new-dir-2"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "default"]

What I am changing in the resource YAML:

  • replicas from 2 -> 6
  • the annotation/label values that the UI says are wrong.
[screenshot omitted]
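
For reference, the ownership metadata Helm complains about can be inspected directly on the downstream cluster; the resource, label, and annotation names are taken from the error message above:

# Look for app.kubernetes.io/managed-by in the labels and meta.helm.sh/release-name /
# meta.helm.sh/release-namespace in the annotations of the existing deployment.
kubectl get deployment cattle-cluster-agent -n cattle-system \
  -o jsonpath='{.metadata.labels}{"\n"}{.metadata.annotations}{"\n"}'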

marytlf avatar Dec 12 '25 15:12 marytlf

@kkaempf, do you need more logs? Which logs, and where should we collect them from? We have a production server on Rancher 2.12.3 (Fleet 0.13.4) where the problem has reappeared: some of the BundleDeployments stopped being synced from the bundle 4 days ago. We need to apply the workaround again to get production going, but we can collect logs if it happens again next time.

In the meantime, we will try to upgrade a dev server from v2.12.3 to v2.13.0, if that helps your troubleshooting.
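
In case it helps, this is roughly what we would run to collect the logs ourselves, assuming cattle-fleet-system/fleet-controller is the right deployment (please correct us if a different component's logs are needed); <bundle-name> is a placeholder:

# Grab recent fleet-controller logs from the Rancher local cluster and keep only the
# lines mentioning the affected bundle.
kubectl logs -n cattle-fleet-system deploy/fleet-controller --all-containers --since=48h \
  | grep <bundle-name> > fleet-controller-filtered.log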

aDisplayName avatar Dec 16 '25 18:12 aDisplayName

I think this can be related to: https://github.com/rancher/fleet/issues/4458

susesamu avatar Dec 17 '25 17:12 susesamu

@kkaempf ,

Here is a log from cattle-fleet-system/fleet-controller, after upgrading from rancher 2.12.3 to 2.13.0 (fleet: 0.13.4 → 0.14.0)

{
  "level": "info",
  "ts": "2025-12-17T22:45:46Z",
  "logger": "bundle",
  "msg": "failed to create a bundledeployment, skipping and requeuing: failed to get content resource: Content.fleet.cattle.io \"s-a3b7c30971c30362772036db4c8275eaa7e3ce5c1cdafc8aa0a061ad2aaa5\" not found",
  "controller": "bundle",
  "controllerGroup": "fleet.cattle.io",
  "controllerKind": "Bundle",
  "Bundle": {
    "name": "edge-catalog-uiat-1-app-catalog-ci",
    "namespace": "fleet-default"
  },
  "namespace": "fleet-default",
  "name": "edge-catalog-uiat-1-app-catalog-ci",
  "reconcileID": "28e0cd4e-e85d-44fd-b1d4-cd4952d8986b",
  "gitrepo": "edge-catalog-uiat-1",
  "commit": "13f0314a5c30fd3540800bec326557bbfd267822",
  "manifestID": "s-a3b7c30971c30362772036db4c8275eaa7e3ce5c1cdafc8aa0a061ad2aaa5",
  "bundledeployment": "edge-catalog-uiat-1-app-catalog-ci"
}

{
  "level": "error",
  "ts": "2025-12-17T22:45:46Z",
  "msg": "Reconciler error",
  "controller": "bundle",
  "controllerGroup": "fleet.cattle.io",
  "controllerKind": "Bundle",
  "Bundle": {
    "name": "edge-catalog-uiat-1-app-catalog-ci",
    "namespace": "fleet-default"
  },
  "namespace": "fleet-default",
  "name": "edge-catalog-uiat-1-app-catalog-ci",
  "reconcileID": "28e0cd4e-e85d-44fd-b1d4-cd4952d8986b",
  "error": "failed to create bundle deployment: failed to get content resource: Content.fleet.cattle.io \"s-a3b7c30971c30362772036db4c8275eaa7e3ce5c1cdafc8aa0a061ad2aaa5\" not found",
  "errorCauses": [
    {
      "error": "failed to create bundle deployment: failed to get content resource: Content.fleet.cattle.io \"s-a3b7c30971c30362772036db4c8275eaa7e3ce5c1cdafc8aa0a061ad2aaa5\" not found"
    },
    {
      "error": "failed to create bundle deployment: failed to get content resource: Content.fleet.cattle.io \"s-a3b7c30971c30362772036db4c8275eaa7e3ce5c1cdafc8aa0a061ad2aaa5\" not found"
    },
    {
      "error": "failed to create bundle deployment: failed to get content resource: Content.fleet.cattle.io \"s-a3b7c30971c30362772036db4c8275eaa7e3ce5c1cdafc8aa0a061ad2aaa5\" not found"
    },
    {
      "error": "failed to create bundle deployment: failed to get content resource: Content.fleet.cattle.io \"s-a3b7c30971c30362772036db4c8275eaa7e3ce5c1cdafc8aa0a061ad2aaa5\" not found"
    },
    // many more identical entries follow, possibly one per cluster.
  ],
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:474\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:421\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:296"
}

I have also attached a full log related to the bundle edge-catalog-uiat-1-app-catalog-ci here:

fleet-controller-676bcc6b85-bqp48_fleet-controller (1) - filtered.zip

aDisplayName avatar Dec 17 '25 22:12 aDisplayName

@aDisplayName Those logs attached seem to be truncated and don't include the initial Bundle reconciliation. Do you have the full logs?

0xavi0 avatar Dec 18 '25 13:12 0xavi0

If at some point there's a mismatch between the YAML published on Git and Rancher (e.g. the YAML has a different resourceVersion, a field of a different type, or a wrong format), then the bundle stops trying to update.

@marytlf Could you please share one of those yaml files? (or any example yaml that does trigger the error)

0xavi0 avatar Dec 18 '25 13:12 0xavi0

@0xavi0 Sure, you can get it from here and here; they're the same type of deployment, I just use it in two different downstream clusters (both fail).

marytlf avatar Dec 18 '25 13:12 marytlf

Thanks @marytlf !

I managed to recreate the issue in v0.14. I think the internal cache gets out of sync for still unknown reasons, but may I ask whether that's a real Deployment? You are deploying the rancher-agent, which is managed by Rancher and Helm; that's why you get the errors shown in the UI.

While I don't think that's supported, I'm still curious why Fleet tries to get Content resources that are outdated.

0xavi0 avatar Dec 18 '25 15:12 0xavi0

@aDisplayName Those logs attached seem to be truncated and don't include the initial Bundle reconciliation. Do you have the full logs?

No, I could not get earlier logs. I'll create a new bundle and let it run for a few days.

aDisplayName avatar Dec 18 '25 16:12 aDisplayName

I think the Content errors are expected and should be temporary. The Bundle controller creates the Content resource and, right after that, creates the BundleDeployments that use that Content. The Bundle controller tries to add a finalizer for each BundleDeployment in the Content resource, and sometimes the Content resource is still not fully created in the cluster.

As I said, that should be a temporary error and as soon as the Content resource is fully available the BundleDeployments should be created without any problem.

As I mentioned earlier: you can't deploy stuff that was already deployed and owned by Helm.
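
A quick way to confirm the error is only transient is to check whether the Content resource named in the error eventually shows up; Contents are cluster-scoped as far as I can see, and the ID below is the manifestID from the error messages above:

# If it is just a race, this should return the resource shortly after the Bundle is reconciled.
kubectl get contents.fleet.cattle.io \
  s-0b1bb5471d748dda0f2b9b7e8ea5f7fbfb647d23f9ad9b16257d249c8ac34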

0xavi0 avatar Dec 18 '25 16:12 0xavi0

Here is a new set of fleet-controller logs. I've filtered out the logs not related to the bundle we were having the problem with.

filtered-20251219-095451_sorted_ts.zip

Some key timestamps (all times in UTC):

  • Removed the bundle (edge-catalog-uiat-2-app-catalog-ci) and the GitRepo (edge-catalog-uiat-2) it belongs to at 2025-12-18 23:40 (log not included).
  • Added a new GitRepo (edge-catalog-uiat-3) at 2025-12-18T23:44Z.
  • BundleDeployment creation started, based on commit 33f8f4b19a8b83734b89541afc716712bc20a57f, at 2025-12-18T23:44:45.669979769Z.
  • BundleDeployment creation started to succeed from 2025-12-18T23:44:45.788225815Z.
  • New commit 6a64353a41415b74b1967b7dea2e443eb99d9b3a pushed to the source Git repository at 2025-12-19T00:22.
  • New commit picked up by gitjob and the fleet controller:
    {
      "payload": {
        "Bundle": {
          "name": "edge-catalog-uiat-3-app-catalog-ci",
          "namespace": "fleet-default"
        },
        "commit": "6a64353a41415b74b1967b7dea2e443eb99d9b3a",
        "controller": "bundle",
        "controllerGroup": "fleet.cattle.io",
        "controllerKind": "Bundle",
        "gitrepo": "edge-catalog-uiat-3",
        "level": "info",
        "logger": "bundle",
        "msg": "requeue event, retrying since bundle values secret has changed, expected hash \"ac8f8395646effd07496077f6edcfba05a7a0a2867384dbd0f2c65903ccaafd8\", calculated \"5f24995e180d0919f33088a8b17fd55943a4682e9113fafabeed2569d34342a6\"",
        "name": "edge-catalog-uiat-3-app-catalog-ci",
        "namespace": "fleet-default",
        "reconcileID": "347f9766-c876-4451-acc9-d06171055710",
        "ts": "2025-12-19T00:22:26Z"
      },
      "ts": "2025-12-19T00:22:26.995598932Z"
    }
    
  • All BundleDeployment updates have failed since then (see the check sketched below).
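
This is roughly the check behind that last point; the label names are taken from the BundleDeployment metadata in the logs, and .status.commit is my best guess of where the GitRepo records the picked-up commit, so adjust if your version differs:

# Commit the GitRepo has picked up vs. the commit each BundleDeployment is still on;
# if updates are stuck, the BundleDeployments should still show 33f8f4b... while the
# GitRepo already shows 6a64353....
kubectl get gitrepo edge-catalog-uiat-3 -n fleet-default -o jsonpath='{.status.commit}{"\n"}'
kubectl get bundledeployments -A \
  -l fleet.cattle.io/bundle-name=edge-catalog-uiat-3-app-catalog-ci \
  -L fleet.cattle.io/commit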

@kkaempf @0xavi0

  • Rancher Server: v2.13.0
  • Helm: v3.19.0-rancher1
  • Machine: v0.15.0-rancher137
  • fleet-controller: 0.14.0
  • Downstream fleet-agent: 0.14.0

aDisplayName avatar Dec 19 '25 18:12 aDisplayName

One more comment: in our Rancher setup there are 175 imported clusters, but only 5-6 are connected once per day, for a few hours during working hours (UTC-6). The rest of the clusters are always in "Unavailable" or "Pending" status.

aDisplayName avatar Dec 19 '25 18:12 aDisplayName