Fleet agent installation is stuck

Open manno opened this issue 3 months ago • 3 comments

The fleet agent installation/update gets stuck. This happened when the server was under heavy load.

Workaround: Delete the agent's helm secrets and restart it or force redeploy it.
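
The secret-deletion half of the workaround can be sketched in shell as below. This is a minimal sketch, assuming Helm 3's default secret storage, which labels each release secret with `owner=helm` and `name=<release>`. The release name and namespace are copied from the `helm list` output in this issue, and the `helm_secret_selector` and `run` helpers are made up for illustration. By default the script only prints the `kubectl` commands; set `RUN=1` to execute them. The agent restart is left out because the agent's workload kind and name may differ between Fleet versions.

```shell
#!/bin/sh
# Sketch of the "delete the agent's helm secrets" workaround.
# Assumptions: Helm 3 secret storage (labels owner=helm, name=<release>);
# RELEASE and NAMESPACE default to the values shown in this issue.
RELEASE="${RELEASE:-fleet-agent-d0-k3k-downstream001-downstream0045}"
NAMESPACE="${NAMESPACE:-cattle-fleet-system}"

# Build the label selector that matches every stored revision of a release.
helm_secret_selector() {
  printf 'owner=helm,name=%s' "$1"
}

# Print the command by default; execute it only when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

selector="$(helm_secret_selector "$RELEASE")"

# 1. Review which release secrets exist before deleting anything.
run kubectl get secrets -n "$NAMESPACE" -l "$selector" --show-labels

# 2. Delete them so Helm no longer considers the release name in use.
run kubectl delete secrets -n "$NAMESPACE" -l "$selector"
```

Run it on the downstream cluster where the agent release lives, and check the printed commands before setting `RUN=1`.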

{"level":"info","ts":"2025-09-16T14:04:31Z","logger":"bundledeployment.helm-deployer.install","msg":"Installing helm release","controller":"bundledeployment","controllerGroup":"fleet.cattle.io","controllerKind":"BundleDeployment","BundleDeployment":{"name":"fleet-agent-d0-k3k-downstream001-downstream0045","namespace":"cluster-fleet-default-d0-k3k-downstream001-downstream0045-f8390"},"namespace":"cluster-fleet-default-d0-k3k-downstream001-downstream0045-f8390","name":"fleet-agent-d0-k3k-downstream001-downstream0045","reconcileID":"fd2484ac-45c5-46fb-bc4b-0d76434786e8","commit":"","dryRun":false}                                                                                                   
{"level":"error","ts":"2025-09-16T14:04:31Z","msg":"Reconciler error","controller":"bundledeployment","controllerGroup":"fleet.cattle.io","controllerKind":"BundleDeployment","BundleDeployment":{"name":"fleet-agent-d0-k3k-downstream001-downstream0045","namespace":"cluster-fleet-default-d0-k3k-downstream001-downstream0045-f8390"},"namespace":"cluster-fleet-default-d0-k3k-downstream001-downstream0045-f8390","name":"fleet-agent-d0-k3k-downstream001-downstream0045","reconcileID":"fd2484ac-45c5-46fb-bc4b-0d76434786e8","error":"failed deploying bundle: cannot re-use a name that is still in use","errorCauses":[{"error":"failed deploying bundle: cannot re-use a name that is still in use"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:353\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202"}                                                                                                

`helm list` shows the release stuck in pending-upgrade:

# helm list -n cattle-fleet-system --all
NAME                                           	NAMESPACE          	REVISION	UPDATED                                	STATUS         	CHART                                                                                                                 	APP VERSION
fleet-agent-d0-k3k-downstream001-downstream0045	cattle-fleet-system	2       	2025-09-16 13:23:38.061597057 +0000 UTC	pending-upgrade	fleet-agent-d0-k3k-downstream001-downstream0045-v0.0.0+s-c80a5ad64879318fcc81259f14f38c4df7ae3e9900d7e19b9633f67c24e6f

We fixed pending-install in https://github.com/rancher/fleet/pull/4065/files

manno • Sep 17 '25 10:09

Testing with a version that has https://github.com/manno/fleet/commit/e90365d487dbe4fbff4d4db430d63623c1d5f4cd

I see the fleet-agent is deployed (manager-initiated registration) and can handle BundleDeployments: it deployed 50 workload bundles. However, it can't deploy its own helm chart and adopt itself, so the cluster is stuck at 50/51 bundles.

{
  "level": "info",
  "ts": "2025-09-24T15:28:05Z",
  "logger": "bundledeployment.helm-deployer.install",
  "msg": "Installing helm release",
  "controller": "bundledeployment",
  "controllerGroup": "fleet.cattle.io",
  "controllerKind": "BundleDeployment",
  "BundleDeployment": {
    "name": "fleet-agent-d0-k3k-downstream018-downstream1727",
    "namespace": "cluster-fleet-default-d0-k3k-downstream018-downstream1727-718b0"
  },
  "namespace": "cluster-fleet-default-d0-k3k-downstream018-downstream1727-718b0",
  "name": "fleet-agent-d0-k3k-downstream018-downstream1727",
  "reconcileID": "38a8012a-1f5c-457e-b62e-86b061eebe0f",
  "commit": "",
  "dryRun": false
}
{
  "level": "error",
  "ts": "2025-09-24T15:28:05Z",
  "msg": "Reconciler error",
  "controller": "bundledeployment",
  "controllerGroup": "fleet.cattle.io",
  "controllerKind": "BundleDeployment",
  "BundleDeployment": {
    "name": "fleet-agent-d0-k3k-downstream018-downstream1727",
    "namespace": "cluster-fleet-default-d0-k3k-downstream018-downstream1727-718b0"
  },
  "namespace": "cluster-fleet-default-d0-k3k-downstream018-downstream1727-718b0",
  "name": "fleet-agent-d0-k3k-downstream018-downstream1727",
  "reconcileID": "38a8012a-1f5c-457e-b62e-86b061eebe0f",
  "error": "failed deploying bundle: cannot re-use a name that is still in use",
  "errorCauses": [
    {
      "error": "failed deploying bundle: cannot re-use a name that is still in use"
    }
  ],
  "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:474\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:421\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:296"
}

After increasing redeployAgentGeneration:

% helm status -n cattle-fleet-system  fleet-agent-d0-k3k-downstream018-downstream1727
NAME: fleet-agent-d0-k3k-downstream018-downstream1727
LAST DEPLOYED: Wed Sep 24 14:45:20 2025
NAMESPACE: cattle-fleet-system
STATUS: superseded
REVISION: 2
TEST SUITE: None

% helm history -n cattle-fleet-system  fleet-agent-d0-k3k-downstream018-downstream1727
REVISION	UPDATED                 	STATUS    	CHART                                                                                                                 	APP VERSION	DESCRIPTION
2       	Wed Sep 24 14:45:20 2025	superseded	fleet-agent-d0-k3k-downstream018-downstream1727-v0.0.0+s-a394e34ef708418c7ef06c155ff7862261616418d1c7e434c999e0080455f	           	Upgrade complete

The cluster status shows ErrApplied(1) [Bundle fleet-agent-d0-k3k-downstream018-downstream1727: cannot re-use a name that is still in use].
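
For reference, bumping `redeployAgentGeneration` amounts to a merge patch on the Fleet `Cluster` resource in the upstream cluster. The sketch below only prints the commands and does not run them. The cluster name and the `fleet-default` namespace are inferred from the BundleDeployment namespace in the log above and are assumptions, as is the hypothetical `redeploy_patch` helper. Pick a generation higher than the current value.

```shell
#!/bin/sh
# Sketch: print the commands for forcing an agent redeploy.
# Assumptions: CLUSTER and CLUSTER_NS are guessed from the log above;
# GENERATION must be higher than the value currently stored in the spec.
CLUSTER="${CLUSTER:-d0-k3k-downstream018-downstream1727}"
CLUSTER_NS="${CLUSTER_NS:-fleet-default}"
GENERATION="${GENERATION:-2}"

# Build the JSON merge patch for the given generation number.
redeploy_patch() {
  printf '{"spec":{"redeployAgentGeneration":%d}}' "$1"
}

# Command that reads the current generation.
printf '%s\n' "kubectl get clusters.fleet.cattle.io -n $CLUSTER_NS $CLUSTER -o jsonpath='{.spec.redeployAgentGeneration}'"

# Command that sets the new generation.
printf '%s\n' "kubectl patch clusters.fleet.cattle.io -n $CLUSTER_NS $CLUSTER --type=merge -p '$(redeploy_patch "$GENERATION")'"
```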

manno • Sep 24 '25 15:09

This also happened to me when upgrading Rancher from 2.11.3 to 2.12.0.

0xavi0 • Oct 17 '25 08:10

I tried to reproduce this on a single-node 1.31.6+k3s1 cluster: I deployed an nginx GitRepo, then upgraded via Helm from 2.11.2 to 2.11.3 and later to 2.12.0. The fleet agent seemed to be OK the whole time.

https://github.com/user-attachments/assets/50fc201e-7ed2-4627-b319-135cf0e8fd57

Here you can see 2.11.2:

Image

Later 2.11.3 (fleet-agent looks good)

Image

And finally to 2.12.0 (fleet-agent looks good)

Image

Not sure if perhaps I am missing something here.

mmartin24 • Nov 06 '25 15:11

Could not reproduce for now.

weyfonk • Nov 27 '25 13:11