Incomplete Agent Registration Stops BundleDeployment Creation

Open • manno opened this issue 6 months ago • 2 comments

I observed this error when some agents would not start properly. The cluster namespace was set on the Cluster resource, but the namespace did not exist.

The effect is that no BundleDeployments are created for clusters that come after the incomplete one in the sort order.

To reproduce, create a Cluster resource and set Status.Namespace, but don't create the namespace it names. Alternatively, delete the namespace of an existing cluster.

Workaround: Make any change to the affected clusters, such as adding a label. This triggers the handler again, which creates the namespace.

Actual error:

fleet-controller-65878c55bc-4sfvx fleet-controller {
  "level": "error",
  "ts": "2025-06-17T15:10:51Z",
  "logger": "bundle",
  "msg": "Reconcile failed to create or update bundledeployment",
  "controller": "bundle",
  "controllerGroup": "fleet.cattle.io",
  "controllerKind": "Bundle",
  "Bundle": {
    "name": "simple-simple-chart",
    "namespace": "fleet-default"
  },
  "namespace": "fleet-default",
  "name": "simple-simple-chart",
  "reconcileID": "7e3155b2-2dd1-4d68-9517-5c05ad16ffe2",
  "gitrepo": "simple",
  "commit": "990e73f981599dfa5c9a86e0cf0fab5307294f34",
  "manifestID": "s-e4322f59a9c7b048553c01f1bb419415afc9e659d0f8f65ce2c414ba60399",
  "bundledeployment": {
    "metadata": {
      "name": "simple-simple-chart",
      "namespace": "cluster-fleet-default-downstream-1002-ab4388254ea0",
      "creationTimestamp": null,
      "labels": {
        "fleet.cattle.io/bundle-name": "simple-simple-chart",
        "fleet.cattle.io/bundle-namespace": "fleet-default",
        "fleet.cattle.io/cluster": "downstream-1002",
        "fleet.cattle.io/cluster-namespace": "fleet-default",
        "fleet.cattle.io/commit": "990e73f981599dfa5c9a86e0cf0fab5307294f34",
        "fleet.cattle.io/managed": "true",
        "fleet.cattle.io/repo-name": "simple"
      },
      "finalizers": [
        "fleet.cattle.io/bundle-deployment-finalizer"
      ]
    },
    "spec": {
      "stagedOptions": {
        "namespace": "simple",
        "helm": {
          "chart": "config-chart",
          "takeOwnership": true
        },
        "ignore": {}
      },
      "stagedDeploymentID": "s-e4322f59a9c7b048553c01f1bb419415afc9e659d0f8f65ce2c414ba60399:bfd04481357ba785826df113b6dfc57fd1ca056ccb36e683e6507a0261f26d18",
      "options": {
        "namespace": "simple",
        "helm": {
          "chart": "config-chart",
          "takeOwnership": true
        },
        "ignore": {}
      },
      "deploymentID": "s-e4322f59a9c7b048553c01f1bb419415afc9e659d0f8f65ce2c414ba60399:bfd04481357ba785826df113b6dfc57fd1ca056ccb36e683e6507a0261f26d18",
      "valuesHash": "d17af820d450e45949052174f2ec303cfe46875266445fe5815eca79535ba54c"
    },
    "status": {
      "display": {},
      "resourceCounts": {
        "ready": 0,
        "desiredReady": 0,
        "waitApplied": 0,
        "modified": 0,
        "orphaned": 0,
        "missing": 0,
        "unknown": 0,
        "notReady": 0
      }
    }
  },
  "deploymentID": "s-e4322f59a9c7b048553c01f1bb419415afc9e659d0f8f65ce2c414ba60399:bfd04481357ba785826df113b6dfc57fd1ca056ccb36e683e6507a0261f26d18",
  "operation": "unchanged",
  "error": "namespaces \"cluster-fleet-default-downstream-1002-ab4388254ea0\" not found",
  "stacktrace": "github.com/rancher/fleet/internal/cmd/controller/reconciler.(*BundleReconciler).createBundleDeployment\n\t/home/runner/_work/fleet/fleet/internal/cmd/controller/reconciler/bundle_controller.go:515\ngithub.com/rancher/fleet/internal/cmd/controller/reconciler.(*BundleReconciler).Reconcile\n\t/home/runner/_work/fleet/fleet/internal/cmd/controller/reconciler/bundle_controller.go:366\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202"
}

Maybe the agentmanagement controller (cluster/controller.go -> OnClusterChange) missed a reconcile? It did update the status, though, but failed to create the namespace.

manno • Jun 23 '25 08:06

I think I observed this behavior in CAAPF tests once. The assumed reason was that the cleanup controller removed the namespace because it didn't find a Cluster resource in the cache.

Danil-Grigorev • Jun 23 '25 12:06

I think this just happened to me on a fresh k3d cluster. Sadly, I didn't capture the cleanup controller's output.

time="2025-07-04T13:13:14Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
time="2025-07-04T13:13:14Z" level=info msg="waiting on secret for service account cattle-fleet-system/fleet-controller-bootstrap"
time="2025-07-04T13:13:14Z" level=info msg="Waiting for service account token key to be populated for secret cattle-fleet-system/fleet-controller-bootstrap-token"
time="2025-07-04T13:13:16Z" level=info msg="API server config changed, trigger cluster import for cluster fleet-local/local"
time="2025-07-04T13:13:16Z" level=warning msg="cluster fleet-local/local: could not check for config changes" error="Operation cannot be fulfilled on clusters.fleet.cattle.io \"local\": the object has been modified; please apply your changes to the latest version and try again"
time="2025-07-04T13:13:16Z" level=info msg="ClusterRegistrationToken SA does not exist import-token-local-d9707885-65b9-4364-92f6-a0142ae1d624"
time="2025-07-04T13:13:16Z" level=info msg="Update agent bundle for cluster fleet-local/local"
time="2025-07-04T13:13:16Z" level=info msg="Waiting for service account token key to be populated for secret fleet-local/import-token-local-d9707885-65b9-4364-92f6-a0142ae1d624-token"
time="2025-07-04T13:13:18Z" level=info msg="Update agent bundle for cluster fleet-local/local"
time="2025-07-04T13:13:18Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
time="2025-07-04T13:13:19Z" level=info msg="Cluster import for 'fleet-local/local'. Deployed new agent"
time="2025-07-04T13:13:19Z" level=info msg="Update agent bundle for cluster fleet-local/local"
time="2025-07-04T13:13:19Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
time="2025-07-04T13:13:19Z" level=info msg="Cluster import for 'fleet-local/local'. Deployed new agent"
time="2025-07-04T13:13:19Z" level=info msg="Update agent bundle for cluster fleet-local/local"
time="2025-07-04T13:13:19Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
time="2025-07-04T13:13:20Z" level=info msg="Cluster import for 'fleet-local/local'. Deployed new agent"
time="2025-07-04T13:13:21Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: serviceaccounts \"request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba\" is forbidden: unable to create new content in namespace cluster-fleet-local-local-1a3d67d0a899 because it is being terminated, requeuing"
time="2025-07-04T13:13:21Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: serviceaccounts \"request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba\" is forbidden: unable to create new content in namespace cluster-fleet-local-local-1a3d67d0a899 because it is being terminated, requeuing"
time="2025-07-04T13:13:21Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: serviceaccounts \"request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba\" is forbidden: unable to create new content in namespace cluster-fleet-local-local-1a3d67d0a899 because it is being terminated, requeuing"
time="2025-07-04T13:13:21Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: serviceaccounts \"request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba\" is forbidden: unable to create new content in namespace cluster-fleet-local-local-1a3d67d0a899 because it is being terminated, requeuing"
time="2025-07-04T13:13:21Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: serviceaccounts \"request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba\" is forbidden: unable to create new content in namespace cluster-fleet-local-local-1a3d67d0a899 because it is being terminated, requeuing"
time="2025-07-04T13:13:21Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: serviceaccounts \"request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba\" is forbidden: unable to create new content in namespace cluster-fleet-local-local-1a3d67d0a899 because it is being terminated, requeuing"
time="2025-07-04T13:13:21Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: serviceaccounts \"request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba\" is forbidden: unable to create new content in namespace cluster-fleet-local-local-1a3d67d0a899 because it is being terminated, requeuing"
time="2025-07-04T13:13:21Z" level=info msg="Namespace assigned to cluster 'fleet-local/local' enqueues cluster registration 'fleet-local/request-rjh5k'"
time="2025-07-04T13:13:21Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: serviceaccounts \"request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba\" is forbidden: unable to create new content in namespace cluster-fleet-local-local-1a3d67d0a899 because it is being terminated, requeuing"
time="2025-07-04T13:13:23Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: namespaces \"cluster-fleet-local-local-1a3d67d0a899\" not found, requeuing"
time="2025-07-04T13:13:25Z" level=error msg="error syncing 'fleet-local/request-rjh5k': handler cluster-registration: failed to create cluster-fleet-local-local-1a3d67d0a899/request-rjh5k-73787782-cfb9-480b-90df-82a04edb04ba /v1, Kind=ServiceAccount for cluster-registration fleet-local/request-rjh5k: namespaces \"cluster-fleet-local-local-1a3d67d0a899\" not found, requeuing"
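
Read in order, the log above shows two distinct failures for the same namespace: first the ServiceAccount create is forbidden because the namespace is being terminated, and from 13:13:23 on the namespace is not found at all. A small helper along these lines could separate the two phases when triaging such logs. It is only a sketch: `classifyNamespaceError` is a hypothetical name, and the substring matching is an assumption based on the messages in this log, not a documented error format.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyNamespaceError labels an error message by the namespace state it
// implies. The matched substrings are taken from the log lines in this issue.
func classifyNamespaceError(msg string) string {
	switch {
	case strings.Contains(msg, "because it is being terminated"):
		return "terminating"
	case strings.Contains(msg, "namespaces") && strings.Contains(msg, "not found"):
		return "not-found"
	default:
		return "other"
	}
}

func main() {
	// Shortened copies of the two error shapes seen in the log.
	msgs := []string{
		`unable to create new content in namespace cluster-fleet-local-local-1a3d67d0a899 because it is being terminated, requeuing`,
		`namespaces "cluster-fleet-local-local-1a3d67d0a899" not found, requeuing`,
		`the object has been modified; please apply your changes to the latest version and try again`,
	}
	for _, m := range msgs {
		fmt.Println(classifyNamespaceError(m))
	}
	// Prints:
	// terminating
	// not-found
	// other
}
```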

manno • Jul 04 '25 13:07