
Multiple fleet agents created when deploying Rancher 2.12-head

Open mmartin24 opened this issue 9 months ago • 3 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

Issue

Multiple fleet agents are deployed when deploying Rancher v2.12-b55fb8c0a326d73fbb3dec30d39d9842901c4db4-head and Fleet fleet:106.0.0+up0.12.0

Easily reproducible by installing the latest Rancher version with Fleet.

Steps to reproduce

Deploy k3d and the latest Rancher. I used the following:

Click to expand installation steps
# Go to the main branch of the fleet repo
# k3d prep
./dev/build-fleet 
./dev/setup-k3d
k3d image import rancher/fleet-agent:dev rancher/fleet:dev -m direct -c upstream

# Cert Manager installation
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm upgrade --install cert-manager --namespace cert-manager jetstack/cert-manager \
  --create-namespace \
  --set installCRDs=true \
  --set "extraArgs[0]=--enable-certificate-owner-ref=true" \
  --wait
sleep 8

# Rancher install
export MY_IP='' # Add local IP here
export SYSTEM_DOMAIN="${MY_IP}.nip.io"
export RANCHER_USER=admin RANCHER_PASSWORD=password
export RANCHER_URL=https://${MY_IP}.nip.io/dashboard

helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo update

helm upgrade --install rancher rancher-latest/rancher \
--devel \
--set "rancherImageTag=head" \
--namespace cattle-system --create-namespace \
--set "extraEnv[1].name=CATTLE_AGENT_IMAGE" \
--set "extraEnv[0].name=CATTLE_SERVER_URL" \
--set hostname=$SYSTEM_DOMAIN \
--set bootstrapPassword=password \
--set replicas=1 \
--set agentTLSMode=system-store \
\
--set "extraEnv[1].value=rancher/rancher-agent:head" \
\
--wait

Observe how multiple agents are created from the start.

> k get pods -A
NAMESPACE                         NAME                                        READY   STATUS      RESTARTS   AGE
cattle-fleet-local-system         fleet-agent-57c4fd678d-scqqj                1/1     Running     0          111m
cattle-fleet-local-system         fleet-agent-5ccb446795-6nq6j                1/1     Running     0          111m
cattle-fleet-local-system         fleet-agent-6976b68ccd-lh7jp                1/1     Running     0          111m
cattle-fleet-local-system         fleet-agent-6f7cb88cd5-dtn42                1/1     Running     0          111m
cattle-fleet-local-system         fleet-agent-7c9fbc5ddb-5wmql                1/1     Running     0          111m
cattle-fleet-system               fleet-cleanup-clusterregistrations-p489j    0/1     Completed   0          111m
cattle-fleet-system               fleet-cleanup-clusterregistrations-qdz7h    0/1     Completed   0          111m
cattle-fleet-system               fleet-cleanup-clusterregistrations-r72z2    0/1     Completed   0          111m
cattle-fleet-system               fleet-cleanup-clusterregistrations-smzqx    0/1     Completed   0          111m
cattle-fleet-system               fleet-controller-64dcdbf9f8-rcdth           3/3     Running     0          112m
cattle-fleet-system               gitjob-79d965746b-vw5fx                     1/1     Running     0          112m
cattle-provisioning-capi-system   capi-controller-manager-5fcbfb9f95-5mtxw    1/1     Running     0          110m
cattle-provisioning-capi-system   rancher-provisioning-capi-patch-sa-45jnw    0/1     Completed   0          110m
cattle-system                     dashboard-shell-pbnxs                       2/2     Running     0          5m28s
cattle-system                     helm-operation-5pzph                        0/2     Completed   0          110m
cattle-system                     helm-operation-95cqr                        0/2     Completed   0          111m
cattle-system                     helm-operation-9vms4                        0/2     Completed   0          111m
cattle-system                     helm-operation-9xc5q                        0/2     Completed   0          111m
cattle-system                     helm-operation-g6glg                        0/2     Completed   0          110m
cattle-system                     helm-operation-gvkd6                        0/2     Completed   0          110m
cattle-system                     helm-operation-hh5hp                        0/2     Completed   0          110m
cattle-system                     helm-operation-hnr9s                        0/2     Completed   0          112m
cattle-system                     helm-operation-jhmvd                        0/2     Completed   0          112m
cattle-system                     helm-operation-lqvqb                        0/2     Completed   0          111m
cattle-system                     helm-operation-rrnh5                        0/2     Completed   0          110m
cattle-system                     helm-operation-wgtpz                        0/2     Completed   0          110m
cattle-system                     rancher-8659474b69-dn4d5                    1/1     Running     0          113m
cattle-system                     rancher-webhook-6454557c9f-9jff9            1/1     Running     0          111m
cattle-system                     system-upgrade-controller-d45b67dc9-ksm9h   1/1     Running     0          110m
cert-manager                      cert-manager-8576d99cc8-jbbkt               1/1     Running     0          113m
cert-manager                      cert-manager-cainjector-664b5878d6-nml22    1/1     Running     0          113m
cert-manager                      cert-manager-webhook-6ddb7bd6c5-2rvrj       1/1     Running     0          113m
kube-system                       coredns-7f6545b9bb-6pb5x                    1/1     Running     0          114m
kube-system                       helm-install-traefik-crd-tqjn4              0/1     Completed   0          114m
kube-system                       helm-install-traefik-jqpj7                  0/1     Completed   2          114m
kube-system                       local-path-provisioner-595dcfc56f-xptst     1/1     Running     0          114m
kube-system                       metrics-server-cdcc87586-twjnw              1/1     Running     0          114m
kube-system                       svclb-traefik-e3617378-6j8bg                2/2     Running     0          114m
kube-system                       traefik-d7c9c5778-4hqvw                     1/1     Running     0          114m

Note: checked in Rancher 2.11.0-rc8 and it works correctly there.
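A quick way to confirm that these pods belong to distinct ReplicaSets is to extract the pod-template hash (the third dash-separated field) from the pod names. A sketch over the sample listing above; the name-splitting logic is an assumption about the `fleet-agent-<hash>-<suffix>` naming scheme:

```shell
# Pod names taken from the `kubectl get pods -A` output above
pods='fleet-agent-57c4fd678d-scqqj
fleet-agent-5ccb446795-6nq6j
fleet-agent-6976b68ccd-lh7jp
fleet-agent-6f7cb88cd5-dtn42
fleet-agent-7c9fbc5ddb-5wmql'

# Third '-'-separated field is the pod-template hash; each distinct hash
# corresponds to one live ReplicaSet. A healthy install should show 1.
echo "$pods" | awk -F- '{print $3}' | sort -u | wc -l
```

On the listing above this prints 5, i.e. five concurrent fleet-agent ReplicaSets instead of one.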


Expected Behavior

Only one fleet agent should appear.

Steps To Reproduce

No response

Environment

- Architecture: amd64
- Fleet Version: fleet:106.0.0+up0.12.0
- Cluster:
  - Provider: k3d
  - Options:
  - Kubernetes Version: `v1.30.8+k3s1`

Logs


Anything else?

No response

mmartin24 avatar Mar 27 '25 11:03 mmartin24

I think the garbage collector is broken. That would explain why it is not cleaning up the ReplicaSets.

% docker logs k3d-upstream-server-0
W0401 10:38:37.276688      39 reflector.go:561] k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1.PartialObjectMetadata: Internal error occurred: failed to list tokens: unable to parse requirement: values[0][authn.management.cattle.io/token-userId]: Invalid value: "system:kube-controller-manager": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')
E0401 10:38:37.276734      39 reflector.go:158] "Unhandled Error" err="k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: failed to list tokens: unable to parse requirement: values[0][authn.management.cattle.io/token-userId]: Invalid value: \"system:kube-controller-manager\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')"
E0401 10:38:40.355476      39 wrap.go:53] "Timeout or abort while handling" method="GET" URI="/apis/ext.cattle.io/v1/tokens?allowWatchBookmarks=true&resourceVersion=54460&timeout=5s&timeoutSeconds=525&watch=true" auditID="92eb45ba-95de-4064-b1f5-610be559f68b"
E0401 10:38:45.356217      39 wrap.go:53] "Timeout or abort while handling" method="GET" URI="/apis/ext.cattle.io/v1/tokens?allowWatchBookmarks=true&resourceVersion=54460&timeout=5s&timeoutSeconds=403&watch=true" auditID="d8623bbf-6f88-4698-a43e-79ed1b8be877"
E0401 10:38:46.082556      39 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated]"
E0401 10:38:50.074991      39 shared_informer.go:316] "Unhandled Error" err="unable to sync caches for garbage collector"
E0401 10:38:50.076047      39 garbagecollector.go:268] "Unhandled Error" err="timed out waiting for dependency graph builder sync during GC sync (attempt 158)"
I0401 10:38:50.178660      39 shared_informer.go:313] Waiting for caches to sync for garbage collector

The same happens to other deployments, e.g. a simple nginx deployment that is created and deleted with kubectl a few times.

As soon as I scale the Rancher controller down to 0 replicas, k8s garbage collection cleans up the pods/ReplicaSets. That makes sense, as Rancher Head has a new tokens CRD, which seems to have an invalid label?
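The label-value rule the apiserver complains about can be checked directly. A minimal sketch using the validation regex quoted verbatim in the error log above; the two test values are the username from the log and a hypothetical colon-free variant:

```shell
#!/usr/bin/env bash
# Label-value regex from the apiserver error message above
# (anchored here, since bash's =~ does substring search)
re='^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$'

for v in 'system:kube-controller-manager' 'kube-controller-manager'; do
  if [[ "$v" =~ $re ]]; then
    echo "valid:   $v"
  else
    echo "invalid: $v"   # the ':' in the username is not a legal label-value character
  fi
done
```

This shows that `system:kube-controller-manager` can never be a valid label value, so any code path that feeds the username into a label selector will fail validation.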

manno avatar Apr 01 '25 10:04 manno


Commented offline. Maybe related to: https://github.com/rancher/rancher/pull/49616 https://github.com/kubernetes/kubernetes/pull/125796

mmartin24 avatar Apr 01 '25 11:04 mmartin24

makes sense as Rancher Head has a new tokens CRD, which seems to have an invalid label?

The invalid label is the system:kube-controller-manager username from a list request leaking into a label selector that is used to query the backend storage, which then fails kube's label validation there.

andreas-kupries avatar Apr 01 '25 12:04 andreas-kupries