Multiple fleet agents created when deploying Rancher 2.12-head
Is there an existing issue for this?
- [x] I have searched the existing issues
Current Behavior
Issue
Multiple Fleet agents are deployed when installing Rancher v2.12-b55fb8c0a326d73fbb3dec30d39d9842901c4db4-head with Fleet fleet:106.0.0+up0.12.0.
Easily reproducible by installing the latest Rancher version with Fleet.
Steps to reproduce
Deploy k3d and the latest Rancher. I used the following installation steps:
# Go to the main branch of the fleet repo
# k3d prep
./dev/build-fleet
./dev/setup-k3d
k3d image import rancher/fleet-agent:dev rancher/fleet:dev -m direct -c upstream
# Cert Manager installation
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm upgrade --install cert-manager --namespace cert-manager jetstack/cert-manager \
--create-namespace \
--set installCRDs=true \
--set "extraArgs[0]=--enable-certificate-owner-ref=true" \
--wait
sleep 8
# Rancher install
export MY_IP='' # Add local IP here
export SYSTEM_DOMAIN="${MY_IP}.nip.io"
export RANCHER_USER=admin RANCHER_PASSWORD=password
export RANCHER_URL=https://${MY_IP}.nip.io/dashboard
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo update
helm upgrade --install rancher rancher-latest/rancher \
--devel \
--set "rancherImageTag=head" \
--namespace cattle-system --create-namespace \
--set "extraEnv[1].name=CATTLE_AGENT_IMAGE" \
--set "extraEnv[0].name=CATTLE_SERVER_URL" \
--set hostname=$SYSTEM_DOMAIN \
--set bootstrapPassword=password \
--set replicas=1 \
--set agentTLSMode=system-store \
\
--set "extraEnv[1].value=rancher/rancher-agent:head" \
\
--wait
Observe how multiple agents are created from the start.
> k get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-fleet-local-system fleet-agent-57c4fd678d-scqqj 1/1 Running 0 111m
cattle-fleet-local-system fleet-agent-5ccb446795-6nq6j 1/1 Running 0 111m
cattle-fleet-local-system fleet-agent-6976b68ccd-lh7jp 1/1 Running 0 111m
cattle-fleet-local-system fleet-agent-6f7cb88cd5-dtn42 1/1 Running 0 111m
cattle-fleet-local-system fleet-agent-7c9fbc5ddb-5wmql 1/1 Running 0 111m
cattle-fleet-system fleet-cleanup-clusterregistrations-p489j 0/1 Completed 0 111m
cattle-fleet-system fleet-cleanup-clusterregistrations-qdz7h 0/1 Completed 0 111m
cattle-fleet-system fleet-cleanup-clusterregistrations-r72z2 0/1 Completed 0 111m
cattle-fleet-system fleet-cleanup-clusterregistrations-smzqx 0/1 Completed 0 111m
cattle-fleet-system fleet-controller-64dcdbf9f8-rcdth 3/3 Running 0 112m
cattle-fleet-system gitjob-79d965746b-vw5fx 1/1 Running 0 112m
cattle-provisioning-capi-system capi-controller-manager-5fcbfb9f95-5mtxw 1/1 Running 0 110m
cattle-provisioning-capi-system rancher-provisioning-capi-patch-sa-45jnw 0/1 Completed 0 110m
cattle-system dashboard-shell-pbnxs 2/2 Running 0 5m28s
cattle-system helm-operation-5pzph 0/2 Completed 0 110m
cattle-system helm-operation-95cqr 0/2 Completed 0 111m
cattle-system helm-operation-9vms4 0/2 Completed 0 111m
cattle-system helm-operation-9xc5q 0/2 Completed 0 111m
cattle-system helm-operation-g6glg 0/2 Completed 0 110m
cattle-system helm-operation-gvkd6 0/2 Completed 0 110m
cattle-system helm-operation-hh5hp 0/2 Completed 0 110m
cattle-system helm-operation-hnr9s 0/2 Completed 0 112m
cattle-system helm-operation-jhmvd 0/2 Completed 0 112m
cattle-system helm-operation-lqvqb 0/2 Completed 0 111m
cattle-system helm-operation-rrnh5 0/2 Completed 0 110m
cattle-system helm-operation-wgtpz 0/2 Completed 0 110m
cattle-system rancher-8659474b69-dn4d5 1/1 Running 0 113m
cattle-system rancher-webhook-6454557c9f-9jff9 1/1 Running 0 111m
cattle-system system-upgrade-controller-d45b67dc9-ksm9h 1/1 Running 0 110m
cert-manager cert-manager-8576d99cc8-jbbkt 1/1 Running 0 113m
cert-manager cert-manager-cainjector-664b5878d6-nml22 1/1 Running 0 113m
cert-manager cert-manager-webhook-6ddb7bd6c5-2rvrj 1/1 Running 0 113m
kube-system coredns-7f6545b9bb-6pb5x 1/1 Running 0 114m
kube-system helm-install-traefik-crd-tqjn4 0/1 Completed 0 114m
kube-system helm-install-traefik-jqpj7 0/1 Completed 2 114m
kube-system local-path-provisioner-595dcfc56f-xptst 1/1 Running 0 114m
kube-system metrics-server-cdcc87586-twjnw 1/1 Running 0 114m
kube-system svclb-traefik-e3617378-6j8bg 2/2 Running 0 114m
kube-system traefik-d7c9c5778-4hqvw 1/1 Running 0 114m
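For reference, this is roughly how the duplicated agent pods can be correlated with their owning ReplicaSets (just a sketch, using the namespace from the output above):
# List the fleet-agent Deployment(s) and all ReplicaSets in the namespace;
# several ReplicaSets that never get cleaned up would explain the extra pods.
kubectl get deploy,rs -n cattle-fleet-local-system -o wide
# Show which ReplicaSet owns each agent pod.
kubectl get pods -n cattle-fleet-local-system \
  -o custom-columns=POD:.metadata.name,OWNER:.metadata.ownerReferences[0].name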
Note: checked with Rancher 2.11.0-rc8, and it works fine there.
Expected Behavior
Only one fleet agent should appear.
Steps To Reproduce
No response
Environment
- Architecture: amd64
- Fleet Version: fleet:106.0.0+up0.12.0
- Cluster:
- Provider: k3d
- Options:
- Kubernetes Version: `v1.30.8+k3s1`
Logs
Anything else?
No response
I think the garbage collector is broken? That would explain why it is not cleaning up the ReplicaSets.
% docker logs k3d-upstream-server-0
W0401 10:38:37.276688 39 reflector.go:561] k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1.PartialObjectMetadata: Internal error occurred: failed to list tokens: unable to parse requirement: values[0][authn.management.cattle.io/token-userId]: Invalid value: "system:kube-controller-manager": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')
E0401 10:38:37.276734 39 reflector.go:158] "Unhandled Error" err="k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Internal error occurred: failed to list tokens: unable to parse requirement: values[0][authn.management.cattle.io/token-userId]: Invalid value: \"system:kube-controller-manager\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')"
E0401 10:38:40.355476 39 wrap.go:53] "Timeout or abort while handling" method="GET" URI="/apis/ext.cattle.io/v1/tokens?allowWatchBookmarks=true&resourceVersion=54460&timeout=5s&timeoutSeconds=525&watch=true" auditID="92eb45ba-95de-4064-b1f5-610be559f68b"
E0401 10:38:45.356217 39 wrap.go:53] "Timeout or abort while handling" method="GET" URI="/apis/ext.cattle.io/v1/tokens?allowWatchBookmarks=true&resourceVersion=54460&timeout=5s&timeoutSeconds=403&watch=true" auditID="d8623bbf-6f88-4698-a43e-79ed1b8be877"
E0401 10:38:46.082556 39 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, Token has been invalidated]"
E0401 10:38:50.074991 39 shared_informer.go:316] "Unhandled Error" err="unable to sync caches for garbage collector"
E0401 10:38:50.076047 39 garbagecollector.go:268] "Unhandled Error" err="timed out waiting for dependency graph builder sync during GC sync (attempt 158)"
I0401 10:38:50.178660 39 shared_informer.go:313] Waiting for caches to sync for garbage collector
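To double-check that the new tokens API is what the GC's dependency graph builder trips over (and not Fleet itself), these are the kind of checks I mean; the API path is taken from the log above, so treat this as a sketch:
# List the ext.cattle.io tokens resource directly, the same way the GC's
# metadata informer would; on this setup the endpoint times out / errors.
kubectl get --raw '/apis/ext.cattle.io/v1/tokens?limit=1'
# Watch the controller-manager on the k3d server node for repeated GC sync
# failures.
docker logs k3d-upstream-server-0 2>&1 | grep -i 'garbage collector' | tail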
The same happens with other deployments, e.g. a simple nginx deployment that is created and deleted with kubectl a few times.
As soon as I scale the Rancher controller down to 0 replicas, k8s garbage collection cleans up the pods/ReplicaSets. That makes sense, as Rancher Head has a new tokens CRD, which seems to have an invalid label?
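Roughly what I mean by the nginx test and the scale-down check (a sketch; the test deployment name is made up, the rancher deployment name is the one from the pod list above):
# Create and delete a throwaway deployment a few times; kubectl delete uses
# background cascading deletion, so its ReplicaSets/pods are only removed by
# the garbage collector and pile up while GC is stuck.
kubectl create deployment nginx-gc-test --image=nginx
kubectl delete deployment nginx-gc-test
kubectl get rs,pods -l app=nginx-gc-test
# Scale the Rancher controller down; once it is gone, GC catches up and the
# leftover fleet-agent pods/ReplicaSets disappear.
kubectl scale deployment rancher -n cattle-system --replicas=0
kubectl get pods -n cattle-fleet-local-system -w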
Commented offline. Maybe related to: https://github.com/rancher/rancher/pull/49616 and https://github.com/kubernetes/kubernetes/pull/125796
> makes sense as Rancher Head has a new tokens CRD, which seems to have an invalid label?
The invalid label is the system:kube-controller-manager username from a list request getting through into a label selector used to query the backend storage, causing a kube error there.
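To illustrate the validation rule the username runs into (this only shows the general label-value rule, not the token store's internal query):
# ':' is not allowed in a label value, so any selector built from the
# system:kube-controller-manager username fails to parse. The request below is
# rejected (client- or server-side, depending on the kubectl version) with the
# same "unable to parse requirement ... Invalid value" error as in the log.
kubectl get pods \
  -l 'authn.management.cattle.io/token-userId=system:kube-controller-manager'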