Fleet: Bundles fail to deploy to target cluster for ~8 minutes due to TLS handshake error
Setup
- Rancher version: v2.12
- Install type: Docker
Describe the bug
Bundles fail to deploy to the target cluster for ~8 minutes after Rancher is brought up via Docker.
To Reproduce
- Start Rancher using the official Docker method
- Go to Continuous Delivery > Bundles
- Create a Bundle with the following manifest:

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: Bundle
metadata:
  name: test
  namespace: fleet-local
  # annotations: key: string
  # labels: key: string
spec:
  targets:
    - clusterName: local
      clusterSelector:
        matchExpressions:
          - key: fleet.cattle.io/non-managed-agent
            operator: DoesNotExist
      ignore: {}
```
- Observe the bundle's status
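For reference, the bundle status can also be checked from the CLI (a minimal sketch, assuming kubectl access to the local cluster; the bundle name `test` comes from the manifest above):

```sh
# Watch the Bundle status; it should move past WaitApplied once the fleet-agent can apply it
kubectl -n fleet-local get bundles.fleet.cattle.io test -w
```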
Result
- Bundle status remains Wait Applied
- Deployments show 0/1
- No errors are visible in the UI
Expected Result
- Bundle should deploy successfully to the local cluster
- Status should progress to Active
- Deployments show 1/1
Screenshots
Additional context
UPDATE: This issue is not related to Fleet itself. Investigation by the Fleet team confirms the delay is caused by TLS certificate readiness in the Rancher backend, specifically when Rancher is installed via Docker.
Logs show that the dynamic listener certs are not ready for several minutes after startup, which blocks Fleet from applying bundles.
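One way to confirm this from the host is to grep the container logs for the TLS errors and certificate updates (a sketch; `rancher` is a placeholder for whatever your container is actually named):

```sh
# Look for TLS handshake errors and dynamiclistener certificate activity during startup
docker logs rancher 2>&1 | grep -E 'TLS handshake error|dynamiclistener|Updating TLS secret'
```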
No longer able to reproduce this issue locally; however, E2E tests are still failing in CI. I suspect this may be due to bundle deployment timing in the CI environment. I tested with increased timeouts in Cypress, but as shown in the screenshot below, the fleet-agent-local bundle remained in the WaitApplied state even after 5 minutes.
UPDATE: This appears to be an intermittent issue triggered when Rancher has the Cluster Roles aggregation feature enabled (aggregated-roletemplates=true).
https://github.com/user-attachments/assets/24f6b7d5-afad-43b9-8313-8e9f6c9f88b1
Errors seen in the logs:

- error: failed to get content resource: Content.fleet.cattle.io "s-f43d..." not found
- error: Operation cannot be fulfilled on bundles.fleet.cattle.io "fleet-agent-local": StorageError: invalid object
@yonasberhe23 I think we need help from the Fleet team for this; it looks to me like a backend issue. @manno
Tested with Rancher v2.12-62362dfd905a75f77865a47aff683a195cc6ac66-head and Fleet fleet:107.0.0+up0.13.0-alpha.3 and could not reproduce the issue:
Note: we tested with custom flags, and when we tried to add the aggregated-roletemplates flag we noticed a lock and could not modify it. Not sure if this is expected.
With the help of @torchiaf, I managed to deploy Rancher with aggregated-roletemplates=true and it worked for me:
My setup (it contains other flags for other tests, not relevant here):
```sh
helm upgrade --install rancher rancher-latest/rancher \
  --devel \
  --set rancherImageTag=head \
  --namespace cattle-system --create-namespace \
  --set "extraEnv[0].name=CATTLE_SERVER_URL" \
  --set 'extraEnv[1].name=CATTLE_AGENT_IMAGE' \
  --set hostname=$SYSTEM_DOMAIN \
  --set bootstrapPassword=password \
  --set replicas=1 \
  --set agentTLSMode=system-store \
  --set-string "fleet.extraEnv[0].name=EXPERIMENTAL_HELM_OPS" \
  --set 'extraEnv[2].name=CATTLE_FEATURES' \
  --set 'extraEnv[2].value=aggregated-roletemplates=true' \
  --set "extraEnv[1].value=rancher/rancher-agent:head" \
  --wait
```
Not sure if I am missing something on the automation side, but manually at least it seems OK to me.
@mmartin24 thank you for confirming. I'll give it a quick test on the automation side
The test is still failing on CI and I was able to reproduce the issue locally on Rancher v2.12-8024ff603044f337d9e2df5492a45893a1133295-head. @mmartin24 I'll DM my setup.
Automation failure:
Local failure:
https://github.com/user-attachments/assets/9abbc310-a75b-482e-aef7-b035c4625035
Screen.Recording.2025-06-06.at.9.40.52.AM.mov
@yonasberhe23, the setup you sent seems no longer available. Can you please re-share it? Also, would you mind sending the specific command you used to install Rancher, so I can make sure I have the same? Thanks.
@mmartin24 I was able to reproduce the issue without the aggregated-roletemplates=true parameter, so it may have nothing to do with that setting. The bundle does eventually become active, so this may point to some kind of performance degradation or delayed reconciliation rather than a configuration issue.
The attached video shows that the bundle is still not active after 7 minutes. (Note: the video does not show it becoming active, but it does transition to active after around 9 minutes.)
https://github.com/user-attachments/assets/8f7975ae-d739-498c-8857-09960fa61b6e
This is the command I used to install Rancher:
```sh
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  --privileged \
  rancher/rancher:v2.12-8024ff603044f337d9e2df5492a45893a1133295-head
```
Thanks @yonasberhe23. I see it only happens when deploying via Docker. I added a describe of the bundle below and will check with the team.
```
> k describe -n fleet-local bundles.fleet.cattle.io
Name: fleet-agent-local
Namespace: fleet-local
Labels: objectset.rio.cattle.io/hash=2c708c6974cf00d8ca0d5c0f1706ab63548b55d2
Annotations: objectset.rio.cattle.io/applied:
H4sIAAAAAAAA/4yTSW/bPBCG/0owZ8qR7cgLge/0JeihgFukyyXJYUSObdZcBHLkxjX03wtKTuIGTdubRHKW93lnjuCIUSMjyCOg94GRTfAp/4b6GylOxKNowkghs6WRCZdGg4S1Je...
objectset.rio.cattle.io/id: fleet-manage-agent
objectset.rio.cattle.io/owner-gvk: /v1, Kind=Namespace
objectset.rio.cattle.io/owner-name: fleet-local
objectset.rio.cattle.io/owner-namespace:
API Version: fleet.cattle.io/v1alpha1
Kind: Bundle
Metadata:
Creation Timestamp: 2025-06-10T09:48:11Z
Finalizers:
fleet.cattle.io/bundle-finalizer
Generation: 1
Owner References:
API Version: v1
Block Owner Deletion: false
Controller: false
Kind: Namespace
Name: fleet-local
UID: fe3aa9ca-5355-4263-a3ac-92711d59ccd8
Resource Version: 7583
UID: aee625c2-9a4b-47a3-8a7c-aa688265b52a
Spec:
Default Namespace: cattle-fleet-local-system
Helm:
Take Ownership: true
Ignore:
Resources:
Content: apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: cattle-fleet-local-system-fleet-agent-role
rules:
- apiGroups:
- '*'
resources:
- '*'
verbs:
- '*'
- nonResourceURLs:
- '*'
verbs:
- '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: cattle-fleet-local-system-fleet-agent-role-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cattle-fleet-local-system-fleet-agent-role
subjects:
- kind: ServiceAccount
name: fleet-agent
namespace: cattle-fleet-local-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: fleet-agent
namespace: cattle-fleet-local-system
---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
name: default
namespace: cattle-fleet-local-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: fleet-agent
namespace: cattle-fleet-local-system
spec:
replicas: 1
selector:
matchLabels:
app: fleet-agent
strategy: {}
template:
metadata:
creationTimestamp: null
labels:
app: fleet-agent
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- preference:
matchExpressions:
- key: fleet.cattle.io/agent
operator: In
values:
- "true"
weight: 1
containers:
- command:
- fleetagent
env:
- name: BUNDLEDEPLOYMENT_RECONCILER_WORKERS
value: "50"
- name: DRIFT_RECONCILER_WORKERS
value: "50"
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: AGENT_SCOPE
value: cattle-fleet-local-system
- name: CHECKIN_INTERVAL
value: 15m0s
- name: CATTLE_ELECTION_LEASE_DURATION
value: 30s
- name: CATTLE_ELECTION_RETRY_PERIOD
value: 10s
- name: CATTLE_ELECTION_RENEW_DEADLINE
value: 25s
image: rancher/fleet-agent:v0.13.0-alpha.3
imagePullPolicy: IfNotPresent
name: fleet-agent
resources: {}
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
readOnlyRootFilesystem: true
volumeMounts:
- mountPath: /.kube
name: kube
- mountPath: /tmp
name: tmp
nodeSelector:
kubernetes.io/os: linux
securityContext:
runAsGroup: 1000
runAsNonRoot: true
runAsUser: 1000
serviceAccountName: fleet-agent
tolerations:
- effect: NoSchedule
key: node.cloudprovider.kubernetes.io/uninitialized
operator: Equal
value: "true"
- effect: NoSchedule
key: cattle.io/os
operator: Equal
value: linux
volumes:
- emptyDir: {}
name: kube
- emptyDir: {}
name: tmp
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-allow-all
namespace: cattle-fleet-local-system
spec:
egress:
- {}
ingress:
- {}
podSelector: {}
policyTypes:
- Ingress
- Egress
Name: agent.yaml
Targets:
Cluster Name: local
Cluster Selector:
Match Expressions:
Key: fleet.cattle.io/non-managed-agent
Operator: DoesNotExist
Ignore:
Status:
Conditions:
Last Update Time: 2025-06-10T09:48:11Z
Message: WaitApplied(1) [Cluster fleet-local/local]
Status: False
Type: Ready
Display:
Ready Clusters: 0/1
State: WaitApplied
Max New: 50
Max Unavailable: 1
Observed Generation: 1
Partitions:
Count: 1
Max Unavailable: 1
Name: All
Summary:
Desired Ready: 1
Non Ready Resources:
Bundle State: WaitApplied
Name: fleet-local/local
Ready: 0
Wait Applied: 1
Unavailable: 1
resourcesSha256Sum: 00e2868590836e23b55c653f1f5df8794c8f51c3cb1b987c931d4a76921212bd
Summary:
Desired Ready: 1
Non Ready Resources:
Bundle State: WaitApplied
Name: fleet-local/local
Wait Applied: 1
Unavailable: 1
Events: <none>
```
I see that the fleet-agent is taking almost 7-9 minutes to be created by Rancher in the Docker environment.
We use Helm to deploy Rancher and Fleet and we haven't seen the fleet-agent pod take this long to be created.
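A quick way to observe this timing (a sketch, assuming kubectl access to the local cluster; the namespace and label come from the bundle contents above):

```sh
# Watch for the fleet-agent pod to be created and become Ready in the local cluster
kubectl -n cattle-fleet-local-system get pods -l app=fleet-agent -w
```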
I took a look with the team.
Please see these logs from the Docker container: docker-logs.log
There are roughly 8 minutes of this error...
```
2025/06/10 11:34:09 [ERROR] 2025/06/10 11:34:09 http: TLS handshake error from 192.168.68.112:39376: remote error: tls: unknown certificate
```
...until it finally gets updated:
```
2025/06/10 11:42:27 [INFO] certificate CN=dynamic,O=dynamic signed by CN=dynamiclistener-ca@1749555088,O=dynamiclistener-org: notBefore=2025-06-10 11:31:28 +0000 UTC notAfter=2026-06-10 11:42:27 +0000 UTC
2025/06/10 11:42:27 [INFO] Updating TLS secret for cattle-system/tls-rancher-internal (count: 3): map[field.cattle.io/projectId:local:p-78s82 listener.cattle.io/cn-10.43.55.25:10.43.55.25 listener.cattle.io/cn-172.17.0.2:172.17.0.2 listener.cattle.io/fingerprint:SHA1=29E89F3FF2B48E15D0D07B206993A547FFE6F731]
```
It seems this update is not done by Fleet:
```
2025/06/10 11:34:11 [INFO] Updating TLS secret for cattle-system/serving-cert (count: 6): map[field.cattle.io/projectId:local:p-78s82 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.17.0.2:172.17.0.2 listener.cattle.io/cn-
```
So whatever causes this issue in Docker, it does not seem to be Fleet related. I hope this helps narrow down the problem further.
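For anyone digging into this, the secrets mentioned in those log lines can also be inspected directly while Rancher starts up (a sketch; the secret names are taken from the logs above):

```sh
# Check the Rancher TLS secrets mentioned in the logs above
kubectl -n cattle-system get secret tls-rancher-internal serving-cert
# Watch the internal listener secret until it is (re)issued
kubectl -n cattle-system get secret tls-rancher-internal -w
```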
Not a blocker per se but there's a big performance hit here. Let me know if Bullseye needs to help out here but I'll allow Rancher folks to dive in first.
@gaktive @yonasberhe23 should we move it to rancher/rancher ?
@torchiaf this ticket is already in r/r but I may remove this from the UI project board.
@manno is this on Fleet's radar at all?
@yonasberhe23 Slightly confused - which e2e tests are failing? Are they some of your jenkins suite, as I don't see issues with the e2e tests in PRs.
As I understand things via @aruiz14, some of dashboard's CI requires a dashboard ticket. Since this is in rancher/rancher, I can transfer this back to rancher/dashboard and see if that helps
@yonasberhe23 can you rerun E2E with this ticket now in r/d?
For reference, the affected e2e test is cypress/e2e/tests/pages/fleet/resources/bundles.spec.ts
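To run just that spec locally, something like the following should work (a sketch using the plain Cypress CLI; the dashboard repo may have its own wrapper scripts that should be preferred):

```sh
# Run only the affected Fleet bundles spec (assumes the Cypress env/config is already set up)
npx cypress run --spec cypress/e2e/tests/pages/fleet/resources/bundles.spec.ts
```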
My docker instance hit https://github.com/rancher/rancher/issues/50636, which causes a crash loop.
In theory, the creation of the bundle in the test and in the description should never succeed, since they do not have a target GitHub repo.
> @yonasberhe23 Slightly confused - which e2e tests are failing? Are they some of your jenkins suite, as I don't see issues with the e2e tests in PRs.
@nwmac, There are two E2E tests in bundles.spec.ts that fail due to this issue. I’ve temporarily commented out the assertions, each with a note referencing this problem. These tests aren’t failing in PRs because the assertions are currently skipped. Once the TLS-related delay is resolved, I plan to uncomment those assertions. https://github.com/rancher/dashboard/blob/ac2042ca7bcab4292d90f6edcd49bbd1ebb5d3f1/cypress/e2e/tests/pages/fleet/resources/bundles.spec.ts#L136-L137 https://github.com/rancher/dashboard/blob/ac2042ca7bcab4292d90f6edcd49bbd1ebb5d3f1/cypress/e2e/tests/pages/fleet/resources/bundles.spec.ts#L208-L209
Thanks everybody involved; this indeed helped identify a corner case in Fleet's agentmanagement controllers. They use a backoff mechanism that can take up to 10 minutes between retries after a certain number of failures. The controller was treating the absence of a server URL (set neither in the overall config nor in per-cluster settings) as a failure and kept retrying, so even once a valid value becomes available, it is still subject to the backoff:
https://github.com/rancher/fleet/issues/3838
I've submitted a PR to refactor that part, and I confirmed that it now gives up on retrying in that specific case and proceeds with the cluster import within seconds of the server-url Setting being set in Rancher.
Nonetheless, my suggestion is still to set the CATTLE_SERVER_URL environment variable during installation (matching the behavior of Helm installations), and to choose a URL that is actually reachable from the containers running in the embedded k3s (e.g. localhost is not a valid value), so that the fleet-agent can be deployed successfully.
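For the Docker install used above, that could look like the following (a sketch based on the earlier command; the server URL is a placeholder and must be an address reachable from inside the embedded k3s, not localhost):

```sh
# Single-node Docker install with an explicit, reachable server URL
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  --privileged \
  -e CATTLE_SERVER_URL="https://<host-ip-or-dns-reachable-from-k3s>" \
  rancher/rancher:v2.12-8024ff603044f337d9e2df5492a45893a1133295-head
```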
cc/ @richard-cox @yonasberhe23
I checked this issue here: https://github.com/rancher/fleet/issues/3838#issuecomment-3034938838 and for me Fleet bundles and clusters deploy correctly within a few minutes (3-5). Please note that, as Alejandro mentions, I did not use localhost but a valid value (my IP) for the URL. Please feel free to take a look on your side. Thanks.
All good on our end! ✅ Thanks to everyone involved in investigating and resolving this issue! 🙌