
Fleet: Bundles fail to deploy to target cluster for ~8 minutes due to TLS handshake error

Open yonasberhe23 opened this issue 9 months ago • 21 comments

Setup

  • Rancher version: v2.12
  • Install type: Docker

Describe the bug

Bundles fail to deploy to the target cluster for ~8 minutes after Rancher is brought up via Docker.

To Reproduce

  1. Start Rancher using the official Docker method
  2. Go to Continuous Delivery > Bundles
  3. Create a Bundle with the following manifest:
apiVersion: fleet.cattle.io/v1alpha1
kind: Bundle
metadata:
  name: test
  namespace: fleet-local
#  annotations:  key: string
#  labels:  key: string
spec:
  targets:
    - clusterName: local
      clusterSelector:
        matchExpressions:
          - key: fleet.cattle.io/non-managed-agent
            operator: DoesNotExist
      ignore: {}
  4. Observe the bundle's status
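
A quick way to observe this from the CLI, as an alternative to the UI (a sketch; assumes kubectl access to the local cluster and the bundle name test from the manifest above):

# Watch the bundle until it leaves WaitApplied
kubectl get bundles.fleet.cattle.io -n fleet-local test -w

# Or dump the full status, including the per-cluster summary
kubectl get bundles.fleet.cattle.io -n fleet-local test -o yaml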

Result

  • Bundle status remains WaitApplied
  • Deployments show 0/1
  • No errors are visible in the UI

Expected Result

  • Bundle should deploy successfully to the local cluster
  • Status should progress to Active
  • Deployments show 1/1

Screenshots

Image

Additional context

UPDATE: This issue is not related to Fleet itself. Investigation by the Fleet team confirms the delay is caused by TLS cert readiness in the Rancher backend, specifically when Rancher is installed via Docker.

Logs show that the dynamic listener certs are not ready for several minutes after startup, blocking Fleet from applying bundles.
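
A quick way to spot this window in the container logs (a sketch; <rancher-container> is a placeholder for the container name or ID):

# Surface the certificate-related lines from Rancher's startup logs
docker logs -f <rancher-container> 2>&1 | grep -E 'TLS handshake error|Updating TLS secret'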

yonasberhe23 avatar Apr 16 '25 22:04 yonasberhe23

No longer able to reproduce this issue locally; however, E2E tests are still failing in CI. I suspect this may be due to bundle deployment timing in the CI environment. I tested with increased timeouts in Cypress, but as shown in the screenshot below, the fleet-agent-local bundle remained in the WaitApplied state even after 5 minutes.

Image

yonasberhe23 avatar Apr 17 '25 18:04 yonasberhe23

UPDATE: This appears to be an intermittent issue triggered when Rancher has the Cluster Roles aggregation feature enabled (aggregated-roletemplates=true).
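
For reference, the flag can also be set at container start through the CATTLE_FEATURES environment variable (a sketch; the image tag is illustrative):

# Start Rancher in Docker with the Cluster Roles aggregation feature enabled
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  --privileged \
  -e CATTLE_FEATURES=aggregated-roletemplates=true \
  rancher/rancher:v2.12-head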

https://github.com/user-attachments/assets/24f6b7d5-afad-43b9-8313-8e9f6c9f88b1

errors seen in the logs:

  • error: failed to get content resource: Content.fleet.cattle.io "s-f43d..." not found
  • error: Operation cannot be fulfilled on bundles.fleet.cattle.io "fleet-agent-local": StorageError: invalid object

yonasberhe23 avatar Apr 24 '25 22:04 yonasberhe23

@yonasberhe23 I think we need help from the Fleet team on this; it looks like a backend issue to me. @manno

torchiaf avatar May 11 '25 14:05 torchiaf

Tested v2.12-62362dfd905a75f77865a47aff683a195cc6ac66-head with Fleet fleet:107.0.0+up0.13.0-alpha.3 and could not reproduce the issue:

Image

Note: we tested with custom flags, and when we tried to add the aggregated-roletemplates flag we noticed a lock and could not modify it. Not sure if this is expected.

Image
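
One way to check whether the flag is locked (a sketch, assuming Rancher's Feature CRD; a populated status.lockedValue would explain why the UI refuses edits):

# Inspect the feature flag's spec and status on the local cluster
kubectl get features.management.cattle.io aggregated-roletemplates -o yaml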

mmartin24 avatar Jun 05 '25 11:06 mmartin24

With the help of @torchiaf , I managed to deploy Rancher with aggregated-roletemplates=true and it worked for me:

Image

My setup (it contains other flags for other tests, not relevant here):

helm upgrade --install rancher rancher-latest/rancher \
  --devel \
  --set rancherImageTag=head \
  --namespace cattle-system --create-namespace \
  --set "extraEnv[0].name=CATTLE_SERVER_URL" \
  --set 'extraEnv[1].name=CATTLE_AGENT_IMAGE' \
  --set "extraEnv[1].value=rancher/rancher-agent:head" \
  --set 'extraEnv[2].name=CATTLE_FEATURES' \
  --set 'extraEnv[2].value=aggregated-roletemplates=true' \
  --set hostname=$SYSTEM_DOMAIN \
  --set bootstrapPassword=password \
  --set replicas=1 \
  --set agentTLSMode=system-store \
  --set-string "fleet.extraEnv[0].name=EXPERIMENTAL_HELM_OPS" \
  --wait

Not sure if I am missing something on the automation side, but manually at least it seems OK to me.

mmartin24 avatar Jun 05 '25 15:06 mmartin24

@mmartin24 thank you for confirming. I'll give it a quick test on the automation side

yonasberhe23 avatar Jun 05 '25 16:06 yonasberhe23

The test is still failing in CI, and I was able to reproduce the issue locally on rancher v2.12-8024ff603044f337d9e2df5492a45893a1133295-head. @mmartin24 I'll DM my setup.

Automation failure:

Image

Local failure:

https://github.com/user-attachments/assets/9abbc310-a75b-482e-aef7-b035c4625035

yonasberhe23 avatar Jun 06 '25 16:06 yonasberhe23

> The test is still failing in CI, and I was able to reproduce the issue locally on rancher v2.12-8024ff603044f337d9e2df5492a45893a1133295-head. @mmartin24 I'll DM my setup.
>
> Automation failure:
>
> Image
>
> Local failure:
>
> Screen.Recording.2025-06-06.at.9.40.52.AM.mov

@yonasberhe23, the setup you sent seems to be no longer available. Can you please re-share? Also, would you mind sending the specific command you used to install Rancher, so I can make sure I have the same? Thanks.

mmartin24 avatar Jun 09 '25 07:06 mmartin24

@mmartin24 I was able to reproduce the issue without the aggregated-roletemplates=true parameter, so it may have nothing to do with that setting. The bundle does eventually become active, so this may point to some kind of performance degradation or delayed reconciliation rather than a configuration issue.

The attached video shows that the bundle is still not active after 7 minutes. (Note: the video does not show it becoming active, but it does transition to active after around 9 minutes.)

https://github.com/user-attachments/assets/8f7975ae-d739-498c-8857-09960fa61b6e
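
To put a number on the delay, something like this can timestamp every state change (a sketch):

# Watch the bundle and prefix each update with a wall-clock timestamp
kubectl get bundles.fleet.cattle.io -n fleet-local -w --no-headers \
  | while read -r line; do echo "$(date +%T) $line"; done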

This is the command I used to install Rancher:

docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  --privileged \
  rancher/rancher:v2.12-8024ff603044f337d9e2df5492a45893a1133295-head

yonasberhe23 avatar Jun 10 '25 01:06 yonasberhe23

Thanks @yonasberhe23. I see it only happens when deploying via Docker. I've added a describe of the bundle below and will check with the team.

> k describe -n fleet-local bundles.fleet.cattle.io 
Name:         fleet-agent-local
Namespace:    fleet-local
Labels:       objectset.rio.cattle.io/hash=2c708c6974cf00d8ca0d5c0f1706ab63548b55d2
Annotations:  objectset.rio.cattle.io/applied:
                H4sIAAAAAAAA/4yTSW/bPBCG/0owZ8qR7cgLge/0JeihgFukyyXJYUSObdZcBHLkxjX03wtKTuIGTdubRHKW93lnjuCIUSMjyCOg94GRTfAp/4b6GylOxKNowkghs6WRCZdGg4S1Je...
              objectset.rio.cattle.io/id: fleet-manage-agent
              objectset.rio.cattle.io/owner-gvk: /v1, Kind=Namespace
              objectset.rio.cattle.io/owner-name: fleet-local
              objectset.rio.cattle.io/owner-namespace: 
API Version:  fleet.cattle.io/v1alpha1
Kind:         Bundle
Metadata:
  Creation Timestamp:  2025-06-10T09:48:11Z
  Finalizers:
    fleet.cattle.io/bundle-finalizer
  Generation:  1
  Owner References:
    API Version:           v1
    Block Owner Deletion:  false
    Controller:            false
    Kind:                  Namespace
    Name:                  fleet-local
    UID:                   fe3aa9ca-5355-4263-a3ac-92711d59ccd8
  Resource Version:        7583
  UID:                     aee625c2-9a4b-47a3-8a7c-aa688265b52a
Spec:
  Default Namespace:  cattle-fleet-local-system
  Helm:
    Take Ownership:  true
  Ignore:
  Resources:
    Content:  apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cattle-fleet-local-system-fleet-agent-role
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - '*'
- nonResourceURLs:
  - '*'
  verbs:
  - '*'

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cattle-fleet-local-system-fleet-agent-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cattle-fleet-local-system-fleet-agent-role
subjects:
- kind: ServiceAccount
  name: fleet-agent
  namespace: cattle-fleet-local-system

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fleet-agent
  namespace: cattle-fleet-local-system

---
apiVersion: v1
automountServiceAccountToken: false
kind: ServiceAccount
metadata:
  name: default
  namespace: cattle-fleet-local-system

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fleet-agent
  namespace: cattle-fleet-local-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fleet-agent
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: fleet-agent
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: fleet.cattle.io/agent
                operator: In
                values:
                - "true"
            weight: 1
      containers:
      - command:
        - fleetagent
        env:
        - name: BUNDLEDEPLOYMENT_RECONCILER_WORKERS
          value: "50"
        - name: DRIFT_RECONCILER_WORKERS
          value: "50"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: AGENT_SCOPE
          value: cattle-fleet-local-system
        - name: CHECKIN_INTERVAL
          value: 15m0s
        - name: CATTLE_ELECTION_LEASE_DURATION
          value: 30s
        - name: CATTLE_ELECTION_RETRY_PERIOD
          value: 10s
        - name: CATTLE_ELECTION_RENEW_DEADLINE
          value: 25s
        image: rancher/fleet-agent:v0.13.0-alpha.3
        imagePullPolicy: IfNotPresent
        name: fleet-agent
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
        volumeMounts:
        - mountPath: /.kube
          name: kube
        - mountPath: /tmp
          name: tmp
      nodeSelector:
        kubernetes.io/os: linux
      securityContext:
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
      serviceAccountName: fleet-agent
      tolerations:
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Equal
        value: "true"
      - effect: NoSchedule
        key: cattle.io/os
        operator: Equal
        value: linux
      volumes:
      - emptyDir: {}
        name: kube
      - emptyDir: {}
        name: tmp

---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-allow-all
  namespace: cattle-fleet-local-system
spec:
  egress:
  - {}
  ingress:
  - {}
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

    Name:  agent.yaml
  Targets:
    Cluster Name:  local
    Cluster Selector:
      Match Expressions:
        Key:       fleet.cattle.io/non-managed-agent
        Operator:  DoesNotExist
    Ignore:
Status:
  Conditions:
    Last Update Time:  2025-06-10T09:48:11Z
    Message:           WaitApplied(1) [Cluster fleet-local/local]
    Status:            False
    Type:              Ready
  Display:
    Ready Clusters:     0/1
    State:              WaitApplied
  Max New:              50
  Max Unavailable:      1
  Observed Generation:  1
  Partitions:
    Count:            1
    Max Unavailable:  1
    Name:             All
    Summary:
      Desired Ready:  1
      Non Ready Resources:
        Bundle State:  WaitApplied
        Name:          fleet-local/local
      Ready:           0
      Wait Applied:    1
    Unavailable:       1
  resourcesSha256Sum:  00e2868590836e23b55c653f1f5df8794c8f51c3cb1b987c931d4a76921212bd
  Summary:
    Desired Ready:  1
    Non Ready Resources:
      Bundle State:  WaitApplied
      Name:          fleet-local/local
    Wait Applied:    1
  Unavailable:       1
Events:              <none>

mmartin24 avatar Jun 10 '25 09:06 mmartin24

I see that the fleet-agent pod is taking almost 7-9 minutes to be created by Rancher in the Docker env.

We use Helm to deploy Rancher and Fleet, and we haven't seen the fleet-agent pod take this long to create there.

Image
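
One way to quantify that lag (a sketch; <rancher-container> is a placeholder, and the label comes from the fleet-agent Deployment above):

# Compare the fleet-agent pod's creation time against the Rancher container's start time
kubectl -n cattle-fleet-local-system get pod -l app=fleet-agent \
  -o jsonpath='{.items[0].metadata.creationTimestamp}{"\n"}'
docker inspect -f '{{.State.StartedAt}}' <rancher-container>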

sbulage avatar Jun 10 '25 10:06 sbulage

I took a look with the team.

Please see these logs from the Docker container: docker-logs.log

There are about 8 minutes of this error...

2025/06/10 11:34:09 [ERROR] 2025/06/10 11:34:09 http: TLS handshake error from 192.168.68.112:39376: remote error: tls: unknown certificate

...until it finally gets updated:

2025/06/10 11:42:27 [INFO] certificate CN=dynamic,O=dynamic signed by CN=dynamiclistener-ca@1749555088,O=dynamiclistener-org: notBefore=2025-06-10 11:31:28 +0000 UTC notAfter=2026-06-10 11:42:27 +0000 UTC
2025/06/10 11:42:27 [INFO] Updating TLS secret for cattle-system/tls-rancher-internal (count: 3): map[field.cattle.io/projectId:local:p-78s82 listener.cattle.io/cn-10.43.55.25:10.43.55.25 listener.cattle.io/cn-172.17.0.2:172.17.0.2 listener.cattle.io/fingerprint:SHA1=29E89F3FF2B48E15D0D07B206993A547FFE6F731]

It seems this update is not done by Fleet:

2025/06/10 11:34:11 [INFO] Updating TLS secret for cattle-system/serving-cert (count: 6): map[field.cattle.io/projectId:local:p-78s82 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.17.0.2:172.17.0.2 listener.cattle.io/cn-

So whatever causes this issue in Docker, it does not seem to be Fleet-related. I hope this helps narrow down the problem.
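
For anyone digging further, the cert's validity window can be read straight from the secret named in those log lines (a sketch; assumes openssl and kubectl access to the embedded k3s):

# Decode the serving cert and print its subject and validity window
kubectl -n cattle-system get secret tls-rancher-internal \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -dates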

mmartin24 avatar Jun 10 '25 14:06 mmartin24

Not a blocker per se, but there's a big performance hit here. Let me know if Bullseye needs to help out, but I'll let the Rancher folks dive in first.

gaktive avatar Jun 11 '25 18:06 gaktive

@gaktive @yonasberhe23 should we move it to rancher/rancher ?

torchiaf avatar Jun 13 '25 07:06 torchiaf

@torchiaf this ticket is already in r/r, but I may remove it from the UI project board.

@manno is this on Fleet's radar at all?

gaktive avatar Jun 18 '25 17:06 gaktive

@yonasberhe23 Slightly confused: which e2e tests are failing? Are they part of your Jenkins suite? I don't see issues with the e2e tests in PRs.

nwmac avatar Jun 23 '25 16:06 nwmac

As I understand things via @aruiz14, some of the dashboard's CI requires a dashboard ticket. Since this is in rancher/rancher, I can transfer it back to rancher/dashboard and see if that helps.

gaktive avatar Jun 23 '25 21:06 gaktive

@yonasberhe23 can you rerun E2E with this ticket now in r/d?

gaktive avatar Jun 23 '25 21:06 gaktive

For reference, the affected e2e test is cypress/e2e/tests/pages/fleet/resources/bundles.spec.ts

My docker instance hit https://github.com/rancher/rancher/issues/50636, which causes a crash loop.

In theory, the bundles created in the test and in the issue description should never succeed, as they do not have a target gh repo.

richard-cox avatar Jun 24 '25 08:06 richard-cox

> @yonasberhe23 Slightly confused: which e2e tests are failing? Are they part of your Jenkins suite? I don't see issues with the e2e tests in PRs.

@nwmac, there are two E2E tests in bundles.spec.ts that fail due to this issue. I've temporarily commented out the assertions, each with a note referencing this problem. These tests aren't failing in PRs because the assertions are currently skipped. Once the TLS-related delay is resolved, I plan to uncomment those assertions.

https://github.com/rancher/dashboard/blob/ac2042ca7bcab4292d90f6edcd49bbd1ebb5d3f1/cypress/e2e/tests/pages/fleet/resources/bundles.spec.ts#L136-L137
https://github.com/rancher/dashboard/blob/ac2042ca7bcab4292d90f6edcd49bbd1ebb5d3f1/cypress/e2e/tests/pages/fleet/resources/bundles.spec.ts#L208-L209

yonasberhe23 avatar Jun 24 '25 17:06 yonasberhe23

Thanks everybody involved; this indeed helped identify a corner case in Fleet's agentmanagement controllers, which have a backoff mechanism that can take up to 10 minutes between retries after a certain number of failures. The controller was treating the absence of a server URL (from either the overall config or the per-cluster settings) as a failure and kept retrying, so even once a valid value became available, it was still subject to the backoff: https://github.com/rancher/fleet/issues/3838. I've submitted a PR to refactor that part, and I confirmed that it now gives up retrying in that specific case and proceeds with the cluster import only seconds after the server-url Setting is set in Rancher.

Nonetheless, my suggestion is still to set the CATTLE_SERVER_URL environment variable during installation (matching the behavior of Helm installations), and to select a URL that is actually reachable from the containers running in the embedded k3s (e.g. localhost is not a valid value), so that the fleet-agent can be deployed successfully. cc/ @richard-cox @yonasberhe23
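
A sketch of that suggestion for the Docker install (the URL and image tag are placeholders; the URL must be reachable from pods inside the embedded k3s, so localhost will not work):

# Start Rancher with an explicit server URL, mirroring the Helm installs
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  --privileged \
  -e CATTLE_SERVER_URL=https://<host-ip> \
  rancher/rancher:v2.12-head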

aruiz14 avatar Jun 26 '25 11:06 aruiz14

I checked this issue here: https://github.com/rancher/fleet/issues/3838#issuecomment-3034938838 and for me Fleet bundles and clusters deploy correctly within a few minutes (3-5). Please note that, as Alejandro mentions, I did not use localhost but instead a valid value (my IP) for the URL. Please feel free to take a look on your side. Thanks.

mmartin24 avatar Jul 04 '25 08:07 mmartin24

All good on our end! ✅ Thanks to everyone involved in investigating and resolving this issue! 🙌

yonasberhe23 avatar Jul 18 '25 16:07 yonasberhe23