Fleet rollout gets stuck

Open thardeck opened this issue 2 months ago • 1 comments

This is a copy of the original issue #4144 by @manno, which was taken over by a similar but unrelated topic. The issues are split to preserve the context of each.

I am using Fleet v0.13.2-rc.2 to deploy to 500 clusters. The rollout gets stuck: only 164/500 bundledeployments are created. The deployment consists of 25k bundledeployments. The reconciler workers are configured as follows:

  --set controller.reconciler.workers.gitrepo=100 \
  --set controller.reconciler.workers.bundle=200 \
  --set controller.reconciler.workers.bundledeployment=200 \
  --set controller.reconciler.workers.cluster=100 \

Status of the stuck bundle:

   conditions:
   - lastUpdateTime: "2025-09-17T12:18:45Z"
     message: 'targeting error: failed to get options secret for bundledeployment cluster-fleet-default-d0-k3k-downstream001-downstream0050-fa83d/scale-50-single-scale-50-bundles-twenty-sixteen,
       this is likely temporary: secrets "scale-50-single-scale-50-bundles-twenty-sixteen"
       not found'
     reason: Error
     status: "False"
     type: Ready
   display:
     readyClusters: 164/500
     state: WaitApplied
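
The condition message names the affected bundledeployment as `<namespace>/<name>`, and the secret it reports as missing has the same name. A minimal sketch of how the lookup could be scripted from that message follows. The helper names are hypothetical, and it assumes that the options secret lives in the bundledeployment's namespace and that `kubectl` is configured for the upstream cluster:

```python
import re
import subprocess

# Matches the "<namespace>/<name>" reference in the condition message.
BD_REF = re.compile(r"for bundledeployment\s+(?P<ns>[^/\s]+)/(?P<name>[^,\s]+)")


def bundledeployment_ref(message: str) -> tuple[str, str]:
    """Return (namespace, name) of the bundledeployment named in the message."""
    match = BD_REF.search(message)
    if match is None:
        raise ValueError("no bundledeployment reference found in message")
    return match.group("ns"), match.group("name")


def secret_exists(namespace: str, name: str) -> bool:
    """Return True if `kubectl get secret` finds the secret (assumed location)."""
    result = subprocess.run(
        ["kubectl", "get", "secret", name, "-n", namespace, "-o", "name"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `secret_exists(*bundledeployment_ref(message))` with the message from the status would then report whether the secret is present.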

When I check, the secret is missing.

  • I waited 30 minutes.
  • Restarting the fleet-controller does not create the missing secret.
  • Upgrading to v0.14.0-alpha.2 does not create the secret.
  • Increasing forceSyncGeneration re-creates all bundledeployments, but the rollout may get stuck again.
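
For reference, the last workaround could be scripted roughly as below. This is a sketch only: the function names are hypothetical, it assumes `forceSyncGeneration` is set under `spec` of the GitRepo, and the GitRepo name and namespace passed in are whatever applies to the affected setup:

```python
import json
import subprocess


def force_sync_patch(current_generation: int) -> str:
    """Build a JSON merge patch that increments spec.forceSyncGeneration by one."""
    return json.dumps({"spec": {"forceSyncGeneration": current_generation + 1}})


def bump_force_sync(gitrepo: str, namespace: str, current_generation: int) -> None:
    """Apply the patch with kubectl; raises CalledProcessError on failure."""
    subprocess.run(
        [
            "kubectl", "patch", "gitrepo.fleet.cattle.io", gitrepo,
            "-n", namespace,
            "--type", "merge",
            "-p", force_sync_patch(current_generation),
        ],
        check=True,
    )
```

For example, `bump_force_sync("scale-50-single", "fleet-default", 0)` would set the value to 1, assuming a GitRepo with that name exists in that namespace.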

thardeck, Nov 03 '25 10:11

For the record, I tried to reproduce this error using dartboard tests and was not able to.

This is what I did:

I deployed Fleet chart 0.13.0 and ran the 40-run-test.sh script. In a nutshell, this script:

  • Deploys 50 clusters
  • Deploys 2500 bundledeployments
  • Checks that the gitrepos have status active
  • Deletes all gitrepos
  • Ensures all secrets are deleted

I checked that everything was deleted and that nothing got stuck with 0.13.0. All good:

Starting test at: Thu 30 Oct 2025 04:47:23 PM CET

Clusters: 50
Agents: 50
Bundles: 50d
Total BundleDeployments: 2550
Fleet Version: 0.13.0

gitrepo.fleet.cattle.io/scale-50-single created

created gitrepo for 50 bundles: Thu 30 Oct 2025 04:47:25 PM CET
.
bundles exist: Thu 30 Oct 2025 04:47:34 PM CET

bundledeployments exist: Thu 30 Oct 2025 04:47:35 PM CET
.....
bundles ready: Thu 30 Oct 2025 04:48:04 PM CET

gitrepo ready: Thu 30 Oct 2025 04:48:04 PM CET

clusters ready: Thu 30 Oct 2025 04:48:05 PM CET

delete 
gitrepo.fleet.cattle.io "scale-50-single" deleted from fleet-default namespace

clusters 1/1: Thu 30 Oct 2025 04:48:06 PM CET
...............................No resources found

secrets removed: Thu 30 Oct 2025 04:51:16 PM CET
Warning: Permanently added 'ec2-98-93-104-25.compute-1.amazonaws.com' (ED25519) to the list of known hosts.
Warning: Permanently added 'ip-172-16-105-248.ec2.internal' (ED25519) to the list of known hosts.
828M	/var/lib/rancher/rke2/server/db/

Test completed at: Thu 30 Oct 2025 04:51:19 PM CET

Later I upgraded to 0.13.2-rc.2. All good as well:

Starting test at: Thu 30 Oct 2025 04:54:40 PM CET

Clusters: 50
Agents: 50
Bundles: 50d
Total BundleDeployments: 2550
Fleet Version: 0.13.2-rc.2

gitrepo.fleet.cattle.io/scale-50-single created

created gitrepo for 50 bundles: Thu 30 Oct 2025 04:54:41 PM CET
............................
bundles exist: Thu 30 Oct 2025 04:55:34 PM CET

bundledeployments exist: Thu 30 Oct 2025 04:55:35 PM CET
.........
bundles ready: Thu 30 Oct 2025 04:56:28 PM CET

gitrepo ready: Thu 30 Oct 2025 04:56:28 PM CET

clusters ready: Thu 30 Oct 2025 04:56:29 PM CET

delete 
gitrepo.fleet.cattle.io "scale-50-single" deleted from fleet-default namespace

clusters 1/1: Thu 30 Oct 2025 04:56:31 PM CET
............................No resources found

secrets removed: Thu 30 Oct 2025 04:59:41 PM CET
Warning: Permanently added 'ec2-98-93-104-25.compute-1.amazonaws.com' (ED25519) to the list of known hosts.
Warning: Permanently added 'ip-172-16-105-248.ec2.internal' (ED25519) to the list of known hosts.
828M	/var/lib/rancher/rke2/server/db/

Test completed at: Thu 30 Oct 2025 04:59:44 PM CET
----------------------------------------

... and then I downgraded to 0.13.0 once again, with no issues:

----------------------------------------
Starting test at: Thu 30 Oct 2025 05:04:42 PM CET

Clusters: 50
Agents: 50
Bundles: 50d
Total BundleDeployments: 2550
Fleet Version: 0.13.0

gitrepo.fleet.cattle.io/scale-50-single created

created gitrepo for 50 bundles: Thu 30 Oct 2025 05:04:43 PM CET
..
bundles exist: Thu 30 Oct 2025 05:04:51 PM CET

bundledeployments exist: Thu 30 Oct 2025 05:04:52 PM CET
......
bundles ready: Thu 30 Oct 2025 05:05:27 PM CET

gitrepo ready: Thu 30 Oct 2025 05:05:27 PM CET

clusters ready: Thu 30 Oct 2025 05:05:28 PM CET

delete 
gitrepo.fleet.cattle.io "scale-50-single" deleted from fleet-default namespace

clusters 1/1: Thu 30 Oct 2025 05:05:29 PM CET
..............................No resources found

secrets removed: Thu 30 Oct 2025 05:08:36 PM CET
Warning: Permanently added 'ec2-98-93-104-25.compute-1.amazonaws.com' (ED25519) to the list of known hosts.
Warning: Permanently added 'ip-172-16-105-248.ec2.internal' (ED25519) to the list of known hosts.
828M	/var/lib/rancher/rke2/server/db/

Test completed at: Thu 30 Oct 2025 05:08:39 PM CET
----------------------------------------
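
To compare the three runs, the timestamps in the logs above can be turned into durations. The sketch below measures the time from "created gitrepo" to "bundles ready" for each run. The helper names are hypothetical, and the trailing timezone abbreviation is dropped before parsing, which is safe here only because all timestamps share the same zone:

```python
from datetime import datetime


def parse_ts(text: str) -> datetime:
    """Parse a log timestamp such as 'Thu 30 Oct 2025 04:47:25 PM CET'."""
    stamp = text.rsplit(" ", 1)[0]  # drop the trailing "CET"
    return datetime.strptime(stamp, "%a %d %b %Y %I:%M:%S %p")


def seconds_between(start: str, end: str) -> int:
    """Return the whole seconds elapsed between two log timestamps."""
    return int((parse_ts(end) - parse_ts(start)).total_seconds())


# (created gitrepo, bundles ready) pairs copied from the three logs above.
runs = {
    "0.13.0 (first run)": ("Thu 30 Oct 2025 04:47:25 PM CET", "Thu 30 Oct 2025 04:48:04 PM CET"),
    "0.13.2-rc.2": ("Thu 30 Oct 2025 04:54:41 PM CET", "Thu 30 Oct 2025 04:56:28 PM CET"),
    "0.13.0 (second run)": ("Thu 30 Oct 2025 05:04:43 PM CET", "Thu 30 Oct 2025 05:05:27 PM CET"),
}

durations = {label: seconds_between(start, end) for label, (start, end) in runs.items()}
for label, secs in durations.items():
    print(f"{label}: {secs} s from gitrepo creation to bundles ready")
```

For the timestamps above this gives 39 s, 107 s and 44 s respectively.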

mmartin24, Nov 03 '25 12:11