
Bundle controller hits massive update conflicts removing finalizers from shared content resources at scale

Open manno opened this issue 2 months ago • 1 comments

Noticed this during my last scaling experiment.

When a single content resource is shared by a large number of bundledeployments (e.g., thousands), the bundle reconciler runs into a race condition when trying to remove finalizers from that shared content resource.

This results in a constant stream of update conflicts. The reconciler's exponential backoff presumably kicks in (I have not confirmed this), and the queue effectively grinds to a halt: I was seeing only a handful of reconciles complete every few minutes. This of course stalls all reconciles for that bundle, so it probably affects updates as well as deletion.

This is the spot in the code I was looking at: https://github.com/manno/fleet/blob/a3c2ba411c3f284f6140cdeb8c82861ca61eb106/internal/cmd/controller/reconciler/bundle_controller.go#L495-L497

Restarting the fleet-controller pod fixes the issue almost immediately. We probably need to find a way to make this finalizer removal less "chatty", or to handle the fan-out scenario more gracefully.
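
To illustrate why the fan-out hurts, here is a rough, stdlib-only Go model of optimistic concurrency. It is not Fleet code. It assumes each bundledeployment owns one finalizer entry on the shared content resource, and it models the worst case where every pending writer reads the same `resourceVersion` before any of them writes. All names (`store`, `perOwnerRemoval`, `batchedRemoval`, the `bd-N` finalizers) are hypothetical and exist only for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's 409 Conflict response.
var errConflict = errors.New("conflict: stale resourceVersion")

// content is a hypothetical stand-in for the shared content resource.
type content struct {
	resourceVersion int
	finalizers      []string
}

// store mimics optimistic concurrency: an update is accepted only if
// the caller's resourceVersion matches the stored one.
type store struct{ obj content }

// newStore creates a resource carrying n hypothetical finalizers.
func newStore(n int) *store {
	fs := make([]string, n)
	for i := range fs {
		fs[i] = fmt.Sprintf("bd-%d", i)
	}
	return &store{obj: content{finalizers: fs}}
}

// get returns a copy of the stored object.
func (s *store) get() content {
	c := s.obj
	c.finalizers = append([]string(nil), s.obj.finalizers...)
	return c
}

// update stores c and bumps the version, or rejects a stale write.
func (s *store) update(c content) error {
	if c.resourceVersion != s.obj.resourceVersion {
		return errConflict
	}
	c.resourceVersion++
	c.finalizers = append([]string(nil), c.finalizers...)
	s.obj = c
	return nil
}

// without returns list with every occurrence of drop removed.
func without(list []string, drop string) []string {
	out := make([]string, 0, len(list))
	for _, f := range list {
		if f != drop {
			out = append(out, f)
		}
	}
	return out
}

// perOwnerRemoval models one update per finalizer. In each round all
// pending owners read the same snapshot, so only the first write wins
// and everyone else has to retry in the next round.
func perOwnerRemoval(n int) (updates, conflicts int) {
	s := newStore(n)
	for len(s.get().finalizers) > 0 {
		pending := s.get().finalizers
		snapshots := make([]content, len(pending))
		for i := range pending {
			snapshots[i] = s.get()
		}
		for i, owner := range pending {
			c := snapshots[i]
			c.finalizers = without(c.finalizers, owner)
			updates++
			if err := s.update(c); errors.Is(err, errConflict) {
				conflicts++
			}
		}
	}
	return updates, conflicts
}

// batchedRemoval models a single writer clearing all finalizers at once.
func batchedRemoval(n int) (updates, conflicts int) {
	s := newStore(n)
	c := s.get()
	c.finalizers = nil
	updates++
	if err := s.update(c); errors.Is(err, errConflict) {
		conflicts++
	}
	return updates, conflicts
}

func main() {
	u, c := perOwnerRemoval(100)
	fmt.Printf("per-owner: %d updates, %d conflicts\n", u, c) // 5050 updates, 4950 conflicts
	u, c = batchedRemoval(100)
	fmt.Printf("batched:   %d updates, %d conflicts\n", u, c) // 1 update, 0 conflicts
}
```

In this model the per-owner path needs n(n+1)/2 update attempts for n finalizers, which would be consistent with the queue stalling once n reaches the thousands. Whether a single batched write is actually safe in the bundle controller depends on who else touches those finalizers, so treat the second function as a direction to explore, not a fix.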

manno, Oct 22 '25 09:10

This will be implemented with https://github.com/rancher/fleet/issues/4192 as they are related.

0xavi0, Nov 12 '25 10:11

/backport v2.13.1

0xavi0, Nov 28 '25 15:11

QA Template.

Please refer to https://github.com/rancher/fleet/issues/4192#issuecomment-3596665898

0xavi0, Dec 01 '25 13:12