[SURE-10837] Helm values resolve to null during agent upgrade (Secret-based values race)

kkaempf opened this issue 3 months ago · 3 comments

SURE-10837

Issue description:

During the Fleet agent upgrade window (Rancher 2.11.3), apps may reconcile before the new Secret-based values are readable, causing Helm to see null values and deploy with chart defaults.

Business impact:

No Business Impact

Workaround:

Is a workaround available and implemented? no

Actual behaviour:

During the agent upgrade window, Fleet reconciles bundles concurrently. The app bundle can render before the new Secret/ConfigMap values are available (or before the upgraded agent reads them).

Result: Helm receives values = null → falls back to chart defaults → misconfigured deploy.
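
For context, a bundle that pulls its Helm values from a Secret is typically declared with Fleet's `valuesFrom` field, as in the following `fleet.yaml` sketch (resource names are illustrative, not taken from the customer's setup). If the referenced Secret is not yet readable when the bundle renders, the values resolve to null:

```yaml
# fleet.yaml (illustrative names)
defaultNamespace: fleet-default
helm:
  chart: ./chart
  valuesFrom:
    # If this Secret is missing or unreadable during the
    # agent upgrade window, Helm falls back to chart defaults.
    - secretKeyRef:
        name: app-values        # hypothetical Secret name
        namespace: fleet-default
        key: values.yaml
```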

Expected behaviour:

There is no implicit ordering between bundles: unless you declare it, Fleet does not guarantee "values first, then app." During upgrades, a race is therefore possible.
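
Fleet's `dependsOn` field lets you declare that ordering explicitly, so the app bundle is only deployed once the bundle providing its values is ready. A minimal sketch, with hypothetical bundle names:

```yaml
# fleet.yaml of the app bundle (bundle name is hypothetical)
dependsOn:
  # Deploy this bundle only after the bundle that creates
  # the values Secret reports Ready.
  - name: app-values-bundle
```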

Files, logs, traces:

Refer to the attached files.

Additional notes:

  1. How is the fleet-agent included in your base image/template (e.g., baked into the AMI/OVA, installed via cloud-init/user-data, Ansible, or a post-install script)?

The fleet agent helm chart is part of a software tar bundle that gets installed on a Linux server. The install scripts ensure that the fleet agent is installed after the k3s service is installed.

  2. The customer's downstream clusters come with a pre-packaged fleet-agent (0.9.2). So even when Rancher is upgraded, they do not explicitly upgrade the fleet-agent package on already-provisioned clusters. They depend on the Rancher/Fleet mechanism to upgrade the agent for them on the next check-in.

So Rancher is upgraded from 2.8.3 -> 2.9.3 -> 2.10.3 -> 2.11.3, while the fleet-agent is still at 0.9.2 when it first connects after registration.

Note: They are upgrading the fleet-agent to 0.12.4 in their next software release, but the process of upgrading all downstream clusters to that release will take time.

  3. The customer has around 334 pods in the fleet-default namespace. They attached the aal-ps-qoe-apps pod log because that is the application they were testing, and it failed.

kkaempf · Oct 15 '25 08:10

SSE took a look:

One way I see to fix this is to use the `dependsOn` field (or something similar) to enforce a dependency sequence.

Might rather be a feature request ... 🤔

kkaempf · Oct 21 '25 08:10

What is expected for this ticket? This looks more like a support ticket than a development ticket. The Rancher system seems to be in an inconsistent state with an old/stuck Fleet version.

thardeck · Nov 03 '25 12:11

Related: #4284.

weyfonk · Nov 12 '25 13:11