[SURE-10837] Helm values resolve to null during agent upgrade (Secret-based values race)
SURE-10837
Issue description:
During the Fleet agent upgrade window (Rancher 2.11.3), apps may reconcile before the new Secret-based values are readable, causing Helm to see null values and deploy with chart defaults.
Business impact:
No Business Impact
Workaround:
Is a workaround available and implemented? no
Actual behaviour:
During the agent upgrade window, Fleet reconciles bundles concurrently. The app bundle can render before the new Secret/ConfigMap values are available (or before the upgraded agent reads them).
Result: Helm receives values = null → falls back to chart defaults → misconfigured deploy.
Expected behaviour:
No implicit ordering between bundles. Unless you declare it, Fleet does not guarantee “values first, then app.” During upgrades, a race is possible.
Files, logs, traces:
Refer to the attached
Additional notes:
- How is the fleet-agent included in your base image/template (e.g., baked into the AMI/OVA, installed via cloud-init/user-data, Ansible, or a post-install script)?
The fleet agent helm chart is part of a software tar bundle that gets installed on a Linux server. The install scripts ensure that the fleet agent is installed after the k3s service is installed.
- The customer's downstream clusters come with a pre-packaged fleet-agent (0.9.2). Even when Rancher is upgraded, they do not explicitly upgrade the fleet-agent package on already provisioned clusters. They depend on the Rancher/Fleet mechanism to upgrade the agent for them on the next check-in.
So Rancher is upgraded from 2.8.3 -> 2.9.3 -> 2.10.3 -> 2.11.3. All this while, the fleet-agent is still at 0.9.2 when it first connects after registration.
Note: They are upgrading the fleet-agent to 0.12.4 in their next software release, but rolling out that release to all downstream clusters will take time.
- The customer has around 334 pods in the fleet default namespace. They added the aal-ps-qoe-apps pod log because that is the application they were testing and that failed.
SSE took a look:
One way I see to fix this is to use the dependsOn field (or something similar) to define a dependency sequence.
Might rather be a feature request ... 🤔
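A minimal sketch of what the dependsOn suggestion could look like in the app bundle's fleet.yaml. All names here (namespace, release, Secret, and bundle names) are hypothetical placeholders, not taken from the customer's setup, and whether this ordering actually closes the race during the agent upgrade window is untested:

```yaml
# fleet.yaml for the app bundle (hypothetical names throughout)
defaultNamespace: example-apps

helm:
  releaseName: example-app
  # Values are read from a Secret that a separate "values" bundle deploys.
  valuesFrom:
    - secretKeyRef:
        name: example-app-values   # hypothetical Secret name
        namespace: example-apps
        key: values.yaml

# Hold this bundle back until the bundle that ships the Secret is ready.
# Bundle names are derived from the GitRepo name and the bundle path,
# so "example-repo-values" is an assumed name for the values bundle.
dependsOn:
  - name: example-repo-values
```

With this in place, Fleet should not deploy the app bundle until the referenced bundle reports ready. It does not by itself cover the case where the upgraded agent cannot yet read the Secret.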
What is expected for this ticket? This looks more like a support ticket than a development ticket. The Rancher system seems to be in an inconsistent state with an old/stuck Fleet version.
Related: #4284.