Resources are missing after changing path pointing to same resources
Is there an existing issue for this?
- [x] I have searched the existing issues
Current Behavior
After changing a GitRepo's path to another path containing similar resources, resources get deleted and the GitRepo shows a Modified state.
Expected Behavior
Even after changing path, it should not delete the resources.
Steps To Reproduce
GitRepo Creation
- Create a GitRepo using this path: without-diff
- Wait for the resources to be created.
- GitRepo status will be Modified because a job resource is missing. Note: in the without-diff path we're not using diff in fleet.yaml.
GitRepo Update
- Edit the above GitRepo and point it to the with-diff path.
- Wait for some time.
- GitRepo stays in the same state, i.e. Modified, but due to the nginx resource missing, not due to the job missing. Note: in the with-diff path we are using diff in fleet.yaml, which should change the Modified state caused by the missing job to Active.
Environment
- Architecture:
- Fleet Version:107.0.0+up0.13.0-alpha.3
- Cluster:
- Provider:K3s
- Options:
- Kubernetes Version:
Logs
Anything else?
See this video for more detailed steps to reproduce:
https://github.com/user-attachments/assets/b2da927e-7467-4cce-aeef-6a141c06a984
I do not have access to the video (GitHub says "No video with supported format and MIME type found"), but a few tests suggest that this is intermittent. Using the provided example, when editing a GitRepo in-place, changing its path from without-diff to with-diff, config map test-simple-chart-config is sometimes deleted, leading to the bundle deployment, bundle and GitRepo seeing the resource as missing.
This seems likely to be caused by a race condition, although it needs a closer look.
I've been investigating this one and I can conclude that this is a race condition happening in fleet apply, caused by the call to pruneBundlesNotFoundInRepo in the CreateBundles function.
The Bundle name changes because the path changes, but the resources are the same (same kind, namespace and name).
CreateBundles creates the new bundle when the path is changed to with-diff and deletes the previous one named after the previous path (without-diff).
But the deletion happens while the new Bundle is being created; to be more precise, the pruning function is called slightly later.
If the deletion completes after the new Bundle is fully created, the shared resources are deleted (which leads to the missing resources message).
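To make the ordering problem concrete, here is a minimal toy model of it. This is not Fleet code: Cluster, missing_after and the event names are made up for illustration, and so are the default namespace and the extra Service resource. Only the test-simple-chart-config config map name comes from this issue.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# A resource is identified by (kind, namespace, name) only.
ResourceKey = Tuple[str, str, str]


@dataclass
class Cluster:
    """Toy stand-in for the downstream cluster: a set of live resources."""
    resources: Set[ResourceKey] = field(default_factory=set)

    def deploy(self, bundle_resources: Set[ResourceKey]) -> None:
        # Deploying a bundle ensures all of its resources exist.
        self.resources |= bundle_resources

    def delete(self, bundle_resources: Set[ResourceKey]) -> None:
        # Deleting a bundle removes its resources by identity,
        # regardless of which bundle created them last.
        self.resources -= bundle_resources


def missing_after(events: List[str], old: Set[ResourceKey],
                  new: Set[ResourceKey]) -> Set[ResourceKey]:
    """Replay events in order; return resources of `new` that are missing."""
    cluster = Cluster(resources=set(old))  # the old bundle is already deployed
    for event in events:
        if event == "deploy_new":
            cluster.deploy(new)
        elif event == "delete_old":
            cluster.delete(old)
        else:
            raise ValueError(f"unknown event: {event}")
    return new - cluster.resources


# Namespace and the extra Service are hypothetical.
shared = ("ConfigMap", "default", "test-simple-chart-config")
old_bundle = {shared}
new_bundle = {shared, ("Service", "default", "extra")}

safe_order = missing_after(["delete_old", "deploy_new"], old_bundle, new_bundle)
racy_order = missing_after(["deploy_new", "delete_old"], old_bundle, new_bundle)
```

In this model, safe_order is empty while racy_order contains the shared config map, which matches the intermittent behaviour: the outcome only depends on which of the two operations finishes last.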
If the GitRepo is force updated the issue is gone (so there's a workaround).
Two ideas from our discussion:
- we could extend the agent with a prune option. Setting this option,
  - would make the agent compute the resources to be deployed,
  - it would then try to delete the resources first and
  - only afterwards use the helm SDK to deploy the bundle.
  - downside: manual step, could just as well force redeploy the gitrepo
- we could extend fleet apply to detect conflicts between bundles from the same gitrepo
  - add information about the old bundles to the new bundle (overwrites: [])
  - have the bundle reconciler requeue the bundle, i.e. not create a bundledeployment, until all listed bundles are gone
  - downside: only works within one gitrepo
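A rough sketch of what the second idea could look like, just to illustrate the gating logic. Nothing here is an existing Fleet API: the Bundle class, the reconcile function, the returned strings and the bundle names are all hypothetical, and the overwrites field is only the one proposed above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Bundle:
    name: str
    # Proposed field: names of older bundles this bundle replaces.
    overwrites: List[str] = field(default_factory=list)


def reconcile(bundle: Bundle, existing_bundles: Set[str]) -> str:
    """Decide what a reconciler would do for `bundle` in this toy model."""
    # Any overwritten bundle that still exists blocks the new deployment.
    pending = [name for name in bundle.overwrites if name in existing_bundles]
    if pending:
        return "requeue"
    return "create_bundledeployment"


# Hypothetical bundle names derived from the two paths.
new_bundle_obj = Bundle("repo-with-diff", overwrites=["repo-without-diff"])
while_old_exists = reconcile(new_bundle_obj, {"repo-without-diff", "repo-with-diff"})
after_old_gone = reconcile(new_bundle_obj, {"repo-with-diff"})
```

Note that this only waits for the old bundle object to disappear; it says nothing about when the resources it deployed are actually removed from the cluster.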
@0xavi0 I will repro this and update here, Thanks 😄
Hello @0xavi0,
I tried 3-4 times: the update succeeded only once, and I was able to reproduce the issue the rest of the time. Used latest 2.12 with 0.13.1-beta.1.
Re-adding video from description, in case it is not visible.
https://github.com/user-attachments/assets/65dcffd0-ccac-4921-98cc-83765eda0670
We need to decide how to fix this race condition. Moving to 2.12.2 for now
This issue is a bit trickier to fix than we had initially thought, as user resources are deleted asynchronously by the agent when reconciling a deleted bundle deployment. Therefore, merely waiting for obsolete bundles, which no longer match a path in a GitRepo, to be deleted before creating new bundles does not prevent the race condition where a resource present in both bundles may be deleted after deploying the new bundle.
Work has started on a mitigation on this feature branch, based on detecting common resources between obsolete (to-be-deleted) and new bundles, enabling the agent to check a bundle deployment's ModifiedStatus for missing resources and, if missing resources are part of those overlaps, triggering new deployments which should re-create those missing resources.
This needs more troubleshooting: new deployments are triggered more than once, without updating the bundle deployment status. This may be because the Helm release already exists in the expected version, so it is neither reinstalled nor upgraded, and triggered re-deployments result in a no-op. A possible solution could consist of uninstalling the release before installing it again.
Edit: uninstalling the Helm release before installing it again seems to work; the solution needs better testing and error handling.
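For reference, the decision described in the last few comments can be sketched as below. This is a simplified model, not the code on the feature branch: plan_redeploy, its parameters and the step names are hypothetical, and the real agent works with bundle deployment status and the Helm SDK rather than plain sets.

```python
from typing import List, Set, Tuple

ResourceKey = Tuple[str, str, str]  # (kind, namespace, name)


def plan_redeploy(missing: Set[ResourceKey], overlaps: Set[ResourceKey],
                  release_installed: bool) -> List[str]:
    """Return the ordered steps to recover resources lost to the race."""
    # Only react to missing resources that are known overlaps with an
    # obsolete bundle; other missing resources are left to normal handling.
    lost = missing & overlaps
    if not lost:
        return []
    steps: List[str] = []
    if release_installed:
        # Re-installing an existing release in the same version is a no-op,
        # so the release is removed first.
        steps.append("uninstall")
    steps.append("install")
    return steps


# Namespace is hypothetical; the config map name is the one from this issue.
cm = ("ConfigMap", "default", "test-simple-chart-config")
steps_for_lost_overlap = plan_redeploy({cm}, {cm}, release_installed=True)
steps_for_unrelated = plan_redeploy({cm}, set(), release_installed=True)
```

Restricting the trigger to overlapping resources keeps the forced uninstall from firing on every Modified status.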
Additional QA
Problem
Updating a path in a GitRepo from path A to path B, where both paths contain the same resources (but possibly different configuration), may lead to missing resources after deploying the update.
Solution
- When creating a new bundle from path B, fleet apply detects overlaps in resources between the existing, in-cluster bundle (from path A) and the new bundle to be created from path B.
- When deploying the new bundle, if it detects a missing resource matching one of the overlapping resources found in the previous step, the Fleet agent re-deploys the bundle.
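The overlap detection in the first step can be pictured as a set intersection over resource identities. The sketch below assumes manifests are already parsed into dictionaries and that identity is (kind, namespace, name); resource_keys, find_overlaps and the sample manifests are hypothetical and do not mirror the actual fleet apply implementation.

```python
from typing import Dict, List, Set, Tuple

ResourceKey = Tuple[str, str, str]  # (kind, namespace, name)


def resource_keys(manifests: List[Dict]) -> Set[ResourceKey]:
    """Extract the identity of each parsed manifest."""
    keys: Set[ResourceKey] = set()
    for manifest in manifests:
        metadata = manifest.get("metadata", {})
        # An absent namespace is kept as an empty string here; how it is
        # resolved in practice is outside the scope of this sketch.
        keys.add((manifest["kind"], metadata.get("namespace", ""), metadata["name"]))
    return keys


def find_overlaps(old_manifests: List[Dict], new_manifests: List[Dict]) -> List[ResourceKey]:
    """Resources present in both the obsolete and the new bundle, sorted."""
    return sorted(resource_keys(old_manifests) & resource_keys(new_manifests))


# Sample manifests; the namespace and the nginx Deployment name are assumptions.
path_a = [
    {"kind": "ConfigMap", "metadata": {"namespace": "default", "name": "test-simple-chart-config"}},
]
path_b = [
    {"kind": "ConfigMap", "metadata": {"namespace": "default", "name": "test-simple-chart-config"}},
    {"kind": "Deployment", "metadata": {"namespace": "default", "name": "nginx"}},
]
overlapping = find_overlaps(path_a, path_b)
```

Resource content is deliberately ignored: two bundles overlap as soon as they declare the same identity, even if the configuration differs between path A and path B.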
Testing
Engineering Testing
Manual Testing
N/A
Automated Testing
End-to-end tests based on the reproduction steps above.
QA Testing Considerations
Please retry reproduction steps above.
Regressions Considerations
Existing test suites should run as usual.