Resources are missing after changing path pointing to same resources

Open sbulage opened this issue 7 months ago • 7 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

After changing a GitRepo's path to another path that contains similar resources, the resources are deleted and the GitRepo shows a Modified state.

Expected Behavior

Changing the path should not delete the resources.

Steps To Reproduce

GitRepo Creation

  • Create a GitRepo using this path: without-diff
  • Wait for the resources to be created.
  • The GitRepo status will be Modified because one of the jobs is missing. Note: the without-diff path does not use diff in fleet.yaml.

GitRepo Update

  • Edit the above GitRepo and point it to the with-diff path.
  • Wait for some time.
  • The GitRepo stays in the same Modified state, but now because the nginx resource is missing rather than the job. Note: the with-diff path uses diff in fleet.yaml, which should change the state from Modified (due to the missing job) to Active.

Environment

- Architecture:
- Fleet Version:107.0.0+up0.13.0-alpha.3
- Cluster:
  - Provider:K3s
  - Options:
  - Kubernetes Version:

Logs


Anything else?

See this video for more detailed steps to reproduce:

https://github.com/user-attachments/assets/b2da927e-7467-4cce-aeef-6a141c06a984

sbulage avatar Jun 05 '25 15:06 sbulage

I do not have access to the video (GitHub says "No video with supported format and MIME type found"), but a few tests suggest that this is intermittent. Using the provided example, when editing a GitRepo in place to change its path from without-diff to with-diff, the config map test-simple-chart-config is sometimes deleted, leading the bundle deployment, bundle and GitRepo to report the resource as missing.

This seems likely to be caused by a race condition, although it needs a closer look.

weyfonk avatar Jun 30 '25 13:06 weyfonk

I've been investigating this one and can conclude that this is a race condition in fleet apply, caused by the call to pruneBundlesNotFoundInRepo in the CreateBundles function.

The Bundle name changes because the path changes, but the resources are the same (same kind, namespace and name). When the path is changed to with-diff, CreateBundles creates the new Bundle and deletes the previous one, which is named after the previous path (without-diff). However, the deletion runs while the new Bundle is being created; to be more precise, the prune function is called slightly later. If the deletion completes after the new Bundle is fully created, the shared resources are deleted, which leads to the missing resources message.

If the GitRepo is force updated, the issue goes away (so there's a workaround).

0xavi0 avatar Jul 23 '25 07:07 0xavi0

Two ideas from our discussion:

  • we could extend the agent with a prune option. Setting this option,
    • would make the agent compute the resources to be deployed,
    • it would then try to delete the resources first and
    • only afterwards use the helm SDK to deploy the bundle.
    • Downside: this is a manual step; the user could just as well force redeploy the GitRepo
  • we could extend fleet apply to detect conflicts between bundles from the same gitrepo
    • add information about the old bundles to the new bundle (overwrites: [])
    • have the bundle reconciler requeue the bundle, i.e. not create a bundledeployment, until all listed bundles are gone
    • Downside: only works within one GitRepo
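The second idea could look roughly like the sketch below. Everything here is hypothetical: the `Bundle` struct is a trimmed-down stand-in rather than Fleet's actual type, `Overwrites` is the field proposed in the list above, and the bundle names are purely illustrative. The sketch only shows the gating decision, that is, not creating a bundle deployment while any listed bundle still exists.

```go
package main

import "fmt"

// Bundle is a hypothetical, trimmed-down bundle. Overwrites holds the names of
// older bundles from the same GitRepo that this bundle replaces (the proposed
// "overwrites: []" field).
type Bundle struct {
	Name       string
	Overwrites []string
}

// shouldRequeue reports whether the reconciler should hold off creating a
// bundle deployment because a bundle listed in Overwrites still exists.
func shouldRequeue(b Bundle, existing map[string]bool) bool {
	for _, name := range b.Overwrites {
		if existing[name] {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative names only; real bundle names are derived by Fleet.
	newBundle := Bundle{Name: "repo-with-diff", Overwrites: []string{"repo-without-diff"}}

	existing := map[string]bool{"repo-without-diff": true, "repo-with-diff": true}
	fmt.Println("old bundle still present, requeue:", shouldRequeue(newBundle, existing))

	delete(existing, "repo-without-diff")
	fmt.Println("old bundle gone, requeue:", shouldRequeue(newBundle, existing))
}
```

This gate only observes whether the old bundle object still exists; it does not observe whether the agent has finished deleting that bundle's resources.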

manno avatar Jul 23 '25 09:07 manno

@0xavi0 I will repro this and update here, Thanks 😄

sbulage avatar Jul 30 '25 13:07 sbulage

Hello @0xavi0,

I tried 3-4 times: the update succeeded only once, and every other time I was able to reproduce the issue. I used the latest 2.12 with 0.13.1-beta.1.

Re-adding video from description, in case it is not visible.

https://github.com/user-attachments/assets/65dcffd0-ccac-4921-98cc-83765eda0670

sbulage avatar Aug 04 '25 18:08 sbulage

We need to decide how to fix this race condition. Moving to 2.12.2 for now.

0xavi0 avatar Aug 05 '25 08:08 0xavi0

This issue is a bit trickier to fix than we initially thought, because user resources are deleted asynchronously by the agent when it reconciles a deleted bundle deployment. Therefore, merely waiting for obsolete bundles (those that no longer match a path in the GitRepo) to be deleted before creating new bundles does not prevent the race condition: a resource present in both bundles may still be deleted after the new bundle is deployed.

Work has started on a mitigation on this feature branch. It is based on detecting resources common to obsolete (to-be-deleted) and new bundles. The agent can then check a bundle deployment's ModifiedStatus for missing resources and, if any missing resources are part of those overlaps, trigger new deployments, which should re-create them.

This needs more troubleshooting: new deployments are triggered more than once, without the bundle deployment status being updated. This may be because the Helm release already exists in the expected version, so it is neither reinstalled nor upgraded, and the triggered re-deployments result in a no-op. A possible solution could consist of uninstalling the release before installing it again.

Edit: uninstalling the Helm release before installing it again seems to work; the solution needs better testing and error handling.
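The suspected no-op and the uninstall-first workaround can be pictured with a small in-memory model. This is not the Helm SDK and not Fleet's deployer: `Release`, `Store`, `deploy` and `redeploy` are hypothetical names, and the "skip when the recorded release already matches" rule is an assumption taken from the hypothesis in the comment above.

```go
package main

import "fmt"

// Release is a hypothetical record of what was last installed.
type Release struct {
	Chart, Version string
}

// Store maps a release name to its recorded release (toy stand-in for release
// storage).
type Store map[string]Release

// deploy applies the release unless an identical one is already recorded. It
// returns true when resources would actually be (re-)applied.
func (s Store) deploy(name string, want Release) bool {
	if cur, ok := s[name]; ok && cur == want {
		return false // same chart and version already recorded: no-op
	}
	s[name] = want
	return true
}

// redeploy drops the recorded release first, so the following deploy always
// applies resources again.
func (s Store) redeploy(name string, want Release) bool {
	delete(s, name)
	return s.deploy(name, want)
}

func main() {
	s := Store{}
	// Chart name and version are placeholders for illustration.
	rel := Release{Chart: "example-chart", Version: "0.1.0"}

	fmt.Println("first deploy applied:", s.deploy("demo", rel))
	fmt.Println("repeat deploy applied:", s.deploy("demo", rel))
	fmt.Println("uninstall then install applied:", s.redeploy("demo", rel))
}
```

In this model a repeated deployment of an unchanged release never re-creates a deleted resource, whereas removing the recorded release first forces the resources to be applied again.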

weyfonk avatar Oct 09 '25 10:10 weyfonk

Additional QA

Problem

Updating a path in a GitRepo from path A to path B, where both paths contain the same resources (but possibly different configuration), may lead to missing resources after deploying the update.

Solution

  1. When creating a new bundle from path B, fleet apply detects overlaps in resources between the existing, in-cluster bundle (from path A) and the new bundle to be created from path B.
  2. When deploying the new bundle, if the Fleet agent detects a missing resource that matches one of the overlapping resources found in the previous step, it re-deploys the bundle.
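The two steps above can be sketched as follows. This is a minimal illustration and not the code from the feature branch: `ResourceKey`, `overlaps` and `needsRedeploy` are hypothetical names, and the namespace and Job name are assumed for the example.

```go
package main

import "fmt"

// ResourceKey identifies a resource by kind, namespace and name.
// (Hypothetical type, for illustration.)
type ResourceKey struct {
	Kind, Namespace, Name string
}

// overlaps returns the set of resources present in both the existing bundle
// (path A) and the new bundle (path B). This corresponds to step 1.
func overlaps(oldRes, newRes []ResourceKey) map[ResourceKey]bool {
	inOld := make(map[ResourceKey]bool, len(oldRes))
	for _, r := range oldRes {
		inOld[r] = true
	}
	shared := make(map[ResourceKey]bool)
	for _, r := range newRes {
		if inOld[r] {
			shared[r] = true
		}
	}
	return shared
}

// needsRedeploy reports whether any resource flagged as missing is one of the
// overlapping resources. This corresponds to step 2.
func needsRedeploy(missing []ResourceKey, shared map[ResourceKey]bool) bool {
	for _, r := range missing {
		if shared[r] {
			return true
		}
	}
	return false
}

func main() {
	// Namespace and Job name are assumptions made for this example.
	cm := ResourceKey{"ConfigMap", "default", "test-simple-chart-config"}
	job := ResourceKey{"Job", "default", "example-job"}

	shared := overlaps([]ResourceKey{cm, job}, []ResourceKey{cm})
	fmt.Println("overlapping resources:", len(shared))
	fmt.Println("config map missing, redeploy:", needsRedeploy([]ResourceKey{cm}, shared))
	fmt.Println("job missing, redeploy:", needsRedeploy([]ResourceKey{job}, shared))
}
```

Restricting the re-deployment to missing resources that are also overlaps keeps unrelated missing resources from triggering it.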

Testing

Engineering Testing

Manual Testing

N/A

Automated Testing

End-to-end tests based on the reproduction steps above.

QA Testing Considerations

Please retry reproduction steps above.

Regressions Considerations

Existing test suites should run as usual.

weyfonk avatar Nov 17 '25 13:11 weyfonk