Only respect controller refs for resources
This code implements the semantics raised in #501, mainly so I could see what the effects would be. I am actually pretty happy with the result: it suits my use case with Cluster API nicely.
However I am well aware that this code will break a lot of tests (mostly because none of the mock objects have controller: true on their owner references even when the real objects do, e.g. deployment/rs/pod) and that, because it changes the delete semantics, it will never be accepted as-is.
What I would really like to have is an additional sync option that controls this behaviour, and would appreciate any advice on how to achieve a similar effect at a place in the code where I am able to use sync options.
What I am mainly concerned about is allowing Argo (i.e. gitops-engine) to delete resources that have owner references, as long as none of those references have controller: true. However, my understanding of the code is that gitops-engine identifies "top-level" resources (i.e. those without any owner references, as currently implemented) as it deploys them, and will only act on those resources at delete time, without checking the owner references again.
So I am struggling to see how I can get the behaviour I want at a point in the code where the sync options would be available.
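For illustration, the distinction I care about at delete time is whether any entry in `ownerReferences` carries `controller: true`. The kinds and names below are made up for the sketch, not taken from the patch:

```yaml
# A resource with a controlling owner: another controller is actively managing
# it, so gitops-engine should never prune it and risk fighting that controller.
metadata:
  name: example-machineset
  ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      kind: MachineDeployment
      name: example-md
      uid: 00000000-0000-0000-0000-000000000000
      controller: true
      blockOwnerDeletion: true
---
# A resource with only non-controller owner references: the reference exists
# purely for cascading deletion, so Argo should still be allowed to prune it.
metadata:
  name: example-machine-template
  ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      kind: Cluster
      name: example-cluster
      uid: 00000000-0000-0000-0000-000000000000
```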
Codecov Report
Base: 55.75% // Head: 55.67% // Decreases project coverage by 0.09% :warning:
Coverage data is based on head (b0e28e3) compared to base (ed70eac). Patch coverage: 0.00% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## master #503 +/- ##
==========================================
- Coverage 55.75% 55.67% -0.09%
==========================================
Files 41 41
Lines 4525 4532 +7
==========================================
Hits 2523 2523
- Misses 1808 1814 +6
- Partials 194 195 +1
| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/cache/references.go | 61.53% <0.00%> (-6.07%) | :arrow_down: |
| pkg/sync/common/types.go | 54.16% <ø> (ø) | |
I have added a new sync option, but it is only respected on a per-resource annotation basis.
This works perfectly for my use case with Cluster API, but I am happy to make adjustments if required.
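For reference, the option is consumed via the usual per-resource sync-options annotation; the option name below is a placeholder for this sketch rather than the final spelling:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: example-md
  annotations:
    # Placeholder name for the new option added by this patch: prune this
    # resource as long as none of its owner references have controller: true.
    argocd.argoproj.io/sync-options: RespectControllerReferences
```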
@jannfis @spjmurray @jaideepr97
Just flagging that I believe this patch addresses https://github.com/argoproj/argo-cd/issues/11972 and https://github.com/argoproj/argo-cd/issues/12210.
I have been running a custom build of Argo CD that includes this patch and can confirm that it completely fixes my issue. Build is here: https://github.com/stackhpc/argo-cd/pkgs/container/argocd/67404745?tag=v2.6.0-stackhpc.2
Like I said on the issue, this is similar to but subtly different from @spjmurray's solution. In short, I don't believe we should ever delete anything with a reference that has controller: true, as we will probably end up fighting another controller. @spjmurray - this actually turns out to be critical for the Cluster API use case, because CAPI propagates annotations from MachineDeployments to the MachineSets that they manage, resulting in the owner reference being ignored for the MachineSet as well unless controller references are always respected.
I'm happy to do whatever tidying up is required to get this merged.
Kudos, SonarCloud Quality Gate passed! 
0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell
No Coverage information
0.0% Duplication
@mkjpryor, how does this handle node rollouts? We produce new AWSMachineTemplates, for instance, and when we do we no longer produce the old ones. Our MachineDeployments are then updated to reference the new templates. This triggers the creation of new MachineSets after which the replicas are scaled over from the old MachineSets to the new ones.
The problem is that this change would immediately clobber the old AWSMachineTemplates, making the old MachineSets invalid. While scaling them down wouldn't be an issue, scaling them up would be. Unfortunately, scale up during a node rollout can happen and does so by scaling up the old MachineSet. That will fail to spawn new nodes because of the missing template.
Have you run into this problem, and if so what do you do about it?
The key problem this is solving is that if a resource was generated by a Helm template and is no longer generated by that template, it gets deleted, period, do as I command! At present it's impossible to remove a machine deployment without some 3rd-party automation that goes around deleting them manually, due to the owner references that get added by CAPI. I, for one, hate this code because it's a) a nasty hack b) flaky as hell (for reasons I shall not go into).
I'm guessing that if you want old machine templates to live on, you keep them alive in the helm chart and then remove them once your "upgrade" has completed. That would be the simple solution. Not perfect, however, as you'd either need to keep all templates alive forever (which happens implicitly at present), which would probably bloat your charts, or have some tooling to remove stale ones once the referencing machine deployments get killed.
Hard problem however you look at it 😸
@javanthropus
You are right - that is a major issue. In fact, it is so major that we never delete machine templates (OpenStackMachineTemplates in our case), by using the helm.sh/resource-policy: keep annotation on them.
See https://github.com/stackhpc/capi-helm-charts/blob/main/charts/openstack-cluster/templates/node-group/openstack-machine-template.yaml#L61.
When you delete the cluster, they are cleaned up via cascading deletion because Cluster API puts non-controller owner references on them.
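Trimmed down, the relevant part of that template looks roughly like this (the apiVersion and name are illustrative):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
kind: OpenStackMachineTemplate
metadata:
  name: example-cluster-node-group
  annotations:
    # Never deleted by Helm/Argo; cleanup is left to the non-controller owner
    # reference that CAPI adds, so Kubernetes GC removes the template when the
    # Cluster is deleted.
    helm.sh/resource-policy: keep
```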
@spjmurray
When you say
I, for one, hate this code because it's a) a nasty hack b) flaky as hell (for reasons I shall not go into).
which specific piece of the puzzle are you referring to here?
Thank you, @mkjpryor. That's pretty much what I expected. Sadly, the Argo UI becomes cluttered and slow unless we clean up our old templates regularly because we have many different MachineDeployments. All of these require unique machine templates, which we update monthly in order to provide software updates via new machine images. Needless to say, the number of machine templates grows rapidly in our case, and Argo wants to track them all even though the vast majority are no longer needed.
IMO, this problem is caused by CAPI applying the ownerReferences incorrectly. Even if Argo would delete resources with non-controller owners set, it would prematurely clobber the machine templates in our case. Instead, CAPI should set the ownerReferences for machine templates to the MachineSet resource(s) that directly depend on them. When the MachineDeployment controller deletes unneeded MachineSets, the k8s GC would naturally kick in.
Machine templates used by KubeadmControlPlane resources would need to be handled a little differently. In that case, the ownerReference needs to be removed once the machine template is replaced. Argo should then be able to prune the machine template resource normally, since it would no longer have any ownerReferences and would no longer be desired by gitops.
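Sketching what that would look like on a template (this is the proposal, not what CAPI does today; kinds and versions are illustrative):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
metadata:
  name: example-template-2023-01
  ownerReferences:
    # Proposed: owned by the MachineSet(s) that actually reference it, so the
    # Kubernetes GC removes the template when the MachineSet is deleted.
    - apiVersion: cluster.x-k8s.io/v1beta1
      kind: MachineSet
      name: example-md-abc12
      uid: 00000000-0000-0000-0000-000000000000
```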
I've opened a thread in the CAPI Slack about this, so hopefully we can figure something out. Please drop in if you would like to contribute to the discussion.
@mkjpryor
This bit... We essentially have to:
- Derive what MDs we expect to be generated by the chart, which in itself is a pain as we have to know how the chart generates resource names
- Find all the KCTs referenced by those MDs
- Find all the OSMTs referenced by those MDs
- Delete all the resources that aren't part of those sets
I could probably do it more intelligently by adding an annotation to all the resources so we tie them to the helm inputs, rather than following the links, but it's still suboptimal when Argo should be capable of deleting them itself.
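The annotation idea would just stamp each generated resource with the Helm input that produced it, so stale resources could be found without walking the MD → KCT/OSMT links. A purely hypothetical sketch (the annotation key is made up):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
kind: OpenStackMachineTemplate
metadata:
  name: example-md-osmt-abc12
  annotations:
    # Hypothetical: records which node group in the Helm values generated this
    # resource; anything whose value no longer appears in the inputs is stale.
    example.com/node-group: example-md
```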
@javanthropus I did try to open a discussion with CAPI directly here: https://github.com/kubernetes-sigs/cluster-api/issues/7913. It's definitely on the radar, but will take time.
@spjmurray, thanks for that pointer. I piled onto that issue, and if the maintainers agree to my suggestion, I should be able to submit a PR.
@mkjpryor do you have any news from the maintainers on this?
Not at present