serving icon indicating copy to clipboard operation
serving copied to clipboard

fix using KNative with ArgoCD (don't set `ownerReferences` for webhooks)

Open thesuperzapper opened this issue 1 year ago • 13 comments

Background

In 2021 KNative made it so that the ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources were "owned" by the Namespace which KNative is installed into (kanative-serving by default):

  • https://github.com/knative/pkg/pull/2098.

This was intended to ensure that users did not leave the webhooks when uninstalling (via deleting the Namespace) and break KNative when they re-installed (because the webhooks are cluster resources, and their backend would not exist and so would fail to validate anything).

Kubernetes has the concept of ownerReferences to indicate the relationships between resources. If an ownerReference sets blockOwnerDeletion, kubernetes will clean up these "child" resources before/after the "parent" resources is deleted (before: foreground delete, after: background delete).

For example, a Pod owned by a ReplicaSet might have the following ownerReferences:

apiVersion: v1
kind: Pod
metadata:
  name: xxxxxx
  namespace: xxxxxx
  ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: xxxxxx-759d8cb89
      uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Whats the problem?

This breaks ArgoCD, a very widely used GitOps system for Kubernetes.

Specifically, ArgoCD will never remove a resource that has ownerReferences set, so the issue we were trying to prevent actually happens 100% of the time when deploying KNative with ArgoCD.

Here are some related upstream issues:

  • https://github.com/argoproj/argo-cd/issues/4764
  • https://github.com/argoproj/argo-cd/issues/11972
  • https://github.com/argoproj/argo-cd/issues/12210

What's the solution?

I propose we make two changes:

  1. Add a config which will disable the behavior of setting the ownerReferences:
    • For example, an environment variable like KNATIVE_DISABLE_WEBHOOK_OWNER which can be set on all controller pods.
  2. When we do set ownerReferences, we should NOT be setting controller=true:
    • This is much less likely to break downstream projects (obviously the Namespace is not the controlling resource). It also allows KNative to work with the upstream patch for ArgoCD that ignores non-controller ownerReferences.

Where is the relevant code?

The code which sets the ownerReferences lives in the knative-pkg libraries:

thesuperzapper avatar Aug 24 '24 23:08 thesuperzapper

@dprotaso @skonto @creydr thanks for you work on KNative! I wanted to highlight this important issue as it causes KNative to break when deployed with ArgoCD because the webhooks are not removed on uninstall.

thesuperzapper avatar Aug 24 '24 23:08 thesuperzapper

Hey @thesuperzapper I wanted to confirm something

Specifically, ArgoCD will never remove a resource that has ownerReferences set, so the issue we were trying to prevent actually happens 100% of the time when deploying KNative with ArgoCD.

Shouldn't ArgoCD delete the namespace (since it has no owner reference) and then this should propagate and then the webhooks should eventually be deleted?

dprotaso avatar Sep 24 '24 20:09 dprotaso

Hey @thesuperzapper I wanted to confirm something

Specifically, ArgoCD will never remove a resource that has ownerReferences set, so the issue we were trying to prevent actually happens 100% of the time when deploying KNative with ArgoCD.

Shouldn't ArgoCD delete the namespace (since it has no owner reference) and then this should propagate and then the webhooks should eventually be deleted?

@dprotaso there are a few issues with that:

  1. Lots of Argo CD users provision namespaces in a separate application (so deleting the kNative app wouldn't delete it, hence the deadlock)
  2. Even if the application did contain the namespace, I think Argo CD would try and delete non-namespace resources first, still leading to a deadlock.

thesuperzapper avatar Sep 25 '24 03:09 thesuperzapper

Unsure if pkg/3095 closes this issue - since there's no end-user knob to disable the owner references

dprotaso avatar Sep 29 '24 18:09 dprotaso

Lots of Argo CD users provision namespaces in a separate application (so deleting the kNative app wouldn't delete it, hence the deadlock)

Interesting is this documented anywhere? or is there some guide that recommends this?

Even if the application did contain the namespace, I think Argo CD would try and delete non-namespace resources first, still leading to a deadlock.

So you said before argo wouldn't delete resources with owner references - so does it error out or skip the resource?

dprotaso avatar Sep 29 '24 18:09 dprotaso

Lots of Argo CD users provision namespaces in a separate application (so deleting the kNative app wouldn't delete it, hence the deadlock)

Interesting is this documented anywhere? or is there some guide that recommends this?

This is a common pattern, especially in app-of-apps deployments.

Either way, ArgoCD is the most widest used gitOps tool in the world, so if we want people to use kNative, we need to be flexible with it.

Even if the application did contain the namespace, I think Argo CD would try and delete non-namespace resources first, still leading to a deadlock.

So you said before argo wouldn't delete resources with owner references - so does it error out or skip the resource?

@dprotaso think you might be slightly misunderstanding.

What is happening is that ArgoCD will never delete the ValidatingWebhookConfiguration / MutatingWebhookConfiguration resources (because they have the owner reference), however, it will delete the other resources, including the back-end services for the webhooks.

This means that because the webhooks have a failurePolicy: Fail, when that the back end is deleted, any operations which match the webhook selectors are now blocked (because the backend is down).

The easiest way to reproduce this is simply create an ArgoCD app for KNative that has the resources-finalizer.argocd.argoproj.io cascading deletion finalizer, sync it, then try delete it.

This cascading delete will deadlock 100% of the time in the way described.

thesuperzapper avatar Sep 29 '24 21:09 thesuperzapper

@dprotaso I have made a follow-up PR in knative/pgk that adds an environment variable to disable the Namespace owner reference from being set in KNative Serving:

  • https://github.com/knative/pkg/pull/3103

It adds an environment variable called WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP which sets the DisableNamespaceOwnership config that was added by @jonathan-innis in:

  • https://github.com/knative/pkg/pull/3095

Next Steps

I think we should also backport it to 1.14 and 1.15 because otherwise its nearly impossible to deploy KNative Serving with ArgoCD. We might also want to add a basic reference to this in the docs (or hope people stumble across this issue).

That means we will need to cherry-pick the two PRs into the release-1.14 and release-1.15 of knative/pgk, as we can't just update the versions used to master.

We need to update the version of knative/pkg used in the following places:

thesuperzapper avatar Oct 14 '24 22:10 thesuperzapper

Either way, ArgoCD is the most widest used gitOps tool in the world, so if we want people to use kNative, we need to be flexible with it.

I'm good with a toggle in knative.dev/pkg but I want to clarify my position that this is really seems like an ArgoCD bug. You can't expect all OSS projects to change their behaviour because of a quirk in a CD tool. I would continue pushing on that project to do the right thing.

eg. Long term it would be great if we could do the ownership references declaratively https://github.com/kubernetes/kubernetes/issues/102810

I don't think this warrants a cherry pick and cutting new releases. Per our release schedule we'll have a new one out in a week. https://github.com/knative/community/blob/main/mechanics/RELEASE-SCHEDULE.md

dprotaso avatar Oct 15 '24 15:10 dprotaso

I'm good with a toggle in knative.dev/pkg but I want to clarify my position that this is really seems like an ArgoCD bug. You can't expect all OSS projects to change their behaviour because of a quirk in a CD tool. I would continue pushing on that project to do the right thing.

@dprotaso I think that both KNative and ArgoCD are doing something unexpected here.

While I agree that ArgoCD should have a workaround for apps that misuse the OwnerReferences, I think that we (KNative) are actually misusing OwnerRefernces by setting the controller: true flag. That is, the Namespace is not the "controlling" resource for the webhook configuration, so it's confusing ArgoCD which assumes that people never want to delete a resource that is controlled by another.

ArgoCD does this because deleting an owned resource is usually dangerous. For example, when using cert-manager, ArgoCD should never delete a secret that is managed by a Certificate resource (which is indicated by the Certificate "owning" the Secret).

eg. Long term it would be great if we could do the ownership references declaratively kubernetes/kubernetes#102810

I doubt that upstream Kubernetes would accept this because the point of OwnerRefernces is for controllers, not end users.

I don't think this warrants a cherry pick and cutting new releases. Per our release schedule we'll have a new one out in a week. https://github.com/knative/community/blob/main/mechanics/RELEASE-SCHEDULE.md

I really think this is a bug (from the KNative perspective) and since we are still officially supporting 1.14 it warrants a patch release.

thesuperzapper avatar Oct 15 '24 17:10 thesuperzapper

Thinking about this more I wonder if we really just want to set the controller to false but leave the owner reference.

Does that appease ArgoCD?

dprotaso avatar Oct 17 '24 16:10 dprotaso

Looks like there's already a PR out for that - https://github.com/argoproj/gitops-engine/pull/503

dprotaso avatar Oct 17 '24 16:10 dprotaso

Thinking about this more I wonder if we really just want to set the controller to false but leave the owner reference.

Does that appease ArgoCD?

@dprotaso sadly no, but we should still do that change anyway, because it's more semantically correct. At some point ArgoCD might change to ignoring non-controller owners (but there is no consensus in the ArgoCD community yet on if this would introduce it's own problems).

So in the meantime, it would be great to get https://github.com/knative/pkg/pull/3103 merged at at least made part of 1.15 so people can deploy kNative safely from ArgoCD.

thesuperzapper avatar Oct 17 '24 18:10 thesuperzapper

@dprotaso there was a slight issue with my first PR, but I have made a quick followup https://github.com/knative/pkg/pull/3107 which definitely works in my testing.

That is, when the WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP env-var is set to "true" on the "webhook" Deployments (running versions of Knative Serving with this patch), the ownerReferences are not set on the webhook configuration resources.

For example, you can apply the following Kustomize patch to do this:

patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: REPLACED_DURING_PATCH
      spec:
        template:
          spec:
            containers:
              - name: webhook
                env:
                  - name: WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP
                    value: "true"
    target:
      group: apps
      kind: Deployment
      name: ".*webhook.*"

thesuperzapper avatar Oct 18 '24 00:10 thesuperzapper

@thesuperzapper should we close this is done?

skonto avatar Oct 24 '24 09:10 skonto

@skonto it looks like this made its way into serving 1.16.0 and net-istio 1.16.0, so yes, we can probably close it.

thesuperzapper avatar Oct 24 '24 21:10 thesuperzapper

This happened to us today.

Here's the fix:

Add WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP env var to webhook container in knative-serving namespace.

Then commands ran:

kubectl delete MutatingWebhookConfiguration   "webhook.serving.knative.dev"
kubectl delete ValidatingWebhookConfiguration "validation.webhook.serving.knative.dev"
kubectl delete ValidatingWebhookConfiguration "config.webhook.serving.knative.dev"

kubectl delete certificate.networking.internal.knative.dev -n knative-serving         routing-serving-certs
kubectl delete certificate.networking.internal.knative.dev -n knative-serving-ingress routing-serving-certs

for a stuck KnativeServing in deletion state.

aliok avatar Sep 09 '25 14:09 aliok

FYI @aliok - I have an upstream Argo PR to help us avoid this problem - https://github.com/argoproj/gitops-engine/issues/501

dprotaso avatar Sep 09 '25 16:09 dprotaso

@thesuperzapper I'm trying reproduce this with ArgoCD and I cannot - do you have repro steps?

eg. I'm spinning up argo in kind and deploying a test helm chart I made

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: test
spec:
  destination:
    name: ''
    namespace: ''
    server: 'https://kubernetes.default.svc'
  source:
    path: ''
    repoURL: 'https://dprotaso.github.io/test-argo/'
    targetRevision: 0.1.0
    chart: knative-serving
  project: default

After installation I trigger a delete and it all is deleted correctly. Do you recall which ArgoCD version you were using?

dprotaso avatar Oct 03 '25 00:10 dprotaso