Cluster API should not use controller refs for resources that it does not create
What steps did you take and what happened:
Attempting to create a Cluster using an external tool such as Crossplane causes issues, because both CAPI and Crossplane want to hold the single controller reference on the resources Crossplane creates.
We've used controller refs in the past to avoid the possibility of multiple Clusters owning the same resources, but since this hampers interoperability, non-controller ownerReferences should be used instead, and the verification that a resource isn't "owned" by another Cluster should be handled separately.
What did you expect to happen:
Crossplane should be able to create a Cluster API Cluster
Anything else you would like to add:
Anywhere that Cluster API controllers are creating resources should still continue to use controller references.
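For illustration, the difference being asked for is only in the controller/blockOwnerDeletion fields of the ownerReference. A rough Go sketch (the APIVersion is real; the name and UID are made up):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
)

func main() {
	// A "controller" reference: at most one of these is allowed per object,
	// and controllers such as Crossplane's composition engine refuse to adopt
	// an object whose controller reference already points somewhere else.
	controllerRef := metav1.OwnerReference{
		APIVersion:         "cluster.x-k8s.io/v1beta1",
		Kind:               "Cluster",
		Name:               "my-cluster",
		UID:                "1234", // illustrative; would be the Cluster's real UID
		Controller:         pointer.Bool(true),
		BlockOwnerDeletion: pointer.Bool(true),
	}

	// A plain owner reference: still drives garbage collection and lets
	// controllers tie the object back to the Cluster, but any number of
	// owners can coexist on the same object.
	ownerRef := metav1.OwnerReference{
		APIVersion: "cluster.x-k8s.io/v1beta1",
		Kind:       "Cluster",
		Name:       "my-cluster",
		UID:        "1234",
	}

	fmt.Printf("controller ref: %+v\nplain owner ref: %+v\n", controllerRef, ownerRef)
}
```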
/kind bug
IIRC (although this was long ago) we've only used controller-type owner references where the lifecycle of that resource is managed by a Cluster API controller, while on the other resources we add the Cluster owner reference (or the parent) to make sure we have a way to tie things up together.
Would you be able to give more details on what the error is? Owner references are also used for clusterctl move as of today, and it seems that semantically that's the right place to understand ownership.
Regarding the Crossplane integration, I'm curious to understand how that would work 😄 given that we've had some similar ideas in mind for the infrastructure providers, is that something you could share?
@vincepri I can give some more insight here, and would also love to attend a community meeting to present some ideas / demos. CAPI and Crossplane can interact at two main levels:
- CAPI could rely on Crossplane providers to supply the backing infrastructure for cloud providers as Kubernetes objects. Crossplane providers are similar to the CAPI infrastructure providers except they are more general purpose and support all types across a provider rather than just those needed to spin up a k8s cluster.
- Crossplane can orchestrate CAPI types using composition. You can see an extremely rough sketch of what this could look like in a demo we are putting together. This is the area in which we are running into issues as Crossplane's composition engine adds some controller references to the objects it composes, which CAPI is relying on controlling for its operations. We specifically hit this with the PacketCluster needing to be controlled by the CAPI Cluster type.
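To make the failure mode concrete, this is roughly what happens when two controllers both try to claim the single controller reference, sketched with controller-runtime's helper. The ConfigMap stand-ins and names are only for illustration; neither project literally does this internally:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/scheme"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

func main() {
	// Two would-be controllers (stand-ins for a Crossplane composite and a
	// CAPI Cluster; ConfigMaps are used only to keep the sketch self-contained).
	ownerA := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: "crossplane-composite", UID: "uid-a"}}
	ownerB := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: "capi-cluster", UID: "uid-b"}}
	child := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: "packetcluster-standin"}}

	// The first controller reference is accepted.
	if err := controllerutil.SetControllerReference(ownerA, child, scheme.Scheme); err != nil {
		fmt.Println("unexpected:", err)
	}

	// A second controller reference from a different owner is rejected:
	// controller-runtime enforces "at most one controller" per object.
	if err := controllerutil.SetControllerReference(ownerB, child, scheme.Scheme); err != nil {
		fmt.Println("conflict:", err) // AlreadyOwnedError
	}
}
```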
The Cluster is kind of an interesting case because I would typically not put controller references on a user-created object (i.e. the PacketCluster is part of the CAPI quickstart output that a user applies, or in this case, Crossplane applies), but I understand the rationale for it here. The controller refs are just one example of how Crossplane and CAPI can encounter situations where they don't play nicely, but I was actually pleasantly surprised that there were fewer conflicts than I was expecting.
I think a broader discussion about how CAPI and Crossplane can integrate would be beneficial here, and I am more than happy to create some content that can be presented. In many ways, the components of CAPI are similar to many Crossplane providers:
- The Kubeadm bootstrap provider follows a similar model to projects like provider-helm and provider-sql
- The various infrastructure providers are similar to their Crossplane counterparts (e.g. provider-gcp, provider-aws, etc.)
- The core CAPI functionality is akin to a more opinionated Crossplane composition engine
/milestone v0.4.0 given that some investigation work is going on
@hasheddan we are trying to close API changes for the v1alpha4 version. Are there any actionable outcomes from your investigation?
/priority important-soon
@fabriziopandini yes, there has been significant work around using crossplane as a drop-in replacement for types in cluster-api-provider-gcp. The work is being tracked in https://github.com/crossplane/provider-gcp/issues/303 and most of it has taken place on livestreams that have been recorded. I would love to give a demo to the CAPI community soon (potentially next week) -- what would you recommend as the best forum for that?
Note that I believe that any crossplane integration will not need to block v1alpha4 API changes as one of the benefits of the model is that we can mold the composite resources (which are the drop-in types for existing CAPI provider types) to fit any API.
/cc @vincepri
I think we can delay a change in references until we find a suitable approach that works across all the proposed integrations; for now we can possibly use what's in #4135 to enable a crossplane integration
/milestone Next
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
i'd like to reopen this issue as originally titled; it had deviated into a discussion about a crossplane integration but i think it's more general.
above @vincepri wrote:
IIRC (although this was long ago) we've only used controller-type owner references where the lifecycle of that resource is managed by a Cluster API controller, while on the other resources we add the Cluster owner reference (or the parent) to make sure we have a way to tie things up together.
from looking at the code i think this is almost correct (granted it was a few years ago!) except for one thing: when the Cluster owner reference is added, it is actually a controller-type ref that's being added. see here: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/cluster/cluster_controller_phases.go#L95-L98 - this is when reconciling objects such as AWSCluster; it adds a controller ref linking it to the Cluster resource. later, in the AWS provider, it checks for that link in order to start doing the infra work: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/controllers/awscluster_controller.go#L146 - if you dig into that function call you will see that it only checks for an ownerReference, not a controller ref.
i would like to see cluster-api only create controller refs for resources it creates (e.g. the AWS provider creating machines from a machine deployment), but use only owner refs for everything else where it needs to tie "things up together".
my use case, for context, is very similar to the OP except instead of crossplane it's metacontroller. it's best-practice to take a controller ref of resources your controller creates and that's what metacontroller and crossplane are doing. in my case i am creating the following resources from a metacontroller operator: cluster, aws cluster, machine template, machine deployment, k3s controlplane, k3s config. metacontroller takes a controller ref which means that capi can't take one and the AWS provider won't start the infra work. i forked metacontroller to make setting controller: true in the ownerReference configurable and it works, but i think the problem is actually cluster-api side and i don't expect them to take that PR.
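to make the ask concrete, the swap i'm proposing is roughly the difference between these two controller-runtime helpers. an illustrative sketch only, not the actual cluster-api code (the function name is made up):

```go
package example

import (
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// linkInfraToCluster ties an externally-created infra object (e.g. an
// AWSCluster applied by metacontroller or Crossplane) back to the Cluster.
func linkInfraToCluster(cluster, infraCluster client.Object, scheme *runtime.Scheme) error {
	// today: a controller reference is set, which conflicts with whatever
	// tool created the object already holding controller: true:
	//   return controllerutil.SetControllerReference(cluster, infraCluster, scheme)

	// proposed: a plain owner reference still supports clusterctl move,
	// garbage collection and "tying things up together", but leaves the
	// single controller slot free for the creating tool.
	return controllerutil.SetOwnerReference(cluster, infraCluster, scheme)
}
```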
@lukebond: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@CecileRobertMichon: Reopened this issue.
/remove-lifecycle rotten
Sounds reasonable to me at a first glance.
I think we would have to differentiate between "standalone" clusters and clusters using ClusterClass. With ClusterClass, the CAPI controller is actually creating the infra cluster, control plane and MachineDeployments as well.
cc @killianmuldoon given recent work on ownerRefs
Revisiting this seems reasonable to me. IIRC the rationale behind using controller: true originally was mainly to prevent multiple Clusters from fighting over the "descendant" resources, e.g. MachineDeployments.
Also pending having better docs https://github.com/kubernetes-sigs/cluster-api/issues/5487
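If controller: true were dropped, that "prevent multiple Clusters from fighting" guarantee would need an explicit check instead. A rough sketch of what such a check could look like (a hypothetical helper and placement, not existing CAPI code):

```go
package util // hypothetical location, not existing CAPI code

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// errIfOwnedByOtherCluster refuses adoption if obj already carries an owner
// reference to a different Cluster, preserving the "only one Cluster may own
// this" guarantee without claiming the single controller slot.
func errIfOwnedByOtherCluster(obj metav1.Object, cluster *clusterv1.Cluster) error {
	for _, ref := range obj.GetOwnerReferences() {
		if ref.Kind != "Cluster" {
			continue
		}
		gv, err := schema.ParseGroupVersion(ref.APIVersion)
		if err != nil || gv.Group != clusterv1.GroupVersion.Group {
			continue
		}
		if ref.Name != cluster.Name || (ref.UID != "" && ref.UID != cluster.UID) {
			return fmt.Errorf("%s/%s is already owned by Cluster %q", obj.GetNamespace(), obj.GetName(), ref.Name)
		}
	}
	return nil
}
```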
Just another data point on this: the code base is full of assumptions about OwnerReferences being in place, and unfortunately there is no good documentation linking owners with those assumptions (AFAIK the only exception is https://cluster-api.sigs.k8s.io/clusterctl/provider-contract.html#ownerreferences-chain, which explicitly documents why it relies on this info).
As a consequence, changing owner references should not be taken lightly, because it can lead to unexpected problems all around. We should figure out how to prevent regressions and how to communicate the change to all the Cluster API consumers that could have similar assumptions in their code.
Agree. I think we first should "lock down" and document the current behavior (which is what Killian was and is working on, e.g. also with: https://github.com/kubernetes-sigs/cluster-api/pull/7606, not sure if the ownerRef being a controller ref or not is part of that)
Then we can evolve it from there.
/triage accepted /remove-priority important-soon /priority backlog
/assign
I'll come back to this, organise documentation, and see if there are places where we can move controller references to non-exclusive OwnerRefs.
We've now merged documentation and tests to cover the current state of ownerReferences: https://github.com/kubernetes-sigs/cluster-api/pull/9153
I probably don't have time to follow up on the second part of this issue - i.e. deciding which ownerReferences should remain controller refs and which should not - so I'm unassigning myself for now.
/unassign
/help
@killianmuldoon: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
Still something that would be nice to nail down, if bandwidth allows. /triage accepted