Cluster API should not use controller refs for resources that it does not create
What steps did you take and what happened:
Attempting to create a Cluster using an external tool such as Crossplane causes issues, because both CAPI and Crossplane want to hold the single controller reference on the resources Crossplane creates.
We've used controller refs in the past to avoid the possibility of multiple Clusters owning the same resources, but since this hampers interoperability, non-controller ownerReferences should be used instead, and the verification that a resource isn't "owned" by another Cluster should be handled separately.
What did you expect to happen:
Crossplane should be able to create a Cluster API Cluster
Anything else you would like to add:
Anywhere that Cluster API controllers are creating resources should still continue to use controller references.
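For illustration, the difference being asked for is only in the controller/blockOwnerDeletion fields of the ownerReference. A rough Go sketch (the APIVersion is real; the name and UID are made up):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
)

func main() {
	// A "controller" reference: at most one of these is allowed per object,
	// and controllers such as Crossplane's composition engine refuse to adopt
	// an object whose controller reference already points somewhere else.
	controllerRef := metav1.OwnerReference{
		APIVersion:         "cluster.x-k8s.io/v1beta1",
		Kind:               "Cluster",
		Name:               "my-cluster",
		UID:                "1234", // illustrative; would be the Cluster's real UID
		Controller:         pointer.Bool(true),
		BlockOwnerDeletion: pointer.Bool(true),
	}

	// A plain owner reference: still drives garbage collection and lets
	// controllers tie the object back to the Cluster, but any number of
	// owners can coexist on the same object.
	ownerRef := metav1.OwnerReference{
		APIVersion: "cluster.x-k8s.io/v1beta1",
		Kind:       "Cluster",
		Name:       "my-cluster",
		UID:        "1234",
	}

	fmt.Printf("controller ref: %+v\nplain owner ref: %+v\n", controllerRef, ownerRef)
}
```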
/kind bug
IIRC (although this was long ago) we've only used controller-type owner references where the lifecycle of that resource is managed by a Cluster API controller, while on the other resources we add the Cluster owner reference (or the parent) to make sure we have a way to tie things up together.
Would you be able to give more details on what the error is? Owner references are also used for clusterctl move as of today, and it seems that semantically that's the right place to understand ownership.
Regarding the Crossplane integration, I'm curious to understand how that would work 😄 given that we've had some similar ideas in mind for the infrastructure providers, is that something you could share?
@vincepri I can give some more insight here, and would also love to attend a community meeting to present some ideas / demos. CAPI and Crossplane can interact at two main levels:
- CAPI could rely on Crossplane providers to supply the backing infrastructure for cloud providers as Kubernetes objects. Crossplane providers are similar to the CAPI infrastructure providers except they are more general purpose and support all types across a provider rather than just those needed to spin up a k8s cluster.
- Crossplane can orchestrate CAPI types using composition. You can see an extremely rough sketch of what this could look like in a demo we are putting together. This is the area in which we are running into issues as Crossplane's composition engine adds some controller references to the objects it composes, which CAPI is relying on controlling for its operations. We specifically hit this with the PacketCluster needing to be controlled by the CAPI Cluster type.
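To make the failure mode concrete, this is roughly what happens when two controllers both try to claim the single controller reference, sketched with controller-runtime's helper. The ConfigMap stand-ins and names are only for illustration; neither project literally does this internally:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/scheme"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

func main() {
	// Two would-be controllers (stand-ins for a Crossplane composite and a
	// CAPI Cluster; ConfigMaps are used only to keep the sketch self-contained).
	ownerA := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: "crossplane-composite", UID: "uid-a"}}
	ownerB := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: "capi-cluster", UID: "uid-b"}}
	child := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{Name: "packetcluster-standin"}}

	// The first controller reference is accepted.
	if err := controllerutil.SetControllerReference(ownerA, child, scheme.Scheme); err != nil {
		fmt.Println("unexpected:", err)
	}

	// A second controller reference from a different owner is rejected:
	// controller-runtime enforces "at most one controller" per object.
	if err := controllerutil.SetControllerReference(ownerB, child, scheme.Scheme); err != nil {
		fmt.Println("conflict:", err) // AlreadyOwnedError
	}
}
```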
The Cluster is kind of an interesting case because I would typically not put controller references on a user-created object (i.e. the PacketCluster is part of the CAPI quickstart output that a user applies, or in this case, Crossplane applies), but I understand the rationale for it here. The controller refs are just one example of how Crossplane and CAPI can encounter situations where they don't play nicely, but I was actually pleasantly surprised that there were fewer conflicts than I was expecting.
I think a broader discussion about how CAPI and Crossplane can integrate would be beneficial here, and I am more than happy to create some content that can be presented. In many ways, the components of CAPI are similar to many Crossplane providers:
- The Kubeadm bootstrap provider follows a similar model to projects like provider-helm and provider-sql
- The various infrastructure providers are similar to their Crossplane counterparts (e.g. provider-gcp, provider-aws, etc.)
- The core CAPI functionality is akin to a more opinionated Crossplane composition engine
/milestone v0.4.0 given that some investigation work is going on
@hasheddan we are trying to close API changes for the v1alpha4 version. Are there any actionable outcomes from your investigation?
/priority important-soon
@fabriziopandini yes, there has been significant work around using crossplane as a drop-in replacement for types in cluster-api-provider-gcp. The work is being tracked in https://github.com/crossplane/provider-gcp/issues/303 and most of it has taken place on livestreams that have been recorded. I would love to give a demo to the CAPI community soon (potentially next week) -- what would you recommend as the best forum for that?
Note that I believe that any crossplane integration will not need to block v1alpha4 API changes as one of the benefits of the model is that we can mold the composite resources (which are the drop-in types for existing CAPI provider types) to fit any API.
/cc @vincepri
I think we can delay a change in references until we find a suitable approach that works across all the proposed integrations; for now we can possibly use what's in #4135 to enable a crossplane integration
/milestone Next
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
i'd like to reopen this issue as originally titled; it had deviated into a discussion about a crossplane integration but i think it's more general.
above @vincepri wrote:
IIRC (although this was long ago) we've only used controller-type owner references where the lifecycle of that resource is managed by a Cluster API controller, while on the other resources we add the Cluster owner reference (or the parent) to make sure we have a way to tie things up together.
from looking at the code i think this is almost correct (granted it was a few years ago!) except for one thing: when the Cluster owner reference is added, it is actually a controller-type ref that's being added. see here: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/cluster/cluster_controller_phases.go#L95-L98 - this is when reconciling objects such as AWSCluster; it adds a controller ref linking it to the Cluster resource. later, in the AWS provider, it checks for that link in order to start doing the infra work: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/controllers/awscluster_controller.go#L146 - if you dig into that function call you will see that it only checks for an ownerReference, not a controller ref.
i would like to see cluster-api only create controller refs for resources it creates (e.g. the AWS provider creating machines from a machine deployment), but use only owner refs for everything else where it needs to tie "things up together".
my use case, for context, is very similar to the OP except instead of crossplane it's metacontroller. it's best-practice to take a controller ref of resources your controller creates and that's what metacontroller and crossplane are doing. in my case i am creating the following resources from a metacontroller operator: cluster, aws cluster, machine template, machine deployment, k3s controlplane, k3s config. metacontroller takes a controller ref which means that capi can't take one and the AWS provider won't start the infra work. i forked metacontroller to make setting controller: true in the ownerReference configurable and it works, but i think the problem is actually cluster-api side and i don't expect them to take that PR.
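to make the ask concrete, the swap i'm proposing is roughly the difference between these two controller-runtime helpers. an illustrative sketch only, not the actual cluster-api code (the function name is made up):

```go
package example

import (
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// linkInfraToCluster ties an externally-created infra object (e.g. an
// AWSCluster applied by metacontroller or Crossplane) back to the Cluster.
func linkInfraToCluster(cluster, infraCluster client.Object, scheme *runtime.Scheme) error {
	// today: a controller reference is set, which conflicts with whatever
	// tool created the object already holding controller: true:
	//   return controllerutil.SetControllerReference(cluster, infraCluster, scheme)

	// proposed: a plain owner reference still supports clusterctl move,
	// garbage collection and "tying things up together", but leaves the
	// single controller slot free for the creating tool.
	return controllerutil.SetOwnerReference(cluster, infraCluster, scheme)
}
```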
@lukebond: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@CecileRobertMichon: Reopened this issue.
/remove-lifecycle rotten
Sounds reasonable to me at a first glance.
I think we would have to differentiate between "standalone" clusters and clusters using ClusterClass. With ClusterClass, the CAPI controller is actually creating the infra cluster, control plane and MachineDeployments as well.
cc @killianmuldoon given recent work on ownerRefs
Revisiting this seems reasonable to me. IIRC the rationale behind using controller: true originally was mainly to prevent multiple Clusters from fighting over the "descendant" resources, e.g. MachineDeployments.
Also pending having better docs https://github.com/kubernetes-sigs/cluster-api/issues/5487
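If controller: true were dropped, that "prevent multiple Clusters from fighting" guarantee would need an explicit check instead. A rough sketch of what such a check could look like (a hypothetical helper and placement, not existing CAPI code):

```go
package util // hypothetical location, not existing CAPI code

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// errIfOwnedByOtherCluster refuses adoption if obj already carries an owner
// reference to a different Cluster, preserving the "only one Cluster may own
// this" guarantee without claiming the single controller slot.
func errIfOwnedByOtherCluster(obj metav1.Object, cluster *clusterv1.Cluster) error {
	for _, ref := range obj.GetOwnerReferences() {
		if ref.Kind != "Cluster" {
			continue
		}
		gv, err := schema.ParseGroupVersion(ref.APIVersion)
		if err != nil || gv.Group != clusterv1.GroupVersion.Group {
			continue
		}
		if ref.Name != cluster.Name || (ref.UID != "" && ref.UID != cluster.UID) {
			return fmt.Errorf("%s/%s is already owned by Cluster %q", obj.GetNamespace(), obj.GetName(), ref.Name)
		}
	}
	return nil
}
```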
Just another data point on this: the code base is full of assumptions about OwnerReferences being in place, and unfortunately there is no good documentation linking owners with those assumptions (AFAIK the only exception is https://cluster-api.sigs.k8s.io/clusterctl/provider-contract.html#ownerreferences-chain, which explicitly documents why it relies on this info).
As a consequence, changing owner references should not be taken lightly, because it can lead to unexpected problems all around. We should figure out how to prevent regressions and how to communicate the change to all the Cluster API consumers that could have similar assumptions in their code.
Agree. I think we first should "lock down" and document the current behavior (which is what Killian was and is working on, e.g. also with: https://github.com/kubernetes-sigs/cluster-api/pull/7606, not sure if the ownerRef being a controller ref or not is part of that)
Then we can evolve it from there.
/triage accepted /remove-priority important-soon /priority backlog
/assign
I'll come back to this, organise documentation, and see if there are places where we can move controller references to non-exclusive OwnerRefs.
We've now merged documentation and tests to cover the current state of ownerReferences: https://github.com/kubernetes-sigs/cluster-api/pull/9153
I probably don't have time to follow up on the second part of this issue - i.e. deciding which ownerReferences should remain controller refs and which should not - so I'm unassigning myself for now.
/unassign
/help
@killianmuldoon: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
Still something that would be nice to nail down, if bandwidth allows. /triage accepted