cluster-api
⚠️ Machine ProviderID equality is now strictly enforced
What this PR does / why we need it:
This PR updates the capi ProviderID equality check (the Equals pointer-receiver method) to validate against the entire provider ID string rather than a concatenation of the "cloud provider" and "ID" substrings.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4526
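To make the before/after semantics concrete, here is a minimal, self-contained Go sketch of the two comparisons. The type, field, and function names here are illustrative only, not the actual capi implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Toy model of a parsed provider ID; not the real capi type.
type providerID struct {
	original      string // e.g. "aws:///us-east-1a/i-0123456789"
	cloudProvider string // "aws"
	id            string // last path segment, e.g. "i-0123456789"
}

func parse(s string) providerID {
	cp := strings.SplitN(s, "://", 2)[0]
	segs := strings.Split(strings.Trim(s, "/"), "/")
	return providerID{original: s, cloudProvider: cp, id: segs[len(segs)-1]}
}

// legacyEquals models the old behavior: cloud provider + final ID only.
func legacyEquals(a, b providerID) bool {
	return a.cloudProvider == b.cloudProvider && a.id == b.id
}

// strictEquals models the behavior after this PR: the whole string.
func strictEquals(a, b providerID) bool {
	return a.original == b.original
}

func main() {
	x := parse("azure:///subscriptions/s/resourceGroups/rg/Microsoft.Compute/virtualMachineScaleSets/xxx/virtualMachines/1")
	y := parse("azure:///subscriptions/s/resourceGroups/rg/Microsoft.Compute/virtualMachineScaleSets/yyy/virtualMachines/1")
	// Different VMs, same trailing "1": legacy says equal, strict says not.
	fmt.Println(legacyEquals(x, y), strictEquals(x, y))
}
```

Under the legacy check the two VMSS provider IDs compare equal because both end in `1`; the strict check distinguishes them.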
@enxebre @alexeldeib is this PR getting close to addressing #4526? Obviously simply changing the capi equality implementation will require per-cloud-provider changes in their cluster-api providers in order to prevent breakage.
Is that what we want to advocate?
yeah this solves it. I actually don't know that it requires infra provider changes -- CAPI should not have been the one setting these values ideally, CCM or similar would be, so this PR just aligns CAPI to CCM's behavior.
put differently: in AWS one node may be represented with multiple providerIDs, but a concrete node will only ever have one form of that provider ID, so I feel this should be safe.
probably good to get some CAPA eyes on this and maybe something less cloud-y like CAPD?
@sedefsavas can you PTAL at this change and confirm what @alexeldeib is suggesting
thank you!
@CecileRobertMichon are you able to speak authoritatively on CAPD ramifications?
maybe something less cloud-y like CAPD?
CAPD runs the PR e2e tests so if tests are passing I feel good about this change. Let's run the full suite to be sure.
/test ls
@CecileRobertMichon: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:
/test pull-cluster-api-build-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-test-main
/test pull-cluster-api-test-mink8s-main
/test pull-cluster-api-verify-main
The following commands are available to trigger optional jobs:
/test pull-cluster-api-apidiff-main
/test pull-cluster-api-e2e-full-main
/test pull-cluster-api-e2e-informing-ipv6-main
/test pull-cluster-api-e2e-informing-main
/test pull-cluster-api-e2e-workload-upgrade-1-23-latest-main
Use /test all to run the following jobs that were automatically triggered:
pull-cluster-api-apidiff-main
pull-cluster-api-build-main
pull-cluster-api-e2e-informing-ipv6-main
pull-cluster-api-e2e-informing-main
pull-cluster-api-e2e-main
pull-cluster-api-test-main
pull-cluster-api-test-mink8s-main
pull-cluster-api-verify-main
In response to this:
maybe something less cloud-y like CAPD?
CAPD runs the PR e2e tests so if tests are passing I feel good about this change. Let's run the full suite to be sure.
/test ls
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-cluster-api-e2e-full-main
yeah this solves it. I actually don't know that it requires infra provider changes -- CAPI should not have been the one setting these values ideally, CCM or similar would be, so this PR just aligns CAPI to CCM's behavior.
CCM sets providerID on Nodes, but capi providers set providerID on Machines. So although not guaranteed, this change might well break providers, since it changes the equality contract. We should proceed taking that into consideration.
Also we should consistently update IndexKey().
@enxebre I'm not sure how IndexKey() is affected by this change
@enxebre I'm not sure how IndexKey() is affected by this change
// This is useful to use the providerID as a reliable index between nodes and machines
// as it guarantees the infra Providers contract.
@jackfrancis IndexKey() was intentionally sticking to the equality contract, which we are changing now so it should be updated with it for consistency. Need to think about impact on running clusters though, I guess it should be fine as the controller is restarted and will just reindex using the new criteria.
I think there's more background here that I'm not aware of, but from a naive point of view IndexKey() doesn't have any value if it simply returns the entire string (it will then be equivalent to the CloudProvider() method which just thinly wraps the cloudProvider string property)
indexKey is just an abstraction used to index (and fetch) machines/nodes by providerID in the controllers' indexers. It must be unique, so it should use whatever we consider to be the equality contract; otherwise we could end up with one index mapping to multiple machines/nodes. If the impl of indexKey happens to be a thin wrapper, that's fine.
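As a hedged illustration of the uniqueness point (the names below are hypothetical, not the actual capi code): if the index key uses a looser criterion than equality, two distinct machines can collide on one key.

```go
package main

import "fmt"

// Illustrative model only: the index key must use the same criterion
// as equality, otherwise one key can map to multiple distinct machines.
type providerID struct {
	original      string
	cloudProvider string
	id            string
}

// Old-style key: cloud provider + final ID — collides for VMSS instances.
func legacyIndexKey(p providerID) string { return p.cloudProvider + "/" + p.id }

// New-style key: the full string, mirroring the updated equality contract.
func strictIndexKey(p providerID) string { return p.original }

func main() {
	a := providerID{original: "azure:///scalesets/xxx/virtualMachines/1", cloudProvider: "azure", id: "1"}
	b := providerID{original: "azure:///scalesets/yyy/virtualMachines/1", cloudProvider: "azure", id: "1"}
	fmt.Println(legacyIndexKey(a) == legacyIndexKey(b)) // two machines, one key
	fmt.Println(strictIndexKey(a) == strictIndexKey(b)) // keys stay unique
}
```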
@enxebre got it, updated
CCM sets providerID on Nodes, but capi providers set providerID on Machines. So although not guaranteed, this change might well break providers, since it changes the equality contract. We should proceed taking that into consideration.
Hm. That would require CCM and CAPI to disagree on what the providerID is for a given node, though? I suppose it could be something like -- CAPA sets provider ID using AZ, CCM doesn't, current logic evaluates those as the same node even though the strings are different? This is sort of why I wanted a sanity check on other providers.
That would seem to require CAPI providers to be mutating the provider ID they get back from cloud providers (or perhaps CCM doing the same), which would be a world of hurt regardless of the contract.
At a glance, seems like CAPA just reads whatever EC2 returns, which is roughly the same as Azure. Not sure how CCM for AWS generates provider IDs, but I'd hope it's the same. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/7a545cd1aa5d05e5a0364a49b3c700341737ca61/controllers/awsmachine_controller.go#L510-L511
Definitely want to make sure we get this lined up right either way.
/test pull-cluster-api-e2e-full-main
Hm. That would require CCM and CAPI to disagree on what the providerID is for a given node, though? I suppose it could be something like -- CAPA sets provider ID using AZ, CCM doesn't, current logic evaluates those as the same node even though the strings are different? This is sort of why I wanted a sanity check on other providers.
@alexeldeib That's exactly right, so since we are changing the equality criteria this could technically break any provider out there. So at the very minimum this would require a note for providers implementers.
Also I agree with @killianmuldoon https://github.com/kubernetes-sigs/cluster-api/pull/6412#discussion_r849271035 we need to deprecate existing behaviour first as this is a public func.
/test pull-cluster-api-e2e-full-main
Correction to my comment above, the IndexKey() has now been made equivalent to String() (not CloudProvider()).
That change has been realized in the PR now, tests look good.
I think at this point we need to evaluate the potential for provider breakage, and if we can convince ourselves that this is actually fixing a capi bug based on the way that providerID URLs are set from their real-world authoritative sources, then we just want to make a communication plan so that any providers who need to update their code have time to do so.
/hold
I'm quite sure we had this behavior back in the v1alpha1 days and we had to change for something more structured given that some Cluster API providers can only set (guess?) a partial provider ID. Looking through the Cluster API AWS provider code, we're still doing that today https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/760e4e7651f46c29bf309607a0b6307570228988/pkg/cloud/scope/machine.go#L144-L148
Unless we have a contract requirement for every cloud provider to match their equivalent CCM/CPI codebase, we should keep this behavior and introduce a new one over time.
I might have missed this while going through the above messages, what is the problem we're trying to solve with this PR?
what is the problem we're trying to solve with this PR
The contract we rely on for provider ID comparison in CAPI is broken for Azure scale sets. The problematic logic is the logic in CAPI itself, not anything CAPZ* does -- we set provider ID in line with CCM, and there would be no other reasonable way to differentiate these machines like using AZ in AWS.
See https://github.com/kubernetes-sigs/cluster-api/issues/4526
In the existing providerID logic assumptions, cloudProvider and ID is what is used to compare, skipping the other segments of the providerID. See https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/noderefutil/providerid.go#L86
This makes the assumption - not necessarily true - that IDs in different regions/zones won't be reused by the cloud provider.
and yes, it's been there for a while. MachinePools came later, this issue does not exist for standalone VMs. See https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1503 as well.
The part about the ID in different AZ being different seems fair to me, it has been a known issue for a while. Do we have any valid alternative of comparison we can make for these IDs? For example, if we were to start to take into account the AvailabilityZone/FailureDomain/Region, would that suffice?
for some clarity -- with the current code the provider ID check in CAPI returns true when two VMs have IDs like:
/subscriptions/xxx/resourceGroups/xxx/Microsoft.Compute/virtualMachineScaleSets/xxx/virtualMachines/1
/subscriptions/xxx/resourceGroups/xxx/Microsoft.Compute/virtualMachineScaleSets/yyy/virtualMachines/1
even though these are different machines.
I guess you could try doing both checks, but if you ever fail the stricter one (e.g. node doesn't exist yet), you can still pass the "less strict" one (using the example above, if yyy/1 didn't exist yet but that's the machine we were looking for).
You could split on more tokens, but I don't think that works for other clouds.
e: ah, timing. I think the example should answer your q?
I think so, thanks for the example!
I vaguely remember at the time that providerIDs had some sort of structure, where the unique identifier was supposed to be at the end, but if this is not the case I think it might make sense to start requiring cloud providers to form an ID exactly as their CCM counterpart would.
Changing the behavior is a breaking change, which we should introduce over time. If we can find a way, as you suggested, to keep the current behavior and potentially alert on discrepancies, we can avoid breaking current clusters while still working towards the end goal.
cc @yastij @sedefsavas @srm09
Let's enqueue this discussion topic for this week's community meeting as well, and make sure to reach out to cloud providers and mailing list to align providerID generation.
Thoughts?
we had to change for something more structured given that some Cluster API providers can only set (guess?) a partial provider ID. Looking through the Cluster API AWS provider code, we're still doing that today https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/760e4e7651f46c29bf309607a0b6307570228988/pkg/cloud/scope/machine.go#L144-L148
Wouldn't this new full check still work for AWS though? Since cloud provider also sets the same "partial" provider ID with zone and instance ID: https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v2/instances.go#L246
Unless we have a contract requirement for every cloud provider to match their equivalent CCM/CPI codebase, we should keep this behavior and introduce a new one over time.
Is there any valid use case for a provider ever setting a different provider ID on Nodes and Machines? IMO it makes more sense to enforce that contract of matching Machine Provider IDs with Node Provider IDs than breaking some providers like Azure because of assumptions made a long time ago based on how some providers format their provider IDs.
@CecileRobertMichon I think we're aligned on enforcing the contract; for AWS specifically it should work yes, what I'd like to propose is to introduce the change described here incrementally given that there might be infrastructure providers that are relying on the current behavior. Given that this is a breaking change, would we agree to:
- Discuss the current issue and proposed solution at the Cluster API community meeting
- Send out an email to the mailing list with the proposed changes + an example
- Set out a deprecation policy for the fallback behavior
Thoughts?
potentially alert on discrepancies
I particularly like this. Might be tricky with a webhook given the cross-object (cross-cluster? mgmt <-> workload) validation, but perhaps a condition + some sort of quick validation with kubectl would be nice.
Will try to avoid suggesting much else in the weeds for now, but I think that could provide a nice path for e.g. new behavior opt-in -> you have a way to ensure you won't break -> new behavior by default (or something).
@jackfrancis do you have the bandwidth to take the steps outlined by Vince above?
@vincepri @alexeldeib can we re-clarify what the "fallback behavior" is exactly? Reading through the above thread, it seems pretty clear that the current state of the changeset in this PR does not include any fallback.
@jackfrancis A fallback could be defined as @alexeldeib pointed out above:
I guess you could try doing both checks, but if you ever fail the stricter one (e.g. node doesn't exist yet), you can still pass the "less strict" one (using the example above, if yyy/1 didn't exist yet but that's the machine we were looking for).
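For illustration only, the two-phase check being quoted could be sketched as below. The function names are hypothetical, not capi code, and as the surrounding discussion notes this approach carries ambiguity risk:

```go
package main

import (
	"fmt"
	"strings"
)

// lastSegment returns the trailing path segment of a provider ID.
func lastSegment(s string) string {
	segs := strings.Split(s, "/")
	return segs[len(segs)-1]
}

// cloudProvider returns the scheme portion before "://".
func cloudProvider(s string) string {
	return strings.SplitN(s, "://", 2)[0]
}

// equalsWithFallback tries the strict full-string comparison first,
// then falls back to the legacy cloudProvider+ID check.
func equalsWithFallback(a, b string) bool {
	if a == b { // strict: entire provider ID matches
		return true
	}
	// legacy: same cloud provider and same trailing ID segment
	return cloudProvider(a) == cloudProvider(b) && lastSegment(a) == lastSegment(b)
}

func main() {
	// Matches via the legacy fallback even though the zones differ,
	// illustrating exactly the ambiguity the discussion worries about.
	fmt.Println(equalsWithFallback(
		"aws:///us-east-1a/i-abc",
		"aws:///us-east-1b/i-abc",
	))
}
```

The problem called out in the thread is visible here: a lookup that fails the strict check for a not-yet-existing machine can still pass the loose one against the wrong machine.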
That said, I think given that this is a breaking change and there could be undefined behaviors by running both tests, we should probably just run a big warning in the release notes and make sure providers that are affected update their code appropriately.
/hold cancel