cluster-api
⚠️ Machine ProviderID equality is now strictly enforced
What this PR does / why we need it:
This PR updates the capi ProviderID equality check (the Equals pointer-receiver method) to validate against the entire provider ID string rather than a concatenation of the "cloud provider" and "ID" substrings.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4526
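To make the before/after semantics concrete, here is a minimal, self-contained Go sketch of the two comparisons. The type, field, and function names here are illustrative only, not the actual capi implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Toy model of a parsed provider ID; not the real capi type.
type providerID struct {
	original      string // e.g. "aws:///us-east-1a/i-0123456789"
	cloudProvider string // "aws"
	id            string // last path segment, e.g. "i-0123456789"
}

func parse(s string) providerID {
	cp := strings.SplitN(s, "://", 2)[0]
	segs := strings.Split(strings.Trim(s, "/"), "/")
	return providerID{original: s, cloudProvider: cp, id: segs[len(segs)-1]}
}

// legacyEquals models the old behavior: cloud provider + final ID only.
func legacyEquals(a, b providerID) bool {
	return a.cloudProvider == b.cloudProvider && a.id == b.id
}

// strictEquals models the behavior after this PR: the whole string.
func strictEquals(a, b providerID) bool {
	return a.original == b.original
}

func main() {
	x := parse("azure:///subscriptions/s/resourceGroups/rg/Microsoft.Compute/virtualMachineScaleSets/xxx/virtualMachines/1")
	y := parse("azure:///subscriptions/s/resourceGroups/rg/Microsoft.Compute/virtualMachineScaleSets/yyy/virtualMachines/1")
	// Different VMs, same trailing "1": legacy says equal, strict says not.
	fmt.Println(legacyEquals(x, y), strictEquals(x, y))
}
```

Under the legacy check the two VMSS provider IDs compare equal because both end in `1`; the strict check distinguishes them.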
@enxebre @alexeldeib is this PR getting close to addressing #4526? Obviously simply changing the capi equality implementation will require per-cloud-provider changes in their cluster-api providers in order to prevent breakage.
Is that what we want to advocate?
yeah this solves it. I actually don't know that it requires infra provider changes -- CAPI should not have been the one setting these values ideally, CCM or similar would be, so this PR just aligns CAPI to CCM's behavior.
put differently: in AWS one node may be represented with multiple providerIDs, but a concrete node will only ever have one form of that provider ID, so I feel this should be safe.
probably good to get some CAPA eyes on this and maybe something less cloud-y like CAPD?
@sedefsavas can you PTAL at this change and confirm what @alexeldeib is suggesting
thank you!
@CecileRobertMichon are you able to speak authoritatively on CAPD ramifications?
maybe something less cloud-y like CAPD?
CAPD runs the PR e2e tests so if tests are passing I feel good about this change. Let's run the full suite to be sure.
/test ls
@CecileRobertMichon: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:
/test pull-cluster-api-build-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-test-main
/test pull-cluster-api-test-mink8s-main
/test pull-cluster-api-verify-main
The following commands are available to trigger optional jobs:
/test pull-cluster-api-apidiff-main
/test pull-cluster-api-e2e-full-main
/test pull-cluster-api-e2e-informing-ipv6-main
/test pull-cluster-api-e2e-informing-main
/test pull-cluster-api-e2e-workload-upgrade-1-23-latest-main
Use /test all to run the following jobs that were automatically triggered:
pull-cluster-api-apidiff-main
pull-cluster-api-build-main
pull-cluster-api-e2e-informing-ipv6-main
pull-cluster-api-e2e-informing-main
pull-cluster-api-e2e-main
pull-cluster-api-test-main
pull-cluster-api-test-mink8s-main
pull-cluster-api-verify-main
In response to this:
maybe something less cloud-y like CAPD?
CAPD runs the PR e2e tests so if tests are passing I feel good about this change. Let's run the full suite to be sure.
/test ls
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-cluster-api-e2e-full-main
yeah this solves it. I actually don't know that it requires infra provider changes -- CAPI should not have been the one setting these values ideally, CCM or similar would be, so this PR just aligns CAPI to CCM's behavior.
CCM sets providerID on Nodes, but capi providers set providerID on Machines. So although not guaranteed, this change might well break providers, since it changes the equality contract. We should proceed taking that into consideration.
Also we should consistently update IndexKey().
@enxebre I'm not sure how IndexKey() is affected by this change
@enxebre I'm not sure how IndexKey() is affected by this change
// This is useful to use the providerID as a reliable index between nodes and machines
// as it guarantees the infra Providers contract.
@jackfrancis IndexKey() was intentionally sticking to the equality contract, which we are changing now so it should be updated with it for consistency. Need to think about impact on running clusters though, I guess it should be fine as the controller is restarted and will just reindex using the new criteria.
I think there's more background here that I'm not aware of, but from a naive point of view IndexKey() doesn't have any value if it simply returns the entire string (it will then be equivalent to the CloudProvider() method which just thinly wraps the cloudProvider string property)
indexKey is just an abstraction used to index (and fetch) machines/nodes by providerID in the controllers' indexers. It must be unique, so it should use whatever we consider to be the equality contract; otherwise we could end up with one index mapping to multiple machines/nodes. If the impl of indexKey happens to be a thin wrapper, that's fine.
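As a hedged illustration of the uniqueness point (the names below are hypothetical, not the actual capi code): if the index key uses a looser criterion than equality, two distinct machines can collide on one key.

```go
package main

import "fmt"

// Illustrative model only: the index key must use the same criterion
// as equality, otherwise one key can map to multiple distinct machines.
type providerID struct {
	original      string
	cloudProvider string
	id            string
}

// Old-style key: cloud provider + final ID — collides for VMSS instances.
func legacyIndexKey(p providerID) string { return p.cloudProvider + "/" + p.id }

// New-style key: the full string, mirroring the updated equality contract.
func strictIndexKey(p providerID) string { return p.original }

func main() {
	a := providerID{original: "azure:///scalesets/xxx/virtualMachines/1", cloudProvider: "azure", id: "1"}
	b := providerID{original: "azure:///scalesets/yyy/virtualMachines/1", cloudProvider: "azure", id: "1"}
	fmt.Println(legacyIndexKey(a) == legacyIndexKey(b)) // two machines, one key
	fmt.Println(strictIndexKey(a) == strictIndexKey(b)) // keys stay unique
}
```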
@enxebre got it, updated
CCM sets providerID on Nodes, but capi providers set providerID on Machines. So although not guaranteed, this change might well break providers, since it changes the equality contract. We should proceed taking that into consideration.
Hm. That would require CCM and CAPI to disagree on what the providerID is for a given node, though? I suppose it could be something like -- CAPA sets provider ID using AZ, CCM doesn't, current logic evaluates those as the same node even though the strings are different? This is sort of why I wanted a sanity check on other providers.
That would seem to require CAPI providers to be mutating the provider ID they get back from cloud providers (or perhaps CCM doing the same), which would be a world of hurt regardless of the contract.
At a glance, seems like CAPA just reads whatever EC2 returns, which is roughly the same as Azure. Not sure how CCM for AWS generates provider IDs, but I'd hope it's the same. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/7a545cd1aa5d05e5a0364a49b3c700341737ca61/controllers/awsmachine_controller.go#L510-L511
Definitely want to make sure we get this lined up right either way.
/test pull-cluster-api-e2e-full-main
Hm. That would require CCM and CAPI to disagree on what the providerID is for a given node, though? I suppose it could be something like -- CAPA sets provider ID using AZ, CCM doesn't, current logic evaluates those as the same node even though the strings are different? This is sort of why I wanted a sanity check on other providers.
@alexeldeib That's exactly right, so since we are changing the equality criteria this could technically break any provider out there. So at the very minimum this would require a note for providers implementers.
Also I agree with @killianmuldoon https://github.com/kubernetes-sigs/cluster-api/pull/6412#discussion_r849271035 we need to deprecate existing behaviour first as this is a public func.
/test pull-cluster-api-e2e-full-main
Correction to my comment above, the IndexKey() has now been made equivalent to String() (not CloudProvider()).
That change has been realized in the PR now, tests look good.
I think at this point we need to evaluate the potential for provider breakage, and if we can convince ourselves that this is actually fixing a capi bug based on the way that providerID URLs are set from their real-world authoritative sources, then we just want to make a communication plan so that any providers who need to update their code have time to do so.
/hold
I'm quite sure we had this behavior back in the v1alpha1 days and we had to change for something more structured given that some Cluster API providers can only set (guess?) a partial provider ID. Looking through the Cluster API AWS provider code, we're still doing that today https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/760e4e7651f46c29bf309607a0b6307570228988/pkg/cloud/scope/machine.go#L144-L148
Unless we have a contract requirement for every cloud provider to match their equivalent CCM/CPI codebase, we should keep this behavior and introduce a new one over time.
I might have missed this while going through the above messages, what is the problem we're trying to solve with this PR?
what is the problem we're trying to solve with this PR
The contract we rely on for provider ID comparison in CAPI is broken for Azure scale sets. The problematic logic is the logic in CAPI itself, not anything CAPZ* does -- we set provider ID in line with CCM, and there would be no other reasonable way to differentiate these machines like using AZ in AWS.
See https://github.com/kubernetes-sigs/cluster-api/issues/4526
In the existing providerID logic assumptions, cloudProvider and ID is what is used to compare, skipping the other segments of the providerID. See https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/noderefutil/providerid.go#L86
This makes the assumption - not necessarily true - that IDs in different regions/zones won't be reused by the cloud provider.
and yes, it's been there for a while. MachinePools came later, this issue does not exist for standalone VMs. See https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1503 as well.
The part about the ID in different AZ being different seems fair to me, it has been a known issue for a while. Do we have any valid alternative of comparison we can make for these IDs? For example, if we were to start to take into account the AvailabilityZone/FailureDomain/Region, would that suffice?
for some clarity -- with the current code the provider ID check in CAPI returns true when two VMs have IDs like:
/subscriptions/xxx/resourceGroups/xxx/Microsoft.Compute/virtualMachineScaleSets/xxx/virtualMachines/1
/subscriptions/xxx/resourceGroups/xxx/Microsoft.Compute/virtualMachineScaleSets/yyy/virtualMachines/1
even though these are different machines.
I guess you could try doing both checks, but if you ever fail the stricter one (e.g. node doesn't exist yet), you can still pass the "less strict" one (using the example above, if yyy/1 didn't exist yet but that's the machine we were looking for).
You could split on more tokens, but I don't think that works for other clouds.
e: ah, timing. I think the example should answer your q?
I think so, thanks for the example!
I vaguely remember at the time that providerIDs had some sort of structure, where the unique identifier was supposed to be at the end, but if this is not the case I think it might make sense to start requiring cloud providers to form an ID exactly as their CCM counterpart would.
Changing the behavior is a breaking change, which we should introduce over time. If we can find a way, as you suggested, to keep the current behavior and potentially alert on discrepancies, we can avoid breaking current clusters while still working towards the end goal.
cc @yastij @sedefsavas @srm09
Let's enqueue this discussion topic for this week's community meeting as well, and make sure to reach out to cloud providers and mailing list to align providerID generation.
Thoughts?
we had to change for something more structured given that some Cluster API providers can only set (guess?) a partial provider ID. Looking through the Cluster API AWS provider code, we're still doing that today https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/760e4e7651f46c29bf309607a0b6307570228988/pkg/cloud/scope/machine.go#L144-L148
Wouldn't this new full check still work for AWS though? Since cloud provider also sets the same "partial" provider ID with zone and instance ID: https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v2/instances.go#L246
Unless we have a contract requirement for every cloud provider to match their equivalent CCM/CPI codebase, we should keep this behavior and introduce a new one over time.
Is there any valid use case for a provider ever setting a different provider ID on Nodes and Machines? IMO it makes more sense to enforce that contract of matching Machine Provider IDs with Node Provider IDs than breaking some providers like Azure because of assumptions made a long time ago based on how some providers format their provider IDs.
@CecileRobertMichon I think we're aligned on enforcing the contract; for AWS specifically it should work yes, what I'd like to propose is to introduce the change described here incrementally given that there might be infrastructure providers that are relying on the current behavior. Given that this is a breaking change, would we agree to:
- Discuss the current issue and proposed solution at the Cluster API community meeting
- Send out an email to the mailing list with the proposed changes + an example
- Set out a deprecation policy for the fallback behavior
Thoughts?
potentially alert on discrepancies
I particularly like this. Might be tricky with a webhook given the cross-object (cross-cluster? mgmt <-> workload) validation, but perhaps a condition + some sort of quick validation with kubectl would be nice.
Will try to avoid suggesting much else in the weeds for now, but I think that could provide a nice path for e.g. new behavior opt-in -> you have a way to ensure you won't break -> new behavior by default (or something).
@jackfrancis do you have the bandwidth to take the steps outlined by Vince above?
@vincepri @alexeldeib can we re-clarify what the "fallback behavior" is exactly? Reading through the above thread, it seems pretty clear that the current state of the changeset in this PR does not include any fallback.
@jackfrancis A fallback could be defined as @alexeldeib pointed out above:
I guess you could try doing both checks, but if you ever fail the stricter one (e.g. node doesn't exist yet), you can still pass the "less strict" one (using the example above, if yyy/1 didn't exist yet but that's the machine we were looking for).
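For illustration only, the two-phase check being quoted could be sketched as below. The function names are hypothetical, not capi code, and as the surrounding discussion notes this approach carries ambiguity risk:

```go
package main

import (
	"fmt"
	"strings"
)

// lastSegment returns the trailing path segment of a provider ID.
func lastSegment(s string) string {
	segs := strings.Split(s, "/")
	return segs[len(segs)-1]
}

// cloudProvider returns the scheme portion before "://".
func cloudProvider(s string) string {
	return strings.SplitN(s, "://", 2)[0]
}

// equalsWithFallback tries the strict full-string comparison first,
// then falls back to the legacy cloudProvider+ID check.
func equalsWithFallback(a, b string) bool {
	if a == b { // strict: entire provider ID matches
		return true
	}
	// legacy: same cloud provider and same trailing ID segment
	return cloudProvider(a) == cloudProvider(b) && lastSegment(a) == lastSegment(b)
}

func main() {
	// Matches via the legacy fallback even though the zones differ,
	// illustrating exactly the ambiguity the discussion worries about.
	fmt.Println(equalsWithFallback(
		"aws:///us-east-1a/i-abc",
		"aws:///us-east-1b/i-abc",
	))
}
```

The problem called out in the thread is visible here: a lookup that fails the strict check for a not-yet-existing machine can still pass the loose one against the wrong machine.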
That said, I think given that this is a breaking change and there could be undefined behaviors by running both tests, we should probably just run a big warning in the release notes and make sure providers that are affected update their code appropriately.
/hold cancel