🐛 InfrastructureMachineSpec.FailureDomain is required to be a String
While writing https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1466 I encountered this code: https://github.com/kubernetes-sigs/cluster-api/blob/327aae23835bb184aa7e2afa592618529e77f16f/internal/controllers/machine/machine_controller_phases.go#L305-L314. It attempts to copy the contents of a field called `failureDomain` in the infrastructure machine spec to the machine spec.
This places an additional constraint on the infrastructure machine spec: a field called `failureDomain`, if it exists, must be a string. If this limitation is intentional, it is not documented with the rest of the failure domain interface in the Cluster API book: https://cluster-api.sigs.k8s.io/developer/providers/cluster-infrastructure.html
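For context, here is a minimal sketch of what that code does, assuming apimachinery's unstructured helpers (the real controller json-decodes the field via a CAPI util helper, but the string-only constraint is the same):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// copyFailureDomain approximates the linked controller logic: read
// spec.failureDomain from the unstructured infrastructure machine and return
// it for copying into Machine.Spec.FailureDomain. NestedString returns an
// error when the field exists but is not a string, which is the constraint
// this issue is about.
func copyFailureDomain(infraMachine *unstructured.Unstructured) (*string, error) {
	fd, found, err := unstructured.NestedString(infraMachine.Object, "spec", "failureDomain")
	if err != nil {
		return nil, fmt.Errorf("failed to retrieve failure domain from infrastructure provider: %w", err)
	}
	if !found {
		return nil, nil // field absent: leave Machine.Spec.FailureDomain alone
	}
	return &fd, nil
}

func main() {
	infra := &unstructured.Unstructured{Object: map[string]interface{}{
		"spec": map[string]interface{}{"failureDomain": "az-1"},
	}}
	if fd, err := copyFailureDomain(infra); err == nil && fd != nil {
		fmt.Println(*fd) // az-1
	}
}
```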
The behaviour of this code is to overwrite MachineSpec.FailureDomain with the value in InfrastructureMachineSpec.FailureDomain whenever the latter is defined. There are only two cases in which this has any effect:
- When the infrastructure provider may schedule a machine in a failure domain other than the one that was requested.
- When no failure domain was explicitly requested, but we want to communicate to CAPI which one the infrastructure provider chose anyway.
Hopefully no providers have the behaviour in (1); it would be almost impossible to provide reliable infrastructure in that case.
So my best guess is that the intention was (2)? However, in that case the CAPI controller isn't going to use the failure domain, so we've placed a restriction on the infrastructure provider for purely aesthetic reasons?
- Does anybody know if any provider is using this behaviour, and why?
- If no provider is using it, could it be removed?
- Either way, should it be documented?
Anything else you would like to add: The specific error produced is:
E0213 10:22:43.407582 11 controller.go:329] "Reconciler error" err="failed to retrieve failure domain from infrastructure provider for Machine \"cluster-mbooth-control-plane-7k725\" in namespace \"mbooth\": failed to json-decode field \"spec.failureDomain\" value from \"infrastructure.cluster.x-k8s.io/v1alpha6, Kind=OpenStackMachine\": json: cannot unmarshal object into Go value of type string" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="mbooth/cluster-mbooth-control-plane-7k725" namespace="mbooth" name="cluster-mbooth-control-plane-7k725" reconcileID=d1a829f9-2485-4ab0-a9dc-01575684fb12
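For anyone hitting the same error: the root cause is simply that the controller json-decodes the field value into a Go string, and in this cluster spec.failureDomain held an object. A self-contained reproduction (the object's contents are a made-up placeholder):

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// An object where the controller expects a plain JSON string.
	raw := []byte(`{"availabilityZone": "az-1"}`) // hypothetical object value
	var fd string
	err := json.Unmarshal(raw, &fd)
	fmt.Println(err) // json: cannot unmarshal object into Go value of type string
}
```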
/kind bug
If I'm reading this right, the infrastructure machine's failureDomain field is not designed to be used with newer versions of the API; it exists only for backwards compatibility.
From https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure.html:
> failureDomain (string): the string identifier of the failure domain the instance is running in for the purposes of backwards compatibility and migrating to the v1alpha3 FailureDomain support (where FailureDomain is specified in Machine.Spec.FailureDomain). This field is meant to be temporary to aid in migration of data that was previously defined on the provider type and providers will be expected to remove the field in the next version that provides breaking API changes, favoring the value defined on Machine.Spec.FailureDomain instead. If supporting conversions from previous types, the provider will need to support a conversion from the provider-specific field that was previously used to the failureDomain field to support the automated migration path.
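So per that contract, a provider still carrying the field declares it as a plain string, roughly like this sketch (FooMachineSpec is a made-up type for illustration):

```go
// Package v1beta1 is a hypothetical infrastructure provider API package.
package v1beta1

// FooMachineSpec is an illustrative provider machine spec following the
// contract quoted above.
type FooMachineSpec struct {
	// FailureDomain is the string identifier of the failure domain the
	// instance is running in. Per the contract, it exists only for
	// backwards compatibility; Machine.Spec.FailureDomain is authoritative.
	// +optional
	FailureDomain *string `json:"failureDomain,omitempty"`
}
```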
Ah, so it was an upgrade thing at some point?
Sounds like we should remove it? It's just a trip hazard now.
I think this will be part of the conversation around removing the v1alpha3 API types in #8071. Part of that discussion involves removing old code which only exists for migrating and dealing with issues on older API types.
I think I will move our 'hydrated' failure domain to the InfrastructureMachineStatus instead. This is a bit of a workaround, but it was a reasonable candidate for somewhere to stash it anyway. It's a bit of a grey area because, while the failure domain is part of the specification of the machine, it's not a field I expect a user to typically enter manually: the machine controller copies the hydrated value based on the reference in the MachineSpec before creating the machine.
The primary issue here is that we need to ensure the InfrastructureMachineSpec remains immutable for its entire lifetime. This is complex to do when you only have a reference to some of it, and the referenced object may be altered or deleted. The simplest solution is to just copy it and ignore the reference thereafter. This also leaves us maximum flexibility for wider scoped solutions in the future.
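For illustration, the move might look roughly like this (a hypothetical sketch, not the actual CAPO API):

```go
// Package v1alpha7 is a hypothetical future CAPO API version.
package v1alpha7

// OpenStackMachineStatus sketch: the controller records the failure domain it
// resolved ("hydrated") from the reference in the spec, keeping the immutable
// InfrastructureMachineSpec free of the copied value.
type OpenStackMachineStatus struct {
	// FailureDomain is the failure domain the machine was actually
	// scheduled into. (Hypothetical field.)
	// +optional
	FailureDomain *string `json:"failureDomain,omitempty"`
}
```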
> I think this will be part of the conversation around removing the v1alpha3 API types in #8071. Part of that discussion involves removing old code which only exists for migrating and dealing with issues on older API types.
Just referencing my comment from here: https://github.com/kubernetes-sigs/cluster-api/pull/8071#issuecomment-1429545502
My plan was to remove the v1alpha3 & v1alpha4 API types and packages in core CAPI. I wasn't planning to touch the contract stuff.
I think we should have a separate discussion & issue about leftovers of previous contracts. In my mind the contract covers how core Cluster API interacts with providers (specifically bootstrap, control plane, and infra resources). As far as I can tell, a specific Cluster API version is only supposed to support one contract version (e.g. v1beta1) at a time. If we still have code in Cluster API today that covers previous contract versions, I think we now have to consider it part of the v1beta1 contract. We should figure out exactly what code that is; then I think we should deprecate this behavior (with the v1beta1 contract) and eventually get rid of it once we bump the contract to v1beta2.
/triage accepted
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/priority important-longterm
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I just checked https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure and failure domain is defined as failureDomain (string).
We can improve, but I think this answers the question in the issue above.
Orthogonal: as I said many times, there is huge room for improvement in how we manage failure domains...
/close
@fabriziopandini: Closing this issue.
In response to this:
> I just checked https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure and failure domain is defined as failureDomain (string)
> We can improve, but I think this answers the question in the issue above
> Orthogonal, as I said many times, there is huge room for improvement in how we manage failure domains...
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.