🐛 InfrastructureMachineSpec.FailureDomain is required to be a String
While writing https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1466 I encountered this code: https://github.com/kubernetes-sigs/cluster-api/blob/327aae23835bb184aa7e2afa592618529e77f16f/internal/controllers/machine/machine_controller_phases.go#L305-L314. It attempts to copy the contents of a field called `failureDomain` in the infrastructure machine spec to the machine spec.
This places an additional constraint on the infrastructure machine spec: a field called `failureDomain`, if it exists, must be a string. If this limitation is intentional, it is not documented with the rest of the failure domain interface in the Cluster API book: https://cluster-api.sigs.k8s.io/developer/providers/cluster-infrastructure.html
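For context, here is a minimal sketch of what that code does, assuming apimachinery's unstructured helpers (the real controller json-decodes the field via a CAPI util helper, but the string-only constraint is the same):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// copyFailureDomain approximates the linked controller logic: read
// spec.failureDomain from the unstructured infrastructure machine and return
// it for copying into Machine.Spec.FailureDomain. NestedString returns an
// error when the field exists but is not a string, which is the constraint
// this issue is about.
func copyFailureDomain(infraMachine *unstructured.Unstructured) (*string, error) {
	fd, found, err := unstructured.NestedString(infraMachine.Object, "spec", "failureDomain")
	if err != nil {
		return nil, fmt.Errorf("failed to retrieve failure domain from infrastructure provider: %w", err)
	}
	if !found {
		return nil, nil // field absent: leave Machine.Spec.FailureDomain alone
	}
	return &fd, nil
}

func main() {
	infra := &unstructured.Unstructured{Object: map[string]interface{}{
		"spec": map[string]interface{}{"failureDomain": "az-1"},
	}}
	if fd, err := copyFailureDomain(infra); err == nil && fd != nil {
		fmt.Println(*fd) // az-1
	}
}
```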
The behaviour of this code is to overwrite MachineSpec.FailureDomain with the value in InfrastructureMachineSpec.FailureDomain whenever the latter is defined. There are only two cases in which this has any effect:
- When the infrastructure provider may schedule a machine in a failure domain other than the one that was requested.
- When no failure domain was explicitly requested, but we want to communicate to CAPI which one the infrastructure provider chose anyway.
Hopefully no providers have the behaviour in (1); it would be almost impossible to provide reliable infrastructure in that case.
So my best guess is that the intention was (2)? However, in that case the CAPI controller isn't going to use the failure domain, so we've placed a restriction on the infrastructure provider for purely aesthetic reasons?
- Does anybody know if any provider is using this behaviour, and why?
- If no provider is using it, could it be removed?
- Either way, should it be documented?
Anything else you would like to add: The specific error produced is:
E0213 10:22:43.407582 11 controller.go:329] "Reconciler error" err="failed to retrieve failure domain from infrastructure provider for Machine \"cluster-mbooth-control-plane-7k725\" in namespace \"mbooth\": failed to json-decode field \"spec.failureDomain\" value from \"infrastructure.cluster.x-k8s.io/v1alpha6, Kind=OpenStackMachine\": json: cannot unmarshal object into Go value of type string" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="mbooth/cluster-mbooth-control-plane-7k725" namespace="mbooth" name="cluster-mbooth-control-plane-7k725" reconcileID=d1a829f9-2485-4ab0-a9dc-01575684fb12
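For anyone hitting the same error: the root cause is simply that the controller json-decodes the field value into a Go string, and in this cluster spec.failureDomain held an object. A self-contained reproduction (the object's contents are a made-up placeholder):

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// An object where the controller expects a plain JSON string.
	raw := []byte(`{"availabilityZone": "az-1"}`) // hypothetical object value
	var fd string
	err := json.Unmarshal(raw, &fd)
	fmt.Println(err) // json: cannot unmarshal object into Go value of type string
}
```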
/kind bug
If I'm reading this right, the infrastructure machine's failureDomain field is not designed to be used with newer versions of the API; it exists only for backwards compatibility.
From https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure.html:
> failureDomain (string): the string identifier of the failure domain the instance is running in for the purposes of backwards compatibility and migrating to the v1alpha3 FailureDomain support (where FailureDomain is specified in Machine.Spec.FailureDomain). This field is meant to be temporary to aid in migration of data that was previously defined on the provider type and providers will be expected to remove the field in the next version that provides breaking API changes, favoring the value defined on Machine.Spec.FailureDomain instead. If supporting conversions from previous types, the provider will need to support a conversion from the provider-specific field that was previously used to the failureDomain field to support the automated migration path.
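So per that contract, a provider still carrying the field declares it as a plain string, roughly like this sketch (FooMachineSpec is a made-up type for illustration):

```go
// Package v1beta1 is a hypothetical infrastructure provider API package.
package v1beta1

// FooMachineSpec is an illustrative provider machine spec following the
// contract quoted above.
type FooMachineSpec struct {
	// FailureDomain is the string identifier of the failure domain the
	// instance is running in. Per the contract, it exists only for
	// backwards compatibility; Machine.Spec.FailureDomain is authoritative.
	// +optional
	FailureDomain *string `json:"failureDomain,omitempty"`
}
```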
Ah, so it was an upgrade thing at some point?
Sounds like we should remove it? It's just a trip hazard now.
I think this will be part of the conversation around removing the v1alpha3 API types in #8071. Part of that discussion involves removing old code which only exists for migrating and dealing with issues on older API types.
I think I will move our 'hydrated' failure domain to the InfrastructureMachineStatus instead. This is a bit of a workaround, but it was a reasonable candidate for somewhere to stash it anyway. It's a bit of a grey area because, while the failure domain is part of the specification of the machine, it's not a field I expect a user to typically enter manually: the machine controller copies the hydrated value based on the reference in the MachineSpec before creating the machine.
The primary issue here is that we need to ensure the InfrastructureMachineSpec remains immutable for its entire lifetime. This is complex to do when you only have a reference to some of it, and the referenced object may be altered or deleted. The simplest solution is to just copy it and ignore the reference thereafter. This also leaves us maximum flexibility for wider scoped solutions in the future.
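For illustration, the move might look roughly like this (a hypothetical sketch, not the actual CAPO API):

```go
// Package v1alpha7 is a hypothetical future CAPO API version.
package v1alpha7

// OpenStackMachineStatus sketch: the controller records the failure domain it
// resolved ("hydrated") from the reference in the spec, keeping the immutable
// InfrastructureMachineSpec free of the copied value.
type OpenStackMachineStatus struct {
	// FailureDomain is the failure domain the machine was actually
	// scheduled into. (Hypothetical field.)
	// +optional
	FailureDomain *string `json:"failureDomain,omitempty"`
}
```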
> I think this will be part of the conversation around removing the v1alpha3 API types in #8071. Part of that discussion involves removing old code which only exists for migrating and dealing with issues on older API types.
Just referencing my comment from here: https://github.com/kubernetes-sigs/cluster-api/pull/8071#issuecomment-1429545502
My plan was to remove the v1alpha3 & v1alpha4 API types and packages in core CAPI. I wasn't planning to touch the contract stuff.
I think we should have a separate discussion & issue about leftovers of previous contracts. In my mind the contract covers how core Cluster API interacts with providers (specifically bootstrap, control plane, and infra resources). As far as I can tell, a specific Cluster API version is only supposed to support one contract version (e.g. v1beta1) at a time. If we still have code in Cluster API today that covers previous contract versions, I think we now have to consider it part of the v1beta1 contract. We should figure out exactly what code that is; then I think we should deprecate this behavior (with the v1beta1 contract) and eventually get rid of it once we bump the contract to v1beta2.
/triage accepted
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/priority important-longterm
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I just checked https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure and failure domain is defined as failureDomain (string).
We can improve, but I think this answers the question in the issue above.
Orthogonal: as I said many times, there is huge room for improvement in how we manage failure domains...
/close
@fabriziopandini: Closing this issue.
In response to this:
> I just checked https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure and failure domain is defined as failureDomain (string)
> We can improve, but I think this answers the question in the issue above
> Orthogonal, as I said many times, there is huge room for improvement in how we manage failure domains...
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.