
Introduce distributionVersion field for improved Kubernetes Distribution version handling

Open rccrdpccl opened this issue 9 months ago • 9 comments

What would you like to be added (User Story)?

As a user, I would like to have a distributionVersion field to better handle versioning of the Kubernetes distribution being installed.

Detailed Description

Currently, the handling of Kubernetes versions in the version field is problematic when installing or upgrading a Kubernetes distribution that uses its own version scheme and lifecycle. A distribution can include more software components than just Kubernetes, and/or follow a distinct support lifecycle in which fixes and other changes are released on its own timeline. Those characteristics can necessitate an independent version scheme. The version field, present in multiple resources (ControlPlane, MachineSpec, Cluster Topology), specifically represents the Kubernetes version. In order to control the cluster's distribution version, we need to specify its value for the ControlPlane and MachineSet/MachineDeployment objects involved. This is problematic: when a user wants to deploy a specific version of a Kubernetes distribution, they cannot specify that version directly; it has to be derived from the related Kubernetes version, which is not always possible.

To address this, we propose introducing a new field, spec.distributionVersion, in the following resources:

  • ControlPlane (ControlPlane.Spec.Version)
  • MachineSpec (Machine.Spec.Version / MachineSet.Spec.Template.Spec.Version / MachineDeployment.Spec.Template.Spec.Version / MachinePool.Spec.Template.Spec.Version)
  • Topology (Cluster.Topology.Version)

The new field and the current spec.version would be mutually exclusive.

This would be an optional field for both the ControlPlane contract and MachineSpec. Its value would be the distribution version (e.g. for OpenShift it should be something like v4.17.0). No version semantics will be imposed on the field.

If distributionVersion is present, status.version should be populated with the related Kubernetes version (e.g. for OpenShift with distributionVersion set to 4.17.0, status.version should be 1.30.4).

All logic that makes a decision based on evaluating the kubernetes version should rely on the status field instead of the spec field.
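
To make the proposal more concrete, here is a rough sketch of how the field might look on a control plane resource; the resource kind and values are placeholders, and the exact shape would be defined by the contract:

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: ExampleControlPlane        # placeholder for a provider-specific control plane kind
metadata:
  name: my-cluster-control-plane
spec:
  replicas: 3
  # proposed new field; mutually exclusive with spec.version, no version semantics imposed
  distributionVersion: v4.17.0
status:
  # populated by the owning controller with the Kubernetes version backing v4.17.0
  version: v1.30.4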

Related:

https://github.com/kubernetes-sigs/cluster-api/pull/11564

/cc @fabriziopandini @enxebre @sbueringer

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature
/area api

rccrdpccl avatar Feb 07 '25 13:02 rccrdpccl

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Feb 07 '25 13:02 k8s-ci-robot

Thanks for filing this issue as a follow up of the discussion on the PR!

Somewhat related, we should also consider KEP-4330: Compatibility Versions in Kubernetes, which introduces emulated version and minimum compatibility version as key information influencing the cluster behaviour (see the user stories in the KEP).

fabriziopandini avatar Feb 07 '25 19:02 fabriziopandini

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 08 '25 19:05 k8s-triage-robot

/remove-lifecycle stale

rccrdpccl avatar May 14 '25 16:05 rccrdpccl

Discussed in the 22nd of May office hours

The new field and the current spec.version would be mutually exclusive.

Unfortunately, the idea of leaving the spec.version field unset under some circumstances is going to create many problems in CAPI.

We need a plan (possibly not invasive) to address the fact that CAPI is full of code paths where the code assumes that version is set and represents a K8s version; this is at the core of foundational constructs like the entire upgrade process and all the related test machinery.

Hopefully the previous comments on the same topic can help in starting the work on this plan:

  • https://github.com/kubernetes-sigs/cluster-api/pull/11564#issuecomment-2538767839
  • https://github.com/kubernetes-sigs/cluster-api/pull/11564#issuecomment-2548223084
  • https://github.com/kubernetes-sigs/cluster-api/pull/11564#issuecomment-2579184582

Also, it is worth recalling https://github.com/kubernetes-sigs/cluster-api/issues/11816#issuecomment-2643846693 from above, and bringing up possible impacts on the ongoing discussion in https://github.com/kubernetes-sigs/cluster-api/pull/12199.

fabriziopandini avatar May 22 '25 10:05 fabriziopandini

Unfortunately, the idea of leaving the spec.version field unset under some circumstances is going to create many problems in CAPI.

Would moving it to status (mirroring it when spec is defined, and deriving it from distributionVersion when not) be too disruptive? I totally understand that the K8s version should always be available; that was one of the targets of this initial proposal.
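
To spell out the two cases being suggested (fields shown out of any specific resource, values purely illustrative):

# Case 1: spec.version is set; status.version simply mirrors it.
spec:
  version: v1.30.4
status:
  version: v1.30.4

# Case 2: only spec.distributionVersion is set; status.version is derived from it.
spec:
  distributionVersion: v4.17.0
status:
  version: v1.30.4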

In any case I'll take all this into account and get back with a possibly improved and more detailed proposal, thanks for the feedback - much appreciated 🙏

rccrdpccl avatar May 22 '25 12:05 rccrdpccl

If distributionVersion is present, status.version should be populated with the related Kubernetes version (e.g. for OpenShift with distributionVersion set to 4.17.0, status.version should be 1.30.4).

What component would be responsible for this?

Would moving it to status (mirror it when spec is defined, and derive from distributionVersion when not) be too disruptive? I totally understand that K8s version should always be available, that was one of the target of this initial proposal.

This could be beneficial for more than just the proposal here. Do we do any validation of upgrades at the moment? If we had the option of spec (I want this) and status (the controller has verified through some rules that the upgrade is permitted), then that could unlock more in the way of pre-flight checks, couldn't it?

I know that within OpenShift, for example, we have similar patterns in other places: a user requests "I want this", we validate that transition before accepting it, and then the controllers observe the status where we've said "yes, we verified this transition is acceptable".

JoelSpeed avatar May 22 '25 16:05 JoelSpeed

What component would be responsible for this?

I think for each CR its owner should be responsible for updating this field. However, we need a component that resolves the K8s version for each distribution version, and I don't think this task fits neatly into any currently defined component. If we want to discuss implementation, what I had in mind is a thin controller (optional in general, but mandatory for distributionVersion usage) maintaining a new CRD, DistributionVersionSet, which would hold a list of distributionVersion/K8s version pairs; each controller could then watch that resource to update its status field - something like the following:

apiVersion: cluster.x-k8s.io/v1beta1
kind: DistributionVersionSet
metadata:
  name: ocp-distribution-versions
spec:
  versions:
  - distributionVersion: 4.18.0
    version: 1.31.1
  - distributionVersion: 4.19.4
    version: 1.33.4

This way the Cluster, MachineDeployment, ControlPlane, etc. controllers could just look up the values and fill the status of their own resources.
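
As an illustration, with the DistributionVersionSet above, a MachineDeployment using the proposed field could end up looking like this (the field names and the exact status location are hypothetical):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: ocp-workers
spec:
  template:
    spec:
      # proposed field; spec.version left unset
      distributionVersion: 4.18.0
status:
  # filled by the MachineDeployment controller after looking up 4.18.0 in the set
  version: 1.31.1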

rccrdpccl avatar May 26 '25 13:05 rccrdpccl

I like the idea of abstracting this away into something that components can look up, that makes sense. I do wonder though, how would the components know which distribution version set they need to look at? Would there be some configuration on the cluster that would reference the correct set for example?

JoelSpeed avatar May 28 '25 14:05 JoelSpeed

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 26 '25 14:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 25 '25 15:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Oct 25 '25 15:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Oct 25 '25 15:10 k8s-ci-robot