Add Feature Gates to Manage dra-example-driver Compatibility
In this issue, we discuss whether we should introduce feature gates for the dra-example-driver.
The core DRA API (Structured Parameters) will become GA in version 1.34 and will be enabled by default. Additionally, DRA features in beta will also start being enabled by default.
The beta features in version 1.34 relevant to the dra-example-driver are:
- DRA: Partitionable Devices
- DRA: Prioritized Alternatives in Device Requests
Even if the dra-example-driver implements these two features, they may not function properly in clusters where these features are not supported or enabled. For instance:
- Version 1.32 clusters only support core DRA as a beta feature.
- Version 1.33 clusters have core DRA in beta, while Partitionable Devices and Prioritized Alternatives are alpha features and may not be enabled.
Furthermore, more DRA-related features will continue to be added upstream. Users installing the latest dra-example-driver's helm charts and images may encounter errors due to unsupported features in their cluster environments.
Should we introduce feature gates in the dra-example-driver to manage compatibility across different Kubernetes releases?
Here is a simple example illustrating this concept:
| DRA Feature | dra-example-driver v0.1.0 | dra-example-driver v0.2.0 | dra-example-driver v0.3.0 |
|---|---|---|---|
| Compatible K8s version | v1.32 | v1.33 | v1.34 |
| Partitionable Devices API | Disabled | Disabled | Enabled |
| Prioritized Alternatives | Disabled | Disabled | Enabled |
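To make the proposal concrete, here is a minimal sketch of what driver-side feature gates could look like. The gate names mirror the features discussed above, but the registry and the `--feature-gates`-style override string are hypothetical wiring for illustration, not the driver's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical feature-gate registry for the dra-example-driver.
// Gates default to off so the driver stays usable on older clusters.
var defaultFeatureGates = map[string]bool{
	"PartitionableDevices":    false,
	"PrioritizedAlternatives": false,
}

// parseFeatureGates applies a comma-separated "Name=true,Other=false"
// override string, mimicking the kubelet-style --feature-gates flag.
func parseFeatureGates(spec string, gates map[string]bool) error {
	if spec == "" {
		return nil
	}
	for _, kv := range strings.Split(spec, ",") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return fmt.Errorf("invalid feature gate %q", kv)
		}
		name := strings.TrimSpace(parts[0])
		if _, ok := gates[name]; !ok {
			return fmt.Errorf("unknown feature gate %q", name)
		}
		gates[name] = strings.TrimSpace(parts[1]) == "true"
	}
	return nil
}

func main() {
	if err := parseFeatureGates("PartitionableDevices=true", defaultFeatureGates); err != nil {
		panic(err)
	}
	fmt.Println(defaultFeatureGates["PartitionableDevices"])    // true
	fmt.Println(defaultFeatureGates["PrioritizedAlternatives"]) // false
}
```

A chart value could then set this string per release, matching the compatibility table above.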
cc @nojnhuh @mortent @pohly @klueska
Prioritized Alternatives has no impact on the driver, or no impact that would need a feature gate (it might have to handle sub-request names, but I forgot the details - we might handle it for the driver in the helper).
Usage of partitionable devices would be covered by what hardware the driver is being asked to simulate, which is better handled by a separate flag than a feature gate.
The bigger question is: which API version should the driver and helper package use? We kept using v1beta1 in 1.33.
Pro: DRA drivers using that code, like the example driver, work with 1.32.
Con: they are less compatible with future Kubernetes releases, because v1beta1 gets removed one release earlier than v1beta2.
If we keep doing this right until v1beta1 really gets removed, then all DRA drivers will have to be updated before users can upgrade to that new release (no transition period!).
A technical solution to this problem would be to always use the latest API version in the Go API and provide a helper package which converts to and from an older API version if needed. But supporting v1beta1 this way would be harder because the conversion code is ugly and only available internally - we might have to move it.
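The shape of such a conversion helper can be sketched with simplified stand-in types. The struct fields below are invented for the sketch and do not match the real DRA API; the point is only that converting down to an older version drops fields the old API lacks, and converting back up cannot restore them:

```go
package main

import "fmt"

// Illustrative stand-ins for two versions of a DRA API type.
type DeviceV1beta1 struct {
	Name string
}

type DeviceV1beta2 struct {
	Name string
	// A field that only exists in the newer version.
	Partitions []string
}

// ConvertToV1beta1 converts down to the older version, silently dropping
// fields the old API does not have. A real helper would round-trip
// through the internal conversion code mentioned above.
func ConvertToV1beta1(in DeviceV1beta2) DeviceV1beta1 {
	return DeviceV1beta1{Name: in.Name}
}

// ConvertToV1beta2 converts up; the dropped fields stay empty.
func ConvertToV1beta2(in DeviceV1beta1) DeviceV1beta2 {
	return DeviceV1beta2{Name: in.Name}
}

func main() {
	newer := DeviceV1beta2{Name: "gpu-0", Partitions: []string{"mig-1g"}}
	older := ConvertToV1beta1(newer)
	fmt.Println(older.Name)                            // gpu-0
	fmt.Println(len(ConvertToV1beta2(older).Partitions)) // 0
}
```

The lossy down-conversion is exactly why drivers pinned to the newest Go API could still talk to older clusters, at the cost of newer fields being ignored there.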
/cc @klueska @mortent
/title compatibility with different Kubernetes releases
Usage of partitionable devices would be covered by what hardware the driver is being asked to simulate, which is better handled by a separate flag than a feature gate.
Personally, I'm fine with either a feature gate or a flag.
Usage of partitionable devices would be covered by what hardware the driver is being asked to simulate, which is better handled by a separate flag than a feature gate.
I agree that a flag seems better for partitionable devices than a feature gate. A feature gate is only an on/off switch but a flag would allow more configuration describing what partitions are available.
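A flag along these lines could select the simulated hardware model. Both the flag name and the model names below are hypothetical, chosen only to illustrate the flag-based (rather than on/off feature-gate) configuration being discussed:

```go
package main

import (
	"flag"
	"fmt"
)

// describeHardware maps a simulated-hardware model name to what the
// driver would publish. The model names are illustrative placeholders.
func describeHardware(model string) string {
	switch model {
	case "plain-gpus":
		return "publishing whole-device ResourceSlices"
	case "partitionable-gpus":
		return "publishing ResourceSlices with shared counters and partitions"
	default:
		return fmt.Sprintf("unknown hardware model %q", model)
	}
}

func main() {
	// Hypothetical flag; not an actual dra-example-driver option.
	model := flag.String("simulated-hardware", "plain-gpus",
		"hardware model to advertise: plain-gpus or partitionable-gpus")
	flag.Parse()
	fmt.Println(describeHardware(*model))
}
```

Unlike a boolean gate, the flag value could later grow into a richer description of which partitions are available.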
The bigger question is: which API version should the driver and helper package use? We kept using v1beta1 in 1.33.
In a perfect world, I think we'd follow a similar pattern as regular API objects where the helper allows users to pick any version and it will handle conversion as needed. That way drivers could move as fast or as slow as they want as the set of API versions changes and would allow smoother migrations without any particular release updating the helper from one single API version to another.
I've added automatic conversion between API versions to https://github.com/kubernetes/kubernetes/pull/131246.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
@nojnhuh: Is this issue still relevant?
We had the idea of "profiles" in the example driver, with each "profile" taking advantage of different Kubernetes features. What's the status regarding that?
I've only just started playing with profiles so I haven't gotten to this point yet, but I think that will cover this.
I'm thinking that if the dra-example-driver adds an NVIDIA MIG-like profile for partitionable devices, then the DRAPartitionableDevices feature gate would have to be set in the cluster when that profile is in use. Then if it's not set, the driver will detect dropped fields in the ResourceSlice at runtime and error out. The default profile continues to require only default DRA functionality, so users can keep using a bleeding-edge dra-example-driver on an older cluster as long as they use profiles compatible with that version of Kubernetes.
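The "detect dropped fields at runtime and error out" step could look roughly like the following. The struct and field names are simplified stand-ins, not the real ResourceSlice schema; the sketch only shows the read-back comparison:

```go
package main

import "fmt"

// slice is a simplified stand-in for a published ResourceSlice.
type slice struct {
	// SharedCounters is only honored by the API server when the
	// DRAPartitionableDevices feature gate is enabled.
	SharedCounters []string
}

// detectDroppedFields compares what the driver wrote against what the
// API server stored; an older or gate-disabled server strips unknown
// fields, which shows up as a populated field coming back empty.
func detectDroppedFields(written, readBack slice) error {
	if len(written.SharedCounters) > 0 && len(readBack.SharedCounters) == 0 {
		return fmt.Errorf("cluster dropped SharedCounters; enable the DRAPartitionableDevices feature gate or use a profile compatible with this cluster")
	}
	return nil
}

func main() {
	written := slice{SharedCounters: []string{"memory"}}
	// Simulate an older API server that strips the unknown field.
	readBack := slice{}
	if err := detectDroppedFields(written, readBack); err != nil {
		fmt.Println("error:", err)
	}
}
```

With this check, a MIG-like profile fails fast on an unsupported cluster instead of silently publishing incomplete slices.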
Let's keep this issue open though to make sure that pans out.
/remove-lifecycle rotten
The default profile continues to require only default DRA functionality, so users can keep using a bleeding-edge dra-example-driver on an older cluster as long as they use profiles compatible with that version of Kubernetes.
That matches my expectations.