demonstrating advanced features
We want to demonstrate advanced features like partitionable devices with the example driver. This raises the question: should we use artificial, fake devices, or mirror the behavior of a real vendor driver advertising real hardware?
On the one hand, we prefer to stay vendor-neutral. On the other hand, a real example is easier to understand and would make the example driver more useful for realistic benchmarking.
After some discussions at KubeCon and in https://kubernetes.slack.com/archives/C0409NGC1TK/p1743667181489199, here's a proposal. In the top-level README.md, we add a new section:
Configuration
Vendors are encouraged to work with the Kubernetes maintainers to enhance DRA for their use cases. When this leads to new features, it is desirable to extend the example driver so that it demonstrates those features by emulating a vendor driver for specific hardware. Later, novel uses of existing features may also be worth adding to the example driver.
At the moment, the driver supports the following profiles:
- gpu (default): 8 generic GPUs per node. Works on Kubernetes >= 1.32.
- nvidia-mig: two NVIDIA A100 GPUs per node, with attributes that are the same as for real hardware. Works on Kubernetes >= 1.33.
- google-tpu: models multi-host devices. Works on Kubernetes >= 1.33.
Each deployment of the example driver uses exactly one profile, with <profile name>.dra.example.com as its driver name. To configure the profile, ... [TBD]. These profiles do not actually emulate any real hardware. Instead, they merely inject environment variables that mirror the devices that were allocated.
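To make the naming rule and the version requirements above concrete, here is a minimal sketch in Python. It is not the driver's code: the `Profile` class, the `PROFILES` table, and `select_profile` are hypothetical names used only for illustration, and how a profile is actually configured is still TBD in this proposal. Only the profile names, the minimum Kubernetes versions, and the `dra.example.com` suffix are taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Suffix taken from the proposal: <profile name>.dra.example.com
DRIVER_DOMAIN = "dra.example.com"


@dataclass(frozen=True)
class Profile:
    """Hypothetical description of one example-driver profile."""
    name: str
    description: str
    min_kubernetes: Tuple[int, int]  # (major, minor)

    @property
    def driver_name(self) -> str:
        # Each deployment uses exactly one profile, which determines the driver name.
        return f"{self.name}.{DRIVER_DOMAIN}"

    def supports(self, major: int, minor: int) -> bool:
        # Tuple comparison handles "Kubernetes >= X.Y".
        return (major, minor) >= self.min_kubernetes


# Profiles as listed in the proposal; the table itself is an assumption.
PROFILES = {
    p.name: p
    for p in (
        Profile("gpu", "8 generic GPUs per node", (1, 32)),
        Profile("nvidia-mig", "two NVIDIA A100 GPUs per node", (1, 33)),
        Profile("google-tpu", "multi-host devices", (1, 33)),
    )
}

DEFAULT_PROFILE = "gpu"


def select_profile(name: Optional[str] = None) -> Profile:
    """Return the named profile, falling back to the default when none is given."""
    key = name or DEFAULT_PROFILE
    if key not in PROFILES:
        raise ValueError(f"unknown profile {key!r}; expected one of {sorted(PROFILES)}")
    return PROFILES[key]


if __name__ == "__main__":
    profile = select_profile("nvidia-mig")
    print(profile.driver_name)
    print(profile.supports(1, 32))
```

Keeping the driver name derived from the profile name, rather than configured separately, would keep the two from drifting apart across deployments.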
/assign @bg-chun
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@bg-chun: are you still interested in working on this?
/remove lifecycle-state
/remove-lifecycle stale
@bg-chun If you're not already too deep into this, I'd like to get this started. At least as far as refactoring to introduce the concept of a "profile" with only the "gpu" one implemented.
@nojnhuh Yes, go ahead. I'm currently focused on some internal work that will be released and open-sourced soon, so it's a little tough to make extra time right now. However, I'll try to take a look and keep track of your PRs.
Sounds good, thanks!
/assign
/lifecycle stale
Actively working on this!
/remove-lifecycle stale