dra-example-driver demonstrating advanced features

We want to demonstrate advanced features like partitionable devices with the example driver. This raises the question: should we use artificial, fake devices, or mirror how a real vendor driver advertises real hardware?

On the one hand, we prefer to stay vendor-neutral. On the other hand, a real example is easier to understand and would make the example driver more useful for realistic benchmarking.

After some discussions at KubeCon and in https://kubernetes.slack.com/archives/C0409NGC1TK/p1743667181489199, here's a proposal. In the top-level README.md, we add a new section:

Configuration

Vendors are encouraged to work with the Kubernetes maintainers to enhance DRA for their use cases. When this leads to new features, it is desirable to extend the example driver so that it demonstrates those features by emulating a vendor driver for certain hardware. Later, novel usages of existing features may also justify extending the example driver.

At the moment, the driver supports the following profiles:

gpu (default): 8 generic GPUs per node. Works on Kubernetes >= 1.32.
nvidia-mig: two NVIDIA A100 GPUs per node, with the same attributes as real hardware. Works on Kubernetes >= 1.33.
google-tpu: models multi-host devices. Works on Kubernetes >= 1.33.

Each deployment of the example driver uses exactly one profile and <profile name>.dra.example.com as driver name. To configure the profile, ... [TBD]. These profiles do not actually emulate any real hardware. Instead, they merely inject environment variables which mirror the devices that were allocated.
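To make the naming rule concrete, here is a minimal Go sketch of how a profile could map to a driver name and to the environment variables injected for allocated devices. Everything in it is an assumption for illustration: the `Profile` type, the `DriverName` and `DeviceEnv` functions, and the `EXAMPLE_DEVICE_<n>` variable names are hypothetical and not taken from the driver's code, and how the profile gets selected is still TBD.

```go
package main

import "fmt"

// Profile is a hypothetical description of one profile from the list above.
type Profile struct {
	Description   string
	MinKubernetes string
}

// profiles mirrors the three profiles named in the proposal.
var profiles = map[string]Profile{
	"gpu":        {Description: "8 generic GPUs per node", MinKubernetes: "1.32"},
	"nvidia-mig": {Description: "two NVIDIA A100 GPUs per node", MinKubernetes: "1.33"},
	"google-tpu": {Description: "multi-host devices", MinKubernetes: "1.33"},
}

// defaultProfile is used when no profile is given, as in the proposal.
const defaultProfile = "gpu"

// DriverName returns "<profile name>.dra.example.com" for a known profile.
// An empty profile name falls back to the default profile.
func DriverName(profile string) (string, error) {
	if profile == "" {
		profile = defaultProfile
	}
	if _, ok := profiles[profile]; !ok {
		return "", fmt.Errorf("unknown profile %q", profile)
	}
	return profile + ".dra.example.com", nil
}

// DeviceEnv builds one environment variable per allocated device.
// The EXAMPLE_DEVICE_<n> naming is made up for this sketch.
func DeviceEnv(devices []string) []string {
	env := make([]string, 0, len(devices))
	for i, d := range devices {
		env = append(env, fmt.Sprintf("EXAMPLE_DEVICE_%d=%s", i, d))
	}
	return env
}

func main() {
	name, err := DriverName("nvidia-mig")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
	for _, kv := range DeviceEnv([]string{"gpu-0", "gpu-1"}) {
		fmt.Println(kv)
	}
}
```

Keeping the profile table in one place means an unknown profile name fails early, before the driver registers under a misleading name.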

/assign @bg-chun

Apr 07 '25 13:04 pohly

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jul 06 '25 14:07 k8s-triage-robot

@bg-chun: are you still interested in working on this?

Jul 06 '25 15:07 pohly

/remove lifecycle-state

Jul 06 '25 15:07 pohly

/remove-lifecycle stale

Jul 06 '25 15:07 pohly

@bg-chun If you're not already too deep into this, I'd like to get this started. At least as far as refactoring to introduce the concept of a "profile" with only the "gpu" one implemented.

Aug 20 '25 21:08 nojnhuh

@nojnhuh Yes, go ahead. I’m currently focused on a few internal projects that will be released and open-sourced soon, so it’s a little tough to make extra time right now. However, I’ll try to take a look and keep track of your PRs.

Aug 20 '25 21:08 bg-chun

Sounds good, thanks!

/assign

Aug 20 '25 21:08 nojnhuh

/lifecycle stale

Nov 18 '25 21:11 k8s-triage-robot

Actively working on this!

/remove-lifecycle stale

Nov 18 '25 21:11 nojnhuh