intel-device-plugins-for-kubernetes
intel-device-plugins-for-kubernetes copied to clipboard
WIP: Use intel-gpu-plugin with intel-gpu-fakedev generated devices
This PR depends on https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1114 + https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1116 PRs being merged first.
It offers two alternatives for integrating support for installing GPU device plugin to a cluster with fake device support. Either by adding separate gpu_fakedev YAML directory which configures GPU plugin differently from start, or by providing kustomization overlay for existing GPU plugin YAML files.
(Once there's consensus which approach is better, I'll remove the other one.)
I'd prefer adding just an overlay for existing GPU plugin YAMLs, but I have not been able to force kustomize to order init containers as needed. Any advice?
PS. The whole picture and initial review comments are in the RFC PR https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1104, from which this is split off.
Updated the kustomize change to pass CI test. If kustomize cannot be made to work (order init containers as needed) that change (along with first 2 commits) is redundant.
How the new YAMLs work:
Fake device generator (with suitable fake configuration file) is run as the first init container, and (existing) NFD hook init container produces NFD feature file instead installing hook binary to host. Sysfs and devfs content created by the fake device generator is mounted both to NFD hook init container and GPU plugin container.
Devfs needs to be a real host directory so that container runtime can mount the assigned (fake) device file to a (fake) workload, and workload can then report to test runner which device was assigned to it. Because current GPU plugin + container runtime require host and GPU plugin container paths to match, and container runtime does not allow creating files to regular sysfs/devfs paths, different paths need to be used for the fake ones.
If kustomize init container order cannot be forced so that fake device generator runs first, init container writing NFD features file could be changed into a regular container running forever (that's what needs to be done eventually any way, when NFD drops hook support).
So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?
So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?
Two months have passed. Any comments on this?
So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?
Two months have passed. Any comments on this?
Additional month passed. Any comments?
Note: fakedev-exporter project relies on this, as documented here: https://github.com/intel/fakedev-exporter/blob/main/deployments/README.md
Additional month passed. Any comments?
is it still WIP?
Additional month passed. Any comments?
is it still WIP?
It's waiting for you (I mean reviewers) to comment on which of the two implemented alternatives you'd rather have; it being a GPU plugin overlay, or its own "gpu_fakedev" kustomize directory?
I can then drop the WIP prefix and other alternative.
Additional month passed. Any comments?
is it still WIP?
It's waiting for you (I mean reviewers) to comment on which of the two implemented alternatives you'd rather have; it being a GPU plugin overlay, or its own "gpu_fakedev" kustomize directory?
I can then drop the WIP prefix and other alternative.
Oh OK, I've not looked at it at all since it says WIP :-)
+1 for the overlay base approach
@mythi Ok, I split the changes for a new gpu_fakedev kustomize dir to a separate branch:
https://github.com/eero-t/intel-device-plugins-for-kubernetes/commits/kustomize-gpu_fakedev
Merged the remaining commits to 2 and rebased it to latest main.
Except for the dropped large commit (+ a typo fix), content is same as earlier.
EDIT:
Pros for standalone daemonset:
- Easy to specify init container order
- Easy to rename existing things
Pros for kustomize overlay:
- Automatically tracks GPU plugin deployment changes
- Easier to take advantage of other GPU plugin overlays
Any comments on the nfd-source-hooks -> nfd-features name change, would you e.g. rather have it as a separate PR?
Any comments on the
nfd-source-hooks->nfd-featuresname change, would you e.g. rather have it as a separate PR?
It's OK like this.
Rebased this to main to drop merged https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1327.
I will do more testing on this & remove WIP after "intel/intel-gpu-fakedev" image referred in kustomize file is available.
I will do more testing on this & remove WIP after "intel/intel-gpu-fakedev" image referred in kustomize file is available.
do we need to wait for the image?
do we need to wait for the image?
IMHO there's not much point in merging kustomize files that rely on non-existing image. Users are unlikely to do their own builds and kustomize the kustomize in this PR to point to their own registry...
PS. If this waits long enough, NFD will drop support for hooks, and GPU plugin needs to adapt to that. That would simplify this PR. Whereas if NFD moves further, from feature files to CRDs, more changes may be needed.
IMHO there's not much point in merging kustomize files that rely on non-existing image.
We have several deployments in the repo without having the images available so this would not be an exception. I don't mind keeping this open but I think it's OK to merge as well.
PS. If this waits long enough, NFD will drop support for hooks, and GPU plugin needs to adapt to that. That would simplify this PR. Whereas if NFD moves further, from feature files to CRDs, more changes may be needed.
NFD & GPU plugin have moved from hooks to feature files, so this needs to be updated. I'm not sure when I'll have time for that though.