intel-device-plugins-for-kubernetes icon indicating copy to clipboard operation
intel-device-plugins-for-kubernetes copied to clipboard

WIP: Use intel-gpu-plugin with intel-gpu-fakedev generated devices

Open eero-t opened this issue 3 years ago • 16 comments

This PR depends on https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1114 + https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1116 PRs being merged first.

It offers two alternatives for integrating support for installing GPU device plugin to a cluster with fake device support. Either by adding separate gpu_fakedev YAML directory which configures GPU plugin differently from start, or by providing kustomization overlay for existing GPU plugin YAML files.

(Once there's consensus which approach is better, I'll remove the other one.)

I'd prefer adding just an overlay for existing GPU plugin YAMLs, but I have not been able to force kustomize to order init containers as needed. Any advice?

PS. The whole picture and initial review comments are in the RFC PR https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1104, from which this is split off.

eero-t avatar Aug 25 '22 17:08 eero-t

Updated the kustomize change to pass CI test. If kustomize cannot be made to work (order init containers as needed) that change (along with first 2 commits) is redundant.

eero-t avatar Aug 26 '22 16:08 eero-t

How the new YAMLs work:

Fake device generator (with suitable fake configuration file) is run as the first init container, and (existing) NFD hook init container produces NFD feature file instead installing hook binary to host. Sysfs and devfs content created by the fake device generator is mounted both to NFD hook init container and GPU plugin container.

Devfs needs to be a real host directory so that container runtime can mount the assigned (fake) device file to a (fake) workload, and workload can then report to test runner which device was assigned to it. Because current GPU plugin + container runtime require host and GPU plugin container paths to match, and container runtime does not allow creating files to regular sysfs/devfs paths, different paths need to be used for the fake ones.

eero-t avatar Aug 29 '22 08:08 eero-t

If kustomize init container order cannot be forced so that fake device generator runs first, init container writing NFD features file could be changed into a regular container running forever (that's what needs to be done eventually any way, when NFD drops hook support).

So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?

eero-t avatar Aug 29 '22 10:08 eero-t

So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?

Two months have passed. Any comments on this?

eero-t avatar Oct 27 '22 15:10 eero-t

So, the remaining question is which alternative device-plugin project wants... Fake device example (for scalability testing) as an overlay for GPU plugin kustomize directory, or as its own gpu_fakedev/ kustomize directory?

Two months have passed. Any comments on this?

Additional month passed. Any comments?

Note: fakedev-exporter project relies on this, as documented here: https://github.com/intel/fakedev-exporter/blob/main/deployments/README.md

eero-t avatar Nov 23 '22 15:11 eero-t

Additional month passed. Any comments?

is it still WIP?

mythi avatar Dec 07 '22 17:12 mythi

Additional month passed. Any comments?

is it still WIP?

It's waiting for you (I mean reviewers) to comment on which of the two implemented alternatives you'd rather have; it being a GPU plugin overlay, or its own "gpu_fakedev" kustomize directory?

I can then drop the WIP prefix and other alternative.

eero-t avatar Dec 07 '22 17:12 eero-t

Additional month passed. Any comments?

is it still WIP?

It's waiting for you (I mean reviewers) to comment on which of the two implemented alternatives you'd rather have; it being a GPU plugin overlay, or its own "gpu_fakedev" kustomize directory?

I can then drop the WIP prefix and other alternative.

Oh OK, I've not looked at it at all since it says WIP :-)

mythi avatar Dec 07 '22 17:12 mythi

+1 for the overlay base approach

@mythi Ok, I split the changes for a new gpu_fakedev kustomize dir to a separate branch: https://github.com/eero-t/intel-device-plugins-for-kubernetes/commits/kustomize-gpu_fakedev

Merged the remaining commits to 2 and rebased it to latest main.

Except for the dropped large commit (+ a typo fix), content is same as earlier.

EDIT:

Pros for standalone daemonset:

  • Easy to specify init container order
  • Easy to rename existing things

Pros for kustomize overlay:

  • Automatically tracks GPU plugin deployment changes
  • Easier to take advantage of other GPU plugin overlays

eero-t avatar Dec 12 '22 14:12 eero-t

Any comments on the nfd-source-hooks -> nfd-features name change, would you e.g. rather have it as a separate PR?

eero-t avatar Dec 12 '22 14:12 eero-t

Any comments on the nfd-source-hooks -> nfd-features name change, would you e.g. rather have it as a separate PR?

It's OK like this.

mythi avatar Dec 12 '22 16:12 mythi

Rebased this to main to drop merged https://github.com/intel/intel-device-plugins-for-kubernetes/pull/1327.

I will do more testing on this & remove WIP after "intel/intel-gpu-fakedev" image referred in kustomize file is available.

eero-t avatar Feb 13 '23 14:02 eero-t

I will do more testing on this & remove WIP after "intel/intel-gpu-fakedev" image referred in kustomize file is available.

do we need to wait for the image?

mythi avatar Apr 26 '23 12:04 mythi

do we need to wait for the image?

IMHO there's not much point in merging kustomize files that rely on non-existing image. Users are unlikely to do their own builds and kustomize the kustomize in this PR to point to their own registry...

PS. If this waits long enough, NFD will drop support for hooks, and GPU plugin needs to adapt to that. That would simplify this PR. Whereas if NFD moves further, from feature files to CRDs, more changes may be needed.

eero-t avatar Apr 26 '23 14:04 eero-t

IMHO there's not much point in merging kustomize files that rely on non-existing image.

We have several deployments in the repo without having the images available so this would not be an exception. I don't mind keeping this open but I think it's OK to merge as well.

mythi avatar Apr 26 '23 18:04 mythi

PS. If this waits long enough, NFD will drop support for hooks, and GPU plugin needs to adapt to that. That would simplify this PR. Whereas if NFD moves further, from feature files to CRDs, more changes may be needed.

NFD & GPU plugin have moved from hooks to feature files, so this needs to be updated. I'm not sure when I'll have time for that though.

eero-t avatar Jan 02 '24 17:01 eero-t