Spike: NVIDIA GPU Operator Zarf package
LFAI delivery requires a production-ready NVIDIA GPU Operator Zarf package that bootstraps containerized NVIDIA CUDA drivers, the NVIDIA Container Toolkit, Node Feature Discovery, and the NVIDIA device plugin, enabling generative AI and ML applications to use NVIDIA GPUs in a Kubernetes cluster.
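As a starting point, a minimal Zarf package definition for the GPU operator might look like the sketch below. The chart version, image tags, and values file name are placeholders, not verified values; the real image list should be derived from the chart's rendered manifests so Zarf can bundle every image for the air gap.

```yaml
# zarf.yaml -- sketch only; versions and image tags are assumptions
kind: ZarfPackageConfig
metadata:
  name: nvidia-gpu-operator
  description: Air-gappable NVIDIA GPU Operator deployment

components:
  - name: gpu-operator
    required: true
    charts:
      - name: gpu-operator
        url: https://helm.ngc.nvidia.com/nvidia   # official NVIDIA Helm repo
        version: v23.9.1                          # example version; pin as needed
        namespace: gpu-operator
        valuesFiles:
          - values.yaml                           # deploy-time overrides
    images:
      # Illustrative, not exhaustive -- enumerate every image the chart
      # references so Zarf can package them for offline installs.
      - nvcr.io/nvidia/gpu-operator:v23.9.1
      - nvcr.io/nvidia/cuda:12.2.0-base-ubi8
```

Deploy-time configurability (the second checklist item) could then be handled by templating `values.yaml` with Zarf variables.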
- [ ] How do I prepare an air-gappable Zarf package that contains the NVIDIA GPU operator?
- [ ] How do I set up the NVIDIA GPU operator to be configurable at deploy time?
- [ ] Multi-instance GPU (MIG, logical partitioning of GPU resources)?
- [ ] Time slicing (sharing a single GPU across multiple workloads)?
- [ ] Distributed node resource load-balancing configuration?
- [ ] How and where do I consistently test this on K3D to make sure it works?
- [ ] How and where do I consistently test this on RKE2 to make sure it works?
- [ ] How do I integrate this back into the LFAI infrastructure UDS bundle in issue #317?
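For the time-slicing item above, the GPU operator's device plugin supports a sharing configuration delivered as a ConfigMap and referenced from the operator's configuration. A hedged sketch, where the ConfigMap name, key, and replica count are all illustrative:

```yaml
# Sketch of a time-slicing config; name and replica count are examples.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # advertise each physical GPU as 4 schedulable GPUs
```

The operator would then be pointed at this config (e.g. a Helm value such as `devicePlugin.config.name=time-slicing-config`), which is a natural candidate for a Zarf deploy-time variable.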
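For the MIG item, the operator exposes a MIG strategy through its Helm values, with per-node partitioning selected by node label. This fragment is a sketch under those assumptions and only applies to MIG-capable GPUs (e.g. A100/H100):

```yaml
# Example Helm values fragment for MIG; strategy names are from NVIDIA docs,
# but verify against the chart version actually packaged.
mig:
  strategy: single   # or "mixed" to expose heterogeneous MIG profiles
migManager:
  enabled: true
```

Individual nodes would then be labeled with a desired profile, e.g. `nvidia.com/mig.config=all-1g.10gb` (example profile name), and the MIG manager reconfigures the GPUs accordingly.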
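For the K3D and RKE2 test items, one repeatable smoke test is a pod that requests a GPU and runs `nvidia-smi`; if the driver, toolkit, and device plugin are all healthy, the pod completes and logs the GPU inventory. The image tag here is an assumption:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # example CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # schedules only where the device plugin advertises GPUs
```

Applying this with `kubectl apply -f` and checking `kubectl logs gpu-smoke-test` gives a consistent pass/fail check that could run in CI against both distributions.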
See additional NVIDIA GPU operator context here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html
Related Defense Unicorns resources:
- https://github.com/justinthelaw/k3d-gpu-support
- https://github.com/defenseunicorns/uds-prod-infrastructure
- https://github.com/defenseunicorns/zarf-package-k3d-airgap
Commenting for personal tracking: part of this spike should involve evaluating whether to create and publish our own version of this repo/container from our org.
This will be tracked via the following PR: https://github.com/justinthelaw/uds-rke2/pull/39
The PR in the previous comment is the tracking PR tied to a Delivery issue.