Spike: NVIDIA GPU Operator Zarf package
LFAI delivery requires a production-ready NVIDIA GPU Operator Zarf package that bootstraps containerized NVIDIA CUDA drivers, the NVIDIA Container Toolkit, Node Feature Discovery, and the NVIDIA device plugin, enabling generative AI and ML applications to use NVIDIA GPUs in a Kubernetes cluster.
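As a starting point, a minimal Zarf package definition for the GPU operator might look like the sketch below. The chart version, image tags, and values file name are placeholders, not verified values; the real image list should be derived from the chart's rendered manifests so Zarf can bundle every image for the air gap.

```yaml
# zarf.yaml -- sketch only; versions and image tags are assumptions
kind: ZarfPackageConfig
metadata:
  name: nvidia-gpu-operator
  description: Air-gappable NVIDIA GPU Operator deployment

components:
  - name: gpu-operator
    required: true
    charts:
      - name: gpu-operator
        url: https://helm.ngc.nvidia.com/nvidia   # official NVIDIA Helm repo
        version: v23.9.1                          # example version; pin as needed
        namespace: gpu-operator
        valuesFiles:
          - values.yaml                           # deploy-time overrides
    images:
      # Illustrative, not exhaustive -- enumerate every image the chart
      # references so Zarf can package them for offline installs.
      - nvcr.io/nvidia/gpu-operator:v23.9.1
      - nvcr.io/nvidia/cuda:12.2.0-base-ubi8
```

Deploy-time configurability (the second checklist item) could then be handled by templating `values.yaml` with Zarf variables.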
- [ ] How do I prepare an air-gappable Zarf package that contains the NVIDIA GPU operator?
- [ ] How do I set up the NVIDIA GPU operator to be configurable at deploy time?
- [ ] Multi-instance GPU (MIG, logical partitioning of GPU resources)?
- [ ] Time slicing (sharing a single GPU across multiple workloads)?
- [ ] Distributed node resource load-balancing configuration?
- [ ] How and where do I consistently test this on K3D to make sure it works?
- [ ] How and where do I consistently test this on RKE2 to make sure it works?
- [ ] How do I integrate this back into the LFAI infrastructure UDS bundle in issue #317?
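For the time-slicing item above, the GPU operator's device plugin supports a sharing configuration delivered as a ConfigMap and referenced from the operator's configuration. A hedged sketch, where the ConfigMap name, key, and replica count are all illustrative:

```yaml
# Sketch of a time-slicing config; name and replica count are examples.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # advertise each physical GPU as 4 schedulable GPUs
```

The operator would then be pointed at this config (e.g. a Helm value such as `devicePlugin.config.name=time-slicing-config`), which is a natural candidate for a Zarf deploy-time variable.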
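For the MIG item, the operator exposes a MIG strategy through its Helm values, with per-node partitioning selected by node label. This fragment is a sketch under those assumptions and only applies to MIG-capable GPUs (e.g. A100/H100):

```yaml
# Example Helm values fragment for MIG; strategy names are from NVIDIA docs,
# but verify against the chart version actually packaged.
mig:
  strategy: single   # or "mixed" to expose heterogeneous MIG profiles
migManager:
  enabled: true
```

Individual nodes would then be labeled with a desired profile, e.g. `nvidia.com/mig.config=all-1g.10gb` (example profile name), and the MIG manager reconfigures the GPUs accordingly.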
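For the K3D and RKE2 test items, one repeatable smoke test is a pod that requests a GPU and runs `nvidia-smi`; if the driver, toolkit, and device plugin are all healthy, the pod completes and logs the GPU inventory. The image tag here is an assumption:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # example CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # schedules only where the device plugin advertises GPUs
```

Applying this with `kubectl apply -f` and checking `kubectl logs gpu-smoke-test` gives a consistent pass/fail check that could run in CI against both distributions.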
See additional NVIDIA GPU operator context here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html
Related Defense Unicorns resources:
- https://github.com/justinthelaw/k3d-gpu-support
- https://github.com/defenseunicorns/uds-prod-infrastructure
- https://github.com/defenseunicorns/zarf-package-k3d-airgap
Commenting for personal tracking: part of this spike should involve evaluating whether to create and publish our own version of this repo/container from our org.
This will be tracked via the following PR: https://github.com/justinthelaw/uds-rke2/pull/39
The PR in the previous comment is the tracking PR tied to a Delivery issue.