gpu-operator Driver build fails on AWS g5g.xlarge

1. Quick Debug Information

OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04 for EKS (ARM) / ami-09b6385a90c8d3cee
Kernel Version: 5.15.0-1041-aws
Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd 1.7.2
K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.25.11
GPU Operator Version: 23.6.0

2. Issue or feature description

On AWS g5g.xlarge (the smallest GPU node), the driver build fails because it runs out of system memory. Limiting build concurrency to a much lower level might make it possible to run the build on 8 GB of memory.

3. Steps to reproduce the issue

Create an EKS cluster and set up the GPU operator.
Spawn a g5g.xlarge node.
The driver build runs out of memory and fails. :boom:

4. Information to attach (optional if deemed irrelevant)

Is there already a way to limit concurrency in the nvcr.io/nvidia/driver container or is that not possible at the moment?

Aug 18 '23 04:08 martin31821

After digging a bit deeper, the root cause seems to be in the nvidia-driver script, where `_create_driver_package()` runs `make -s -j SYSSRC=/lib/modules/${KERNEL_VERSION}/build nv-linux.o nv-modeset-linux.o > /dev/null`. Because `-j` is given without a job count, make starts all compile jobs at once, which overloads the small node.

I'll try to overwrite the script to limit concurrency here.
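A minimal sketch of what such a cap could look like, assuming a hypothetical helper `compute_max_jobs` and an assumed memory budget of 2 GiB per compile job (neither comes from the nvidia-driver script, and the budget is not a measured figure). It bounds the job count by both CPU count and available memory, then prints the bounded `make` invocation rather than running it:

```shell
#!/bin/sh
# Hypothetical helper (not part of the nvidia-driver script): choose a make
# job count limited by both the CPU count and an assumed per-job memory budget.
# Arguments: total memory in KiB, number of CPUs, memory budget per job in KiB.
compute_max_jobs() {
    mem_total_kb=$1
    cpus=$2
    mem_per_job_kb=$3

    jobs=$((mem_total_kb / mem_per_job_kb))

    # Always allow at least one job.
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi

    # Never run more jobs than there are CPUs.
    if [ "$jobs" -gt "$cpus" ]; then
        jobs=$cpus
    fi

    echo "$jobs"
}

# Example values: 8 GiB of RAM, 4 CPUs, assumed 2 GiB budget per job.
MAX_JOBS=$(compute_max_jobs 8388608 4 2097152)

# Print the bounded invocation instead of running it, since the kernel
# sources and driver tree are only present inside the driver container.
echo "make -s -j ${MAX_JOBS} SYSSRC=/lib/modules/\${KERNEL_VERSION}/build nv-linux.o nv-modeset-linux.o"
```

On a real node the first two arguments could be read from `/proc/meminfo` and `nproc`; the per-job budget would need to be tuned against the actual build.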

Aug 18 '23 04:08 martin31821

Thanks for reporting this, @martin31821. We will look into making the maximum number of threads configurable for low-memory systems.

Aug 29 '23 06:08 shivamerla

@shivamerla I created a ~~draft~~ PR at the driver container images project level as a start: https://gitlab.com/nvidia/container-images/driver/-/merge_requests/285.

I'm not sure whether we also want to update this operator so that it reacts automatically, via NFD data, to cases such as GPU cores outweighing available memory in GB (or something similar) when it generates the driver spec, passing in a computed --max-threads argument, or whether we should just leave it up to the user-managed driver CRD. I'm happy to help out further at this level as well, but thought this would be a good checkpoint to discuss.

Nov 27 '23 20:11 rockholla

Updated PR after the relevant repo moved to GitHub: https://github.com/NVIDIA/GPU-Driver-Container/pull/6

Mar 05 '24 20:03 rockholla