Driver build fails on AWS g5g.xlarge
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04 for EKS (ARM) / ami-09b6385a90c8d3cee
- Kernel Version: 5.15.0-1041-aws
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd 1.7.2
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.25.11
- GPU Operator Version: 23.6.0
2. Issue or feature description
On AWS g5g.xlarge (the smallest GPU node), the driver build fails because it runs out of system memory. Limiting build concurrency to a much lower level might make it possible to build on 8 GB of memory.
3. Steps to reproduce the issue
- Create an EKS cluster and set up the GPU Operator
- Spawn g5g.xlarge
- :boom:
4. Information to attach (optional if deemed irrelevant)
Is there already a way to limit concurrency in the nvcr.io/nvidia/driver container or is that not possible at the moment?
After digging a bit deeper, the root cause seems to be in the nvidia-driver script, where `_create_driver_package()` runs `make -s -j SYSSRC=/lib/modules/${KERNEL_VERSION}/build nv-linux.o nv-modeset-linux.o > /dev/null`. Because `-j` is given without a job count, make places no limit on parallel jobs, essentially starting all compile jobs at once and overloading the small node.
I'll try to overwrite the script to limit concurrency here.
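A minimal sketch of what such an override could look like, assuming the job count comes from a user-supplied value. The helper name `bounded_jobs` and the `MAX_THREADS` variable in the usage comment are hypothetical and not part of the nvidia-driver script:

```shell
# Hypothetical helper: clamp a requested job count to the number of CPUs.
# Usage: bounded_jobs REQUESTED CPUS
# Falls back to CPUS when REQUESTED is empty, non-numeric, or zero.
bounded_jobs() {
    local requested="$1" cpus="$2"
    case "$requested" in
        ''|*[!0-9]*) echo "$cpus"; return ;;   # unset or not a number
    esac
    if [ "$requested" -eq 0 ]; then
        echo "$cpus"                            # -j0 is not a valid job count
    elif [ "$requested" -lt "$cpus" ]; then
        echo "$requested"
    else
        echo "$cpus"
    fi
}

# Possible use inside the build step (MAX_THREADS is an assumed variable name):
#   make -s -j "$(bounded_jobs "${MAX_THREADS:-}" "$(nproc)")" \
#       SYSSRC=/lib/modules/${KERNEL_VERSION}/build nv-linux.o nv-modeset-linux.o > /dev/null
```

Passing an explicit number to `-j` is what bounds the build; the clamp to the CPU count only avoids asking for more jobs than the node can run in parallel.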
Thanks for reporting this, @martin31821. We will look into making the maximum number of threads configurable for low-memory systems.
@shivamerla created a ~~draft~~ PR at the driver container images project level as a start: https://gitlab.com/nvidia/container-images/driver/-/merge_requests/285.
I'm not sure whether we also want this operator to react automatically to NFD data (for example, when GPU cores outweigh available memory in GB, or something similar) by passing a computed --max-threads argument when it generates the driver spec, or whether to leave this up to the user-managed driver CRD. I'm happy to help out further at this level as well, but thought this would be a good checkpoint to discuss.
Updated PR after the relevant repo moved to GitHub: https://github.com/NVIDIA/GPU-Driver-Container/pull/6
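To make the discussion concrete, here is a rough sketch of one heuristic for deriving a thread count from node memory. The function name `memory_bounded_threads` and the default per-job memory budget of 2048 MiB are assumptions for illustration only; the real per-job memory use of the driver build has not been measured here:

```shell
# Hypothetical heuristic: allow one compile job per PER_JOB_MIB of memory,
# never more than the CPU count and never fewer than one.
# Usage: memory_bounded_threads MEM_MIB CPUS [PER_JOB_MIB]
memory_bounded_threads() {
    local mem_mib="$1" cpus="$2" per_job_mib="${3:-2048}"   # 2048 is an assumed budget
    local by_mem=$(( mem_mib / per_job_mib ))
    if [ "$by_mem" -lt 1 ]; then
        by_mem=1                    # always allow at least one job
    fi
    if [ "$by_mem" -lt "$cpus" ]; then
        echo "$by_mem"
    else
        echo "$cpus"
    fi
}

# Example: a node reporting 8192 MiB and 4 CPUs with a 4096 MiB budget per job
memory_bounded_threads 8192 4 4096
```

The result could then be passed as the --max-threads value, whether it is computed by the operator or set by hand in the driver CRD.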