
Driver build fails on AWS g5g.xlarge

Open martin31821 opened this issue 2 years ago • 4 comments

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 20.04 for EKS (ARM) / ami-09b6385a90c8d3cee
  • Kernel Version: 5.15.0-1041-aws
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd 1.7.2
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS 1.25.11
  • GPU Operator Version: 23.6.0

2. Issue or feature description

On AWS g5g.xlarge (the smallest GPU node), the driver build fails because it runs out of system memory. It might be possible to limit build concurrency to a much lower level so that the build can run with 8GB of memory.

3. Steps to reproduce the issue

  1. Create an EKS cluster and set up the GPU Operator
  2. Spawn a g5g.xlarge node
  3. :boom:

4. Information to attach (optional if deemed irrelevant)


Is there already a way to limit concurrency in the nvcr.io/nvidia/driver container or is that not possible at the moment?

martin31821 avatar Aug 18 '23 04:08 martin31821

After digging a bit deeper, the root cause seems to be in the `nvidia-driver` script, where `_create_driver_package()` runs `make -s -j SYSSRC=/lib/modules/${KERNEL_VERSION}/build nv-linux.o nv-modeset-linux.o > /dev/null`. Because `-j` is given without a number, make places no limit on the number of parallel jobs, so it essentially starts all compile jobs at once and overloads the small node.

I'll try to override the script to limit concurrency here.
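A rough sketch of what such a limit could look like, assuming a memory budget per compile job. The helper name `max_jobs` and the 1 GiB per-job figure are assumptions for illustration, not something taken from the `nvidia-driver` script or measured on this node:

```sh
#!/bin/sh
# Hypothetical helper (not part of the nvidia-driver script):
# choose a make job count that fits in the available memory.
#   $1 = number of CPUs
#   $2 = total memory in kB (as reported by MemTotal in /proc/meminfo)
#   $3 = assumed memory per compile job in kB (default 1 GiB, a guess)
max_jobs() {
    cpus=$1
    mem_kb=$2
    per_job_kb=${3:-1048576}

    # How many jobs fit in memory under the assumed per-job budget.
    jobs=$((mem_kb / per_job_kb))

    # Never run more jobs than there are CPUs.
    if [ "$jobs" -gt "$cpus" ]; then
        jobs=$cpus
    fi

    # Always allow at least one job so the build can make progress.
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi

    echo "$jobs"
}

# Example use (not run here):
#   jobs=$(max_jobs "$(nproc)" "$(awk '/MemTotal/ {print $2}' /proc/meminfo)")
#   make -s -j "$jobs" SYSSRC=/lib/modules/${KERNEL_VERSION}/build nv-linux.o nv-modeset-linux.o > /dev/null
```

The point is only that `-j` gets an explicit number; the right per-job budget would have to be found by experiment.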

martin31821 avatar Aug 18 '23 04:08 martin31821

Thanks for reporting this, @martin31821. We will look into making the maximum number of threads configurable for low-memory systems.

shivamerla avatar Aug 29 '23 06:08 shivamerla

@shivamerla I created a ~~draft~~ PR in the driver container images project as a start: https://gitlab.com/nvidia/container-images/driver/-/merge_requests/285

I'm not sure whether we also want to update this operator so that it reacts automatically to NFD data when generating the driver spec, for example by detecting cases where GPU cores outweigh the available memory in GB and passing in a computed --max-threads argument, or whether we should just leave it to the user-managed driver CRD. I'm happy to help out further at this level as well, but thought this would be a good checkpoint to discuss.
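To make the idea concrete, here is a minimal sketch of such a decision rule. Everything in it is an assumption for discussion: the function name `max_threads_arg` is hypothetical, the inputs stand in for values that would come from NFD, and capping the thread count at the memory size in GiB is just one possible heuristic:

```sh
#!/bin/sh
# Hypothetical decision rule (not operator code): print a --max-threads
# argument only when the core count exceeds the memory size in GiB.
#   $1 = core count (assumed to come from NFD data)
#   $2 = available memory in GiB (assumed to come from NFD data)
max_threads_arg() {
    cores=$1
    mem_gib=$2

    if [ "$cores" -gt "$mem_gib" ]; then
        # Assumed heuristic: roughly one thread per GiB of memory.
        threads=$mem_gib
        if [ "$threads" -lt 1 ]; then
            threads=1
        fi
        echo "--max-threads $threads"
    fi
    # Otherwise print nothing and leave the driver's default in place.
}
```

For example, `max_threads_arg 16 8` prints `--max-threads 8`, while `max_threads_arg 4 8` prints nothing. Whether this ratio is the right trigger is exactly the open question.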

rockholla avatar Nov 27 '23 20:11 rockholla

Updated PR after the relevant repo moved to GitHub: https://github.com/NVIDIA/GPU-Driver-Container/pull/6

rockholla avatar Mar 05 '24 20:03 rockholla