
GPU Memory Usage and MIG

mcahn opened this issue 8 months ago · 1 comment

Our cluster includes nodes that each have four 80 GB A100 GPUs, as well as older nodes that each have four 32 GB V100 GPUs. We are considering splitting up some of the 80 GB A100s using MIG (Multi-Instance GPU). However, looking at all the jobs run on these nodes this year, about 95% of the jobs used more than 40 GB of GPU memory. (This includes CryoSPARC, RELION, AlphaFold, and some other jobs).

This is surprising, because when we only had 32 GB V100 nodes, no jobs were seen to run out of GPU memory.

Is RELION able to detect how much GPU memory is available, and configure each job accordingly? For example, will it process fewer images at once if only 40 GB is available, and more images at once if 80 GB is available? That would explain why things worked on the V100s.

Thanks, Matthew

mcahn — Apr 25 '25

> For example, will it process fewer images at once if only 40 GB is available, and more images at once if 80 GB is available?

No. The number of images processed in a batch is controlled by the "Number of pooled particles" setting in the GUI (the `--pool` argument); RELION does not adjust it automatically based on available VRAM. If a job runs successfully on a V100, you can safely limit an A100's VRAM to 40 GB. Giving it 80 GB does not make the job any faster (unless you put more MPI processes on the GPU).
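As an illustration (the job paths and parameter values below are hypothetical; `--pool`, `--j`, and `--gpu` are the relevant RELION flags), a Refine3D command line fixes the batching explicitly rather than sizing it to the GPU:

```shell
# Hypothetical Refine3D invocation; input/output paths are placeholders.
# --pool sets how many particles each thread processes as one batch,
# --j sets threads per MPI process, --gpu assigns GPU device IDs.
# The same --pool value is used whether the GPU has 40 GB or 80 GB.
mpirun -n 3 relion_refine_mpi \
    --i Select/job012/particles.star \
    --ref Reference/map.mrc \
    --o Refine3D/job013/run \
    --auto_refine --split_random_halves \
    --pool 30 --j 6 --gpu 0:1
```

Because the batch size is a user choice, peak VRAM use reflects how the job was configured, not how much memory the card happens to offer.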

However, note the pitfall discussed in https://github.com/3dem/relion/issues/1074: you should not partition CPU and GPU resources within a single job running on a node. For example, if a user wants to run 2 MPI processes on an 80 GB GPU, the job scheduler should give the job full access to the whole GPU, rather than pinning one process to one 40 GB half and the other process to the other 40 GB half.
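Concretely (a sketch assuming a Slurm cluster with a `gpu` partition and an `a100` GRES type; names and counts will differ on your site), the job should request one whole GPU shared by both ranks, not one MIG slice per rank:

```shell
#!/bin/bash
# Hypothetical Slurm script: both MPI worker ranks share one whole A100,
# instead of each being pinned to a 40 GB MIG slice (e.g. 3g.40gb).
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1   # one full GPU for the whole job
#SBATCH --ntasks=3          # 1 leader + 2 worker ranks for auto-refine
#SBATCH --cpus-per-task=6

mpirun -n 3 relion_refine_mpi ... --gpu 0   # both workers use device 0
```

Requesting two MIG slices instead would isolate the processes from each other and trigger the problem described in the linked issue.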

biochem-fan — Apr 26 '25