Out of memory issue on Polaris due to CUDA pinned memory
Describe the bug
Runs stopped with the following error:
cudaAssert: cudaErrorMemoryAllocation out of memory, file /home/yeluo/opt/qmcpack/src/Platforms/CUDA/MemManageCUDA.hpp, line 74
The failure comes from a call to cudaHostRegister. However, host memory usage is well below the available DDR capacity.
To Reproduce
Steps to reproduce the behavior:
- any code release with DiracDeterminantBatched
- NiO performance benchmark a64 with 2048 walkers per rank.
- Running 2-4 MPI ranks per node fails; 1 MPI rank works.
- Each MPI rank sees all 4 GPUs.
Expected behavior
The simulation should run with 1-4 ranks per node.
System: ALCF Polaris
Additional context
I injected counters to track the peak number of registered host memory (pinned memory) segments per rank:
- 1 MPI rank: run completed, with a peak of ~34k
- 2 MPI ranks: hit the error at a peak of ~32k per rank
- 3 MPI ranks: hit the error at a peak of ~21k per rank
- 4 MPI ranks: hit the error at a peak of ~16k per rank
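For reference, the counting was done with ad-hoc instrumentation around the registration calls. A minimal sketch of the idea (the wrapper and counter names here are hypothetical, not the actual QMCPACK code):

```cpp
// Hypothetical sketch: count live cudaHostRegister'd segments and track the peak.
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

static std::atomic<long> live_segments{0};
static std::atomic<long> peak_segments{0};

inline cudaError_t countedHostRegister(void* ptr, size_t bytes, unsigned int flags)
{
  cudaError_t err = cudaHostRegister(ptr, bytes, flags);
  if (err == cudaSuccess)
  {
    long now  = ++live_segments;
    long peak = peak_segments.load();
    while (now > peak && !peak_segments.compare_exchange_weak(peak, now)) {}
  }
  else
    std::fprintf(stderr, "cudaHostRegister failed with %ld segments live: %s\n",
                 live_segments.load(), cudaGetErrorString(err));
  return err;
}

inline cudaError_t countedHostUnregister(void* ptr)
{
  cudaError_t err = cudaHostUnregister(ptr);
  if (err == cudaSuccess)
    --live_segments;
  return err;
}
```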
Observations:
a. There is a cap around the magic number 65536 (note that 2x32k, 3x21k, and 4x16k all total close to 64k). My guess is vm.max_map_count, whose default is 65530.
b. It seems MPI (Cray MPICH) related, likely due to the notorious XPMEM.
c. Workaround: exposing only one GPU per rank (e.g. setting CUDA_VISIBLE_DEVICES to a single device) made all cases run.
Long-term solution from our side: we need to do bulk allocation/registration and hand out views, instead of allocating and registering per walker.
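As a rough illustration of that direction (a sketch only; the class and method names are hypothetical, not the planned QMCPACK design): pin one large host buffer shared by all walkers and hand each walker a view into it, so the number of pinned regions stays constant regardless of the walker count.

```cpp
// Sketch: one pinned allocation shared by all walkers, each walker gets a view.
// Names (WalkerPinnedPool, view) are hypothetical.
#include <cstddef>
#include <stdexcept>
#include <cuda_runtime.h>

class WalkerPinnedPool
{
  void*  base_ = nullptr;
  size_t bytes_per_walker_;

public:
  WalkerPinnedPool(size_t bytes_per_walker, size_t num_walkers)
      : bytes_per_walker_(bytes_per_walker)
  {
    // A single pinned allocation covers all walkers: one mapping instead of
    // one mapping per walker. (Equivalently, allocate once and cudaHostRegister once.)
    if (cudaHostAlloc(&base_, bytes_per_walker * num_walkers, cudaHostAllocDefault) != cudaSuccess)
      throw std::runtime_error("cudaHostAlloc failed");
  }

  ~WalkerPinnedPool() { cudaFreeHost(base_); }

  // Per-walker view into the shared pinned buffer; no extra registration.
  char* view(size_t walker) const { return static_cast<char*>(base_) + walker * bytes_per_walker_; }
};
```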
Before doing any work on this, I would like to understand how significant an issue it is for actual science runs. Then we can discuss and assign a priority, decide on triage vs. redesign, etc. We have known for some years that having this large a number of pinned regions is not ideal.
The gauging parameter is walker_per_rank, since the number of maps is proportional to it.
The 1-4 MPI rank tests above have 2048 walkers per rank. Only small problem sizes can sustain such a high walker count, and the workaround mentioned above is usually sufficient to unblock production runs. Thus there is no need to assign this issue high priority. I am filing it to raise awareness and document the workaround.
Long term, we can arrange pinned memory better, and we already have the multi-walker resource infrastructure. New features should be developed with proper handling of pinned memory. Old features can be fixed when we revisit the affected areas. Getting rid of the legacy drivers definitely makes the fixes simpler, since the legacy drivers naturally don't follow the batched driver design.
Update: I can reproduce the issue without MPI. I built qmcpack without MPI and launched it 4 times manually:
NUM=0; CUDA_VISIBLE_DEVICES=0,1,2,3 numactl -N $NUM $my_path/qmcpack --enable-timers=fine $file_prefix.xml >& $NUM.log &
NUM=1; CUDA_VISIBLE_DEVICES=1,2,3,0 numactl -N $NUM $my_path/qmcpack --enable-timers=fine $file_prefix.xml >& $NUM.log &
NUM=2; CUDA_VISIBLE_DEVICES=2,3,0,1 numactl -N $NUM $my_path/qmcpack --enable-timers=fine $file_prefix.xml >& $NUM.log &
NUM=3; CUDA_VISIBLE_DEVICES=3,0,1,2 numactl -N $NUM $my_path/qmcpack --enable-timers=fine $file_prefix.xml >& $NUM.log &
So I feel the issue sits between the OS and cudaHostRegister.
FYI: I posted the issue on the NVIDIA forum with a minimal reproducer: https://forums.developer.nvidia.com/t/cudahostregister-returns-cudaerrormemoryallocation-out-of-memory-in-runs-on-a-multi-gpu-node/337793/6?u=xw111luoye
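For context, the failure mode boils down to a loop of small cudaHostRegister calls hitting a cap on the number of registered regions. A sketch along those lines (not necessarily identical to the code posted on the forum):

```cpp
// Sketch of a minimal reproducer: register many small host regions and report
// where cudaHostRegister starts failing.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

int main()
{
  const size_t region_bytes = 4096;   // one page per registration
  const int    max_regions  = 200000; // well above the suspected ~64k cap
  std::vector<void*> regions;

  for (int i = 0; i < max_regions; ++i)
  {
    void* p = std::malloc(region_bytes);
    cudaError_t err = cudaHostRegister(p, region_bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess)
    {
      std::printf("cudaHostRegister failed at region %d: %s\n", i, cudaGetErrorString(err));
      std::free(p);
      break;
    }
    regions.push_back(p);
  }

  for (void* p : regions)
  {
    cudaHostUnregister(p);
    std::free(p);
  }
  return 0;
}
```

Launching several copies of this concurrently (as in the four launches above) mimics the multi-rank failure.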