Memory growth with GPU-aware MPICH on Intel PVC GPUs
Our application XGC has conditional code paths for GPU-aware MPI, which have been working correctly on some systems, such as Perlmutter with NVIDIA A100 GPUs (cray-mpich/8.1.28) and Frontier with AMD MI250X GPUs (cray-mpich).
Testing this on the Sunspot testbed at Argonne using Intel PVC GPUs (Aurora MPICH: mpich/icc-all-pmix-gpu/52.2), I observe uncontrolled memory growth, apparently stemming from an MPI_Alltoallv() with large message sizes (O(GB)). The Aurora MPICH developers at Intel asked me to create a ticket here and provide them with the ticket number.
This output shows memory usage queries at various timesteps in the test run, eventually leading to running out of GPU memory:
Step 1:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 47.39/47.39/47.39GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 11.94/11.94/11.94GB (64.00GB total available), min=0, max=1
…
Step 5:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.27/56.27/56.27GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 35.79/35.79/35.79GB (64.00GB total available), min=0, max=1
…
Step 10:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.24/56.24/56.24GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 60.71/60.71/60.71GB (64.00GB total available), min=0, max=1
…
x1921c0s6b0n0.hostmgmt2000.cm.americas.sgi.com 1: terminate called after throwing an instance of 'std::runtime_error'
what(): Kokkos failed to allocate memory for label "sendbuf". Allocation using MemorySpace named "SYCLDeviceUSM" failed with the following error: Allocation of size 2.067 G failed because of an unknown error. (The allocation mechanism was sycl::malloc_device().)
@zippylab do you have a small reproducer you can share that mimics the workload and the described issue? We believe we understand the cause, but we need to be able to validate the solution.
@abrooks98 I don't have a small reproducer yet. The place where we observed it is pretty deep in XGC functionality, and it involves a number of template instances as well as Kokkos views of more than one variety, including unmanaged views. Constructing something simple to demonstrate it may take quite a bit of trial and error. I'll start working on it, but meanwhile @zhenggb72, one of the Intel people I've been working with on this, may be able to help validate the solution using XGC.
@zippylab Alex is working with me on this. We think we have a fix for this issue, and we would like a reproducer to test it with.