mpich Memory growth with GPU-aware MPICH on Intel PVC GPUs

Our application XGC has conditional coding for GPU-aware MPI, which has been working correctly on some systems such as Perlmutter with NVIDIA A100 GPUs (cray-mpich/8.1.28) and Frontier with AMD MI250X GPUs (cray-mpich).

Testing this on the Sunspot testbed at Argonne, which uses Intel PVC GPUs (Aurora MPICH: mpich/icc-all-pmix-gpu/52.2), I observe uncontrolled memory growth that appears to stem from an MPI_Alltoallv() with large message sizes (O(GB)). The Aurora MPICH developers at Intel asked me to create a ticket here and provide them with the ticket number.

The output below shows memory usage queries at several time steps in the test run. GPU memory usage grows at every sampled step until the run fails to allocate GPU memory:

Step 1:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 47.39/47.39/47.39GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 11.94/11.94/11.94GB (64.00GB total available), min=0, max=1
…
Step 5:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.27/56.27/56.27GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 35.79/35.79/35.79GB (64.00GB total available), min=0, max=1
…
Step 10:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.24/56.24/56.24GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 60.71/60.71/60.71GB (64.00GB total available), min=0, max=1
…
x1921c0s6b0n0.hostmgmt2000.cm.americas.sgi.com 1: terminate called after throwing an instance of 'std::runtime_error'
  what():  Kokkos failed to allocate memory for label "sendbuf".  Allocation using MemorySpace named "SYCLDeviceUSM" failed with the following error:  Allocation of size 2.067 G failed because of an unknown error.  (The allocation mechanism was sycl::malloc_device().)
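
To put a number on the growth, here is a minimal sketch that parses the GPU lines quoted above and computes the average increase per time step between samples. It is only an illustration: the helper names (`parse_gpu_usage`, `growth_per_step`) are hypothetical and not part of XGC or MPICH, and the regular expressions assume the log lines have exactly the format shown in this post.

```python
import re

# GPU lines copied from the log excerpt above (CPU lines omitted).
LOG = """\
Step 1:
GPU memory usage at the beginning of time step: Min/Avg/Max used = 11.94/11.94/11.94GB (64.00GB total available), min=0, max=1
Step 5:
GPU memory usage at the beginning of time step: Min/Avg/Max used = 35.79/35.79/35.79GB (64.00GB total available), min=0, max=1
Step 10:
GPU memory usage at the beginning of time step: Min/Avg/Max used = 60.71/60.71/60.71GB (64.00GB total available), min=0, max=1
"""

# Assumed line formats, matching the excerpt only.
STEP_RE = re.compile(r"^Step (\d+):")
GPU_RE = re.compile(
    r"^GPU memory usage.*used = [\d.]+/[\d.]+/([\d.]+)GB \(([\d.]+)GB total"
)


def parse_gpu_usage(text):
    """Return [(step, max_used_gb, total_gb), ...] in log order."""
    samples, step = [], None
    for line in text.splitlines():
        m = STEP_RE.match(line)
        if m:
            step = int(m.group(1))
            continue
        m = GPU_RE.match(line)
        if m and step is not None:
            samples.append((step, float(m.group(1)), float(m.group(2))))
    return samples


def growth_per_step(samples):
    """Average growth in GB per time step between consecutive samples."""
    return [
        (s1, (u1 - u0) / (s1 - s0))
        for (s0, u0, _), (s1, u1, _) in zip(samples, samples[1:])
    ]


samples = parse_gpu_usage(LOG)
rates = growth_per_step(samples)
for step, rate in rates:
    print(f"up to step {step}: {rate:.2f} GB/step")
```

For the three samples above this gives about 5.96 GB per step between steps 1 and 5, and about 4.98 GB per step between steps 5 and 10, with 3.29 GB of the 64 GB still free at the start of step 10.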

Apr 03 '24 18:04 zippylab

@zippylab do you have a small reproducer you can share that mimics the workload and the described issue? We believe we understand the cause, but we need to be able to validate the solution.

Apr 04 '24 20:04 abrooks98

@abrooks98 I don't have a small reproducer yet. The place where we observed it is pretty deep down in XGC functionality, and it involves a number of template instances as well as Kokkos views of more than one variety, including unmanaged views. Constructing something simple to demonstrate it may take quite a bit of trial and error. I'll start working on it, but meanwhile it may be that @zhenggb72, one of the Intel people I've been working with on this, could help validate the solution using XGC.

Apr 04 '24 22:04 zippylab

@zippylab Alex is working with me on this. We think we have a fix for this issue, and we would like a reproducer to test it with.

Apr 05 '24 01:04 zhenggb72