mpl/gpu/ze: add ipc cache limit enforcement and replacement policy
Pull Request Description
The IPC cache in ZE is currently unbounded in size, which can result in unbounded growth in an application's memory usage even if the application explicitly frees GPU memory via zeMemFree. Since entries in the IPC cache are never removed, references to sender-side GPU memory persist, so the driver may not be able to free the memory.
This PR adds the ability to define a maximum size for the IPC cache in ZE, configurable via the environment variable MPIR_CVAR_CH4_IPC_GPU_MAX_ENTRIES. A least-recently-used policy determines which entry in the cache is evicted once the maximum size is reached. Note that each rank has a cache for each remote device, so the max entries value applies to each of these caches.
This PR also makes a number of fixes to ensure all references related to the IPC mechanisms are properly cleaned up.
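For illustration, here is a minimal sketch of the least-recently-used idea using uthash (which MPL already bundles). The struct, function, and constant names (ipc_cache_entry_t, cache_lookup_or_insert, map_ipc_handle, MAX_ENTRIES) are hypothetical and not the identifiers used in mpl_gpu_ze.c:

```c
#include <stdlib.h>
#include "uthash.h"

#define MAX_ENTRIES 128   /* stand-in for the MPIR_CVAR_CH4_IPC_GPU_MAX_ENTRIES value */

typedef struct ipc_cache_entry {
    void *remote_base;     /* key: base address of the remote GPU buffer */
    void *mapped_ptr;      /* cached local mapping of the opened IPC handle */
    UT_hash_handle hh;     /* makes this struct usable with uthash */
} ipc_cache_entry_t;

/* One such cache would exist per rank per remote device. */
static ipc_cache_entry_t *cache = NULL;

static void *cache_lookup_or_insert(void *remote_base, void *(*map_ipc_handle)(void *))
{
    ipc_cache_entry_t *entry = NULL;
    HASH_FIND_PTR(cache, &remote_base, entry);
    if (entry) {
        /* Re-insert on every hit so uthash's insertion order tracks recency. */
        HASH_DEL(cache, entry);
        HASH_ADD_PTR(cache, remote_base, entry);
        return entry->mapped_ptr;
    }

    entry = calloc(1, sizeof(*entry));
    if (!entry)
        return NULL;
    entry->remote_base = remote_base;
    entry->mapped_ptr = map_ipc_handle(remote_base);   /* open/map the IPC handle */
    HASH_ADD_PTR(cache, remote_base, entry);

    if (HASH_COUNT(cache) > MAX_ENTRIES) {
        /* The head of the table is the least recently (re)inserted entry. */
        ipc_cache_entry_t *victim = cache;
        HASH_DEL(cache, victim);
        /* Real code would close the IPC handle here so the driver can
         * release the sender-side GPU memory. */
        free(victim);
    }
    return entry->mapped_ptr;
}
```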
Author Checklist
- [x] Provide Description
  Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice
  Commits are self-contained and do not do two things at once.
  Commit message is of the form: module: short description
  Commit message explains what's in the commit.
- [ ] Passes All Tests
  Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement
  For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
Hello, I was trying to build this based off the instructions for Aurora here: https://github.com/pmodels/mpich/wiki/Using-MPICH-on-Aurora@ALCF since we were thinking it fixed a memory leak we saw and we wanted to test it out. However, when I tried to build it, it failed to build with the error:
> make
make all-recursive
make[1]: Entering directory '/lus/flare/projects/Aurora_deployment/bertoni/projects/p01.mpich/mpich'
Making all in src/mpl
make[2]: Entering directory '/lus/flare/projects/Aurora_deployment/bertoni/projects/p01.mpich/mpich/src/mpl'
CC src/gpu/mpl_gpu_ze.lo
src/gpu/mpl_gpu_ze.c:1471:84: error: too few arguments to function call, expected 3, have 2
1471 | (MPL_ze_ipc_lru_entry_t *) MPL_calloc(1, sizeof(MPL_ze_ipc_lru_entry_t));
| ~~~~~~~~~~ ^
./include/mpl_trmem.h:456:21: note: 'MPL_calloc' declared here
456 | static inline void *MPL_calloc(size_t nmemb, size_t size, MPL_memory_class memclass)
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/gpu/mpl_gpu_ze.c:1485:80: error: too few arguments to function call, expected 3, have 2
1485 | (MPL_ze_ipc_lru_map_t *) MPL_calloc(1, sizeof(MPL_ze_ipc_lru_map_t));
| ~~~~~~~~~~ ^
From what I see, the MPL_calloc definition takes an MPL_memory_class memclass argument that these calls aren't passing, so I just added a , MPL_MEM_OTHER on the end. But then there was another error:
src/gpu/mpl_gpu_ze.c:1512:19: error: no member named 'hh' in 'struct MPL_ze_ipc_lru_entry_t'
1512 | HASH_FIND(hh, ipc_lru_map, &ipc_buf, sizeof(void *), map_entry);
| ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/uthash.h:138:27: note: expanded from macro 'HASH_FIND'
138 | HASH_FIND_BYHASHVALUE(hh, head, keyptr, keylen, _hf_hashv, out); \
| ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/uthash.h:127:40: note: expanded from macro 'HASH_FIND_BYHASHVALUE'
127 | HASH_FIND_IN_BKT((head)->hh.tbl, hh, (head)->hh.tbl->buckets[ _hf_bkt ], keyptr, keylen, hashval, out); \
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/uthash.h:726:16: note: expanded from macro 'HASH_FIND_IN_BKT'
726 | if ((out)->hh.hashv == (hashval) && (out)->hh.keylen == (keylen_in)) { \
| ~~~~~ ^
This is unexpected since the inputs to HASH_FIND in the code above are MPL_ze_ipc_lru_map_t and there is a hh in that struct. https://github.com/abrooks98/mpich/blob/96e851ee5c2eaab82a975b1bd918d8eee517280f/src/mpl/src/gpu/mpl_gpu_ze.c#L172 . Did you see compile errors like this as well, or is there some other way that I should compile this branch? We'd love to test it, so any advice is appreciated!
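For context, uthash requires both the head pointer and the output variable passed to HASH_FIND to be declared with a struct type that contains the named UT_hash_handle member. A self-contained illustration with hypothetical names (not the real MPL structs):

```c
#include <stddef.h>
#include "uthash.h"

typedef struct example_map_entry {
    void *ipc_buf;             /* key: the buffer address itself */
    int payload;
    UT_hash_handle hh;         /* the handle that HASH_FIND(hh, ...) names */
} example_map_entry_t;

/* The head pointer must use the same struct type that carries 'hh'. */
static example_map_entry_t *example_map = NULL;

static void example_insert(example_map_entry_t *e)
{
    /* key field name, key length in bytes */
    HASH_ADD(hh, example_map, ipc_buf, sizeof(void *), e);
}

static example_map_entry_t *example_lookup(void *ipc_buf)
{
    example_map_entry_t *found = NULL;   /* output must also carry 'hh' */
    HASH_FIND(hh, example_map, &ipc_buf, sizeof(void *), found);
    return found;   /* NULL if the address is not in the map */
}
```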
Hi @colleeneb , thanks for informing me of this issue. I neglected to keep this branch updated with my current development. I've updated it to include the compilation and other fixes.
In your testing, I would also highly recommend including the commit from #7065; we think both are necessary for resolving this memory issue. Simply cherry-pick the single commit on top of these changes.
During testing, you may need to do some tuning at runtime, via MPIR_CVAR_CH4_IPC_GPU_MAX_CACHE_ENTRIES, to control the number of entries in the cache. The default is 128, so you may start small (e.g., 8 or 16) to quickly determine whether these changes solve the issue. Afterward, increasing this value may improve performance depending on the app's behavior. Please let me know if you have any other questions or issues!
Thanks! I'll retry soon!
Thanks! It built this time indeed. For the record, I built it with (and let me know if I missed something):
git clone -b ipc_cache_size [email protected]:abrooks98/mpich.git
cd mpich
git remote add zhenggb72 [email protected]:zhenggb72/mpich.git
git fetch zhenggb72
git cherry-pick 409baeb60540483e952e6c4623d326dca3a88592
git submodule update --init
./autogen.sh
./configure --prefix=$PWD/../install --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx
make -j16 install
I'm waiting for the owners of the application that saw the memory issue to test, so fingers crossed!
@zippylab tested this out (using the build above) and unfortunately saw the same memory growth as he described in https://github.com/pmodels/mpich/issues/6959 . Is there anything missing in the build above, or anything additional we should check to see if it's working as expected? Thanks!
Thanks for letting me know.
I would recommend the following based on the latest testing:
- Try the application with GPU IPC disabled.
  - You will need to make a new build, adding the option --with-ch4-shmmods=none. This will force all communication (even on-node) to go through OFI. If the app still fails with the memory growth pattern in this config, then the issue is unrelated to the changes in this PR and likely not caused by MPICH.
- If the app does not fail in the previous step, I would then suggest switching back to your previous build and monitoring the number of open file descriptors while the application is running.
  - If these changes are working as intended, the number of FDs should typically stabilize once the cache is full. However, there are many factors that can increase the FD count, so it's not a perfect indicator for these changes.
- If a small reproducer can be provided, I can assist with debugging and determine whether any changes are needed for this PR. Without a reproducer, it's hard for me to say if anything else is missing, as I haven't yet been able to reproduce the behavior in #6959 on my own.
@abrooks98 As discussed with you and Renzo Bustamante, there's no small reproducer for this, and it's far from trivial to construct one because it's an alltoallv() pattern and only shows up when GPU data pointers are used for the message buffers. Do you have an alltoallv() unit test that uses GPU data pointers for the messages? I'm wondering if this functionality has been tested before in Aurora MPICH.
Hi @zippylab, thanks for clarifying. Unfortunately the test suite does not currently support alltoallv (or any vector-based collectives) for GPU buffers; it only supports the non-vector collectives. I will make a point to discuss adding this support with the MPICH team.
In the meantime, I would suggest following steps 1-2 I listed above, and separately I will reach back out to Renzo regarding testing the app available to him on my end. I will use that to validate the cache mechanism behavior and let you and @colleeneb know the status and if any additional changes are necessary.
Note that the code path I'm executing in these most recent tests doesn't actually call MPI_Alltoallv(), but implements it manually using {MPI_Irecv(), MPI_Send(), MPI_Wait()}. There's a compile-time switch to change between the two approaches. So, the memory growth is not specific to the internals of the MPI_Alltoallv() implementation in Aurora MPICH. Earlier, I tried both code paths (actual MPI_Alltoallv() and the manual volley using {MPI_Irecv(), MPI_Send(), MPI_Wait()}) with the default MPICH version, and both exhibited the same memory growth behavior. I'll re-test the actual MPI_Alltoallv() path with Colleen's patched MPICH and re-verify that it's behaving the same way.
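For reference, a hedged sketch of that manual volley pattern ({MPI_Irecv(), MPI_Send(), MPI_Wait()} with alltoallv-style counts and displacements). The function name and signature are illustrative and this is not the actual XGC code path, but the send/recv buffers are exactly where GPU device pointers would be passed:

```c
#include <mpi.h>
#include <stdlib.h>

/* alltoallv expressed as pre-posted MPI_Irecv, blocking MPI_Send, then MPI_Wait.
 * sendbuf/recvbuf may be GPU device allocations, which is what exercises the
 * ZE IPC path for on-node peers. Counts and displacements are in bytes here. */
static void manual_alltoallv(const char *sendbuf, const int *sendcounts, const int *sdispls,
                             char *recvbuf, const int *recvcounts, const int *rdispls,
                             MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    MPI_Request *reqs = (MPI_Request *) malloc(size * sizeof(MPI_Request));

    /* Pre-post all receives so the later blocking sends always have a match. */
    for (int peer = 0; peer < size; peer++)
        MPI_Irecv(recvbuf + rdispls[peer], recvcounts[peer], MPI_CHAR,
                  peer, 0, comm, &reqs[peer]);

    /* Send each segment with a blocking send. */
    for (int peer = 0; peer < size; peer++)
        MPI_Send(sendbuf + sdispls[peer], sendcounts[peer], MPI_CHAR,
                 peer, 0, comm);

    /* Complete all receives. */
    for (int peer = 0; peer < size; peer++)
        MPI_Wait(&reqs[peer], MPI_STATUS_IGNORE);

    free(reqs);
}
```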
Colleen built another custom MPICH today using --with-ch4-shmmods=none (step 1 above). My XGC tests using that build so far have all worked correctly without memory growth. The runs were extremely slow, but I'm not 100% sure that was caused by the MPICH build and not caused by other slowness on Aurora this afternoon.
> If the app does not fail in the previous step, I would then suggest switching back to your previous build and monitoring the number of open file descriptors while the application is running. If these changes are working as intended, the number of FDs should typically stabilize once the cache is full. However, there are many factors that can increase the FD count, so it's not a perfect indicator for these changes.
@abrooks98 Tim monitored the number of open FDs with the draft PR (which from testing still results in the out-of-memory issue) and with the version built with --with-ch4-shmmods=none (which from testing didn't show the out-of-memory issue). For the version built with --with-ch4-shmmods=none, the number of FDs (from summing all lsof -p $pid where $pid is the pid of the application) increased from 2800 to 3050 and then stabilized around 3050 to 3088 for the rest of the application run. For the draft PR version, the number of FDs increased from roughly 2300 to 4800 over the whole run and didn't stabilize. Maybe we didn't hit the cache limit? Is it worth it to try setting MPIR_CVAR_CH4_IPC_GPU_MAX_ENTRIES to something?
Thanks for the updates Colleen and Tim. It is good to confirm that --with-ch4-shmmods=none is working without error. This indicates our solution is on the right track.
> Maybe we didn't hit the cache limit? Is it worth it to try setting MPIR_CVAR_CH4_IPC_GPU_MAX_ENTRIES to something?
Yes, I would recommend setting MPIR_CVAR_CH4_IPC_GPU_MAX_ENTRIES=4 and monitoring the FD count. The default value of this CVAR is 128 (per rank per device), so it could take a long time to fill. At a low value like 4, if everything is properly cleaned up in this PR, the FD count should stabilize as you saw with the --with-ch4-shmmods=none build. If it still does not stabilize, then we likely missed something that is preventing the GPU memory from being freed.
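As a side note on the FD monitoring, here is a small Linux-only helper (not part of MPICH or this PR; the function name is made up) that an application could call periodically to sample its own open-FD count, as an alternative to summing lsof output externally:

```c
#include <dirent.h>

/* Count the calling process's open file descriptors via /proc/self/fd.
 * Returns -1 if /proc is unavailable. The count includes the descriptor
 * used by opendir() itself while scanning. */
static int count_open_fds(void)
{
    DIR *d = opendir("/proc/self/fd");
    if (!d)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        if (entry->d_name[0] != '.')   /* skip "." and ".." */
            count++;
    }
    closedir(d);
    return count;
}
```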
@abrooks98 I reran with export MPIR_CVAR_CH4_IPC_GPU_MAX_ENTRIES=4, and the behavior was the same as without setting this environment variable: the FDs increased monotonically (and the GPU memory consumed grew rapidly).
@zippylab and @colleeneb I just wanted to let you know that I am now able to run the XGC build from Renzo. I am working this week on debugging and fixing this PR. I'll let you know when I have updates so that you may test on your end
@zippylab and @colleeneb I have made some progress today. I just pushed a number of changes which seem to fix the memory issues in some cases. I still have some debugging to do, especially for non-default scenarios, but please feel free to try these changes and let me know if they indeed resolve issues on your end. They will still require you to set MPIR_CVAR_CH4_IPC_GPU_MAX_ENTRIES to a value lower than the default. For reference, with a value of 2 on a single-node, 6-rank test, I saw step 1 use ~7 GB of GPU memory and all subsequent steps use ~9.5 GB, with no further growth in GPU memory usage.
Thanks a lot! We'll test it!
Previously I was cherry-picking 409baeb60540483e952e6c4623d326dca3a88592 from [email protected]:zhenggb72/mpich.git , but it looks like that was merged in -- do you still recommend pulling that in to test, or can we just build this PR with no cherry-picking? Thanks!
Since the other PR has been merged, you can just build this PR now without cherry-picking
I was able to build it, but it looks like if I try to run on multiple nodes (Even 2 nodes with 1 rank each), I get crashes. I put a reproducer below. I don't think it's related to this PR, but maybe something else that got pulled in? It looks like it's failing when doing a hwloc call. This didn't happen when I tried the older version of this PR. Do you have any ideas on how to handle this?
I built with:
rm -rf mpich_jul_29
git clone -b ipc_cache_size [email protected]:abrooks98/mpich.git mpich_jul_29
rm -rf install_jul_29
mkdir install_jul_29
cd mpich_jul_29
git submodule update --init
./autogen.sh
./configure --prefix=$PWD/../install_jul_29 --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx
make -j32 install
Thanks!
Code:
> cat m.cpp
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
// Initialize the MPI environment
MPI_Init(NULL, NULL);
// Finalize the MPI environment.
MPI_Finalize();
}
Compile:
mpicxx -g m.cpp
Run:
> mpirun -n 2 -ppn 1 ./a.out
x4714c1s4b0n0.hostmgmt2714.cm.aurora.alcf.anl.gov: rank 0 died from signal 11 and dumped core
Backtrace from the corefile:
(gdb) bt
bt
#0 0x000014b7589ddfb0 in hwloc_get_obj_by_depth ()
from /opt/aurora/24.086.0/spack/oneapi/0.7.0/install/2024.04.15.002/linux-sles15-x86_64/oneapi-2024.04.15.002/hwloc-2.9.2-gjlnyu7kyuu4stuhafckrsvibq3t77wz/lib/libhwloc.so.15
#1 0x000014b759b4de72 in MPIR_hwtopo_get_num_numa_nodes ()
from /lus/flare/projects/Aurora_deployment/bertoni/projects/p01.mpich/install_jul_29/lib/libmpi.so.0
#2 0x000014b759bc5a6c in MPIDI_OFI_init_multi_nic ()
from /lus/flare/projects/Aurora_deployment/bertoni/projects/p01.mpich/install_jul_29/lib/libmpi.so.0
#3 0x000014b759ba13d2 in MPIDI_OFI_init_local ()
from /lus/flare/projects/Aurora_deployment/bertoni/projects/p01.mpich/install_jul_29/lib/libmpi.so.0
#4 0x000014b759b56c94 in MPID_Init ()
from /lus/flare/projects/Aurora_deployment/bertoni/projects/p01.mpich/install_jul_29/lib/libmpi.so.0
#5 0x000014b759a85d71 in MPII_Init_thread ()
from /lus/flare/projects/Aurora_deployment/bertoni/projects/p01.mpich/install_jul_29/lib/libmpi.so.0
#6 0x000014b759a85ba9 in MPIR_Init_impl ()
from /lus/flare/projects/Aurora_deployment/bertoni/projects/p01.mpich/install_jul_29/lib/libmpi.so.0
#7 0x000014b75975c440 in PMPI_Init ()
from /lus/flare/projects/Aurora_deployment/bertoni/projects/p01.mpich/install_jul_29/lib/libmpi.so.0
#8 0x000000000040186b in main (argc=1, argv=0x7ffd09299458) at m.cpp:6
(gdb) q
I am wondering if the crashes are because the topology file from PMIx lacks the PCI device information needed for the nic assignment logic. We don't currently have a way to disable this at runtime. @abrooks98 maybe you can push a temporary patch for testing that compiles that code out?
@colleeneb I've added two new patches, one that I think will resolve memory issues in other scenarios (I still need to do some more testing on this) and another that I think will resolve the init failure you were seeing. Let me know if you face any other issues!
The init failure was resolved, thanks! But it looks like the GPU memory usage and open FDs are still increasing with this PR (even with MPIR_CVAR_CH4_IPC_GPU_MAX_ENTRIES=2), unfortunately. I built as in the previous comment -- am I missing something in the configure line?
Marking this PR as ready for review.
@zippylab and I have done some testing with XGC and so far see no unbounded memory growth. I also ran the GPU test suite and the results match main. There are some failures in the GPU test suite, but I confirmed they fail the same way on main.
The failures I observed:
- Intra-node
  - bcast device -> shared - validation failure
  - bcast shared -> device - validation failure
  - getfence1 - validation failure
  - putfence1 - validation failure
- Inter-node
  - sendrecv1 w/ pipeline=1 - assertion failure at ofi_events.c:165 (https://github.com/pmodels/mpich/blob/main/src/mpid/ch4/netmod/ofi/ofi_events.c#L165)
  - getfence1 - validation failure
  - putfence1 - validation failure
test:mpich/ch4/most test:mpich/ch3/most
The latest change fixes an issue when the cache is set to disabled. Previous tests will not be affected since this is a non-default case.
ch4 tests: https://jenkins-pmrs.cels.anl.gov/view/mpich-review/job/mpich-review-ch4-ofi/4877/
ch3 tests: https://jenkins-pmrs.cels.anl.gov/view/mpich-review/job/mpich-review-ch3-tcp/2521/
test:mpich/ch4/gpu/ofi ✅
https://jenkins-pmrs.cels.anl.gov/view/mpich-review/job/mpich-review-ch4-gpu-ofi/330/
The timeout in am-only is unrelated.
LGTM
Thanks Hui! Please merge when you are ready