[CUDA][ROCm?] Memory leak in IPC cache handling
Background information
When using Open MPI direct GPU communication with CUDA, memory growth is observed throughout the duration of the run.
After tracing the issue using the Open MPI logs (--mca mpi_common_cuda_verbose 10), there appears to be a mismatch between the calls to cuIpcOpenMemHandle and cuIpcCloseMemHandle (cuIpcCloseMemHandle is rarely called).
This behavior seems largely unaffected by various MCA options, as if OPAL were losing track of some allocations.
A similar memory growth pattern is observed on HPE Cray EX235a nodes.
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
$ ompi_info
Package: Debian OpenMPI
Open MPI: 4.1.4
Open MPI repo revision: v4.1.4
Open MPI release date: May 26, 2022
Open RTE: 4.1.4
Open RTE repo revision: v4.1.4
Open RTE release date: May 26, 2022
OPAL: 4.1.4
OPAL repo revision: v4.1.4
OPAL release date: May 26, 2022
MPI API: 3.1.0
Ident string: 4.1.4
Prefix: /usr
Configured architecture: x86_64-pc-linux-gnu
Configure host: hostname
Configured by: username
Configured on: Wed Oct 12 11:52:34 UTC 2022
Configure host: hostname
Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr'
'--includedir=${prefix}/include'
'--mandir=${prefix}/share/man'
'--infodir=${prefix}/share/info'
'--sysconfdir=/etc' '--localstatedir=/var'
'--disable-option-checking'
'--disable-silent-rules'
'--libdir=${prefix}/lib/x86_64-linux-gnu'
'--runstatedir=/run' '--disable-maintainer-mode'
'--disable-dependency-tracking'
'--disable-silent-rules'
'--disable-wrapper-runpath'
'--with-package-string=Debian OpenMPI'
'--with-verbs' '--with-libfabric' '--with-psm'
'--with-psm2' '--with-ucx'
'--with-pmix=/usr/lib/x86_64-linux-gnu/pmix2'
'--with-jdk-dir=/usr/lib/jvm/default-java'
'--enable-mpi-java'
'--enable-opal-btl-usnic-unit-tests'
'--with-libevent=external' '--with-hwloc=external'
'--disable-silent-rules' '--enable-mpi-cxx'
'--enable-ipv6' '--with-devel-headers'
'--with-slurm' '--with-cuda=/usr/lib/cuda'
'--with-sge' '--without-tm'
'--sysconfdir=/etc/openmpi'
'--libdir=${prefix}/lib/x86_64-linux-gnu/openmpi/lib'
'--includedir=${prefix}/lib/x86_64-linux-gnu/openmpi/include'
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
apt package on Debian (kernel 6.1.99-1).
Please describe the system on which you are running
- Operating system/version: Linux dgx 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux
- Computer hardware: NVIDIA DGX workstation, CPU: AMD EPYC 7742 (64 cores), GPUs: 4x NVIDIA A100-SXM4-40GB
- Network type: None
Details of the problem
The memory growth is observed during a section of the code that uses non-blocking direct GPU communications (MPI_Isend, MPI_Irecv) on CUDA memory.
Throughout the duration of a run, the GPU memory usage keeps growing until the application crashes with CUDA_OUT_OF_MEMORY.
Typically, the evolution of the memory usage of the code, when graphed, looks like this:
While tracing the issue I stumbled on an old post on the NVIDIA forum, https://forums.developer.nvidia.com/t/memory-increase-in-gpu-aware-non-blocking-mpi-communications/275634/4, which pointed toward the cuIpc handling within Open MPI.
Running the same test with --mca mpi_common_cuda_verbose 10, I traced the calls to cuIpcOpenMemHandle and cuIpcCloseMemHandle to follow the evolution of the memory usage, which matches the observed memory growth (a minimal counting sketch is shown after the commands below).
I tried running the following test cases:
mpirun --mca mpi_common_cuda_verbose 10 \
    -n 4 <application> \
    2> out10_ompi_default

mpirun --mca mpi_common_cuda_verbose 10 --mca mpool_rgpusm_rcache_empty_cache 1 \
    -n 4 <application> \
    2> out10_ompi_empty_cache

mpirun --mca mpi_common_cuda_verbose 10 --mca mpool_rgpusm_rcache_size_limit 100000 \
    -n 4 <application> \
    2> out10_ompi_szlim100000

mpirun --mca mpi_common_cuda_verbose 10 --mca mpool_rgpusm_rcache_empty_cache 1 --mca mpool_rgpusm_rcache_size_limit 100000 \
    -n 4 <application> \
    2> out10_ompi_empty_cache_szlim100000
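For reference, the open/close calls in one of the out10_* trace files can be counted with a small helper along the lines of the sketch below (illustrative only; the exact verbose log format depends on the Open MPI version, so it simply counts occurrences of the two calls rather than parsing sizes):

// count_ipc_calls.cpp -- rough sketch, compile with e.g. g++ -O2 count_ipc_calls.cpp
// Counts how often cuIpcOpenMemHandle and cuIpcCloseMemHandle appear in a
// verbose Open MPI trace (out10_* file); a large mismatch hints at handles
// that were opened but never closed.
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " <trace file>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::string line;
    long opened = 0, closed = 0;
    while (std::getline(in, line)) {
        if (line.find("cuIpcOpenMemHandle") != std::string::npos)
            ++opened;
        if (line.find("cuIpcCloseMemHandle") != std::string::npos)
            ++closed;
    }
    std::cout << "cuIpcOpenMemHandle:  " << opened << "\n"
              << "cuIpcCloseMemHandle: " << closed << "\n";
    return 0;
}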
If we plot the memory evolution traced from the calls to cuIpcOpenMemHandle and cuIpcCloseMemHandle, we get the following:
Large communications at the beginning of the run are indeed freed correctly; however, smaller communications do not appear to be freed until the call to MPI_Finalize.
Lastly, if we set --mca btl_smcuda_use_cuda_ipc 0, no memory leak is observed, which confirms that the issue lies in the CUDA IPC path.
So far this behavior has been reproduced with:
Open MPI 4.1.4 (Debian package)
ucx-1.15.0.tar.gz + openmpi-4.1.6.tar.gz
ucx-1.16.0.tar.gz + openmpi-4.1.6.tar.gz
ucx-1.17.0.tar.gz + openmpi-4.1.6.tar.gz
I'm also looking for hotfixes, since this issue is likely to impact us on many supercomputers.
CUDA IPC does not allocate memory; instead it maps the memory of another GPU into the local process's address space (for GPU access). Thus, while the memory use seems to increase, it is mostly an accounting quirk: the amount of accessible memory increases, but not the physical memory used.
For performance reasons it is not uncommon to cache the IPC handles so that no call to cuIpcCloseMemHandle is made eagerly. These handles are released either when the owner process frees its GPU-bound memory or at finalize.
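For illustration, here is a minimal sketch of the mechanism being described, using the CUDA runtime API with two MPI ranks on the same node (purely illustrative, not Open MPI's internal code; error checking is omitted and one GPU per rank is assumed):

// ipc_sketch.cpp -- run with: mpirun -n 2 ./ipc_sketch
// Rank 0 allocates device memory and exports an IPC handle; rank 1 opens the
// handle, which maps rank 0's allocation into rank 1's address space without
// allocating new device memory, then closes it again.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank); // assumption: one GPU per rank, both on the same node

    if (rank == 0) {
        void *buf = nullptr;
        cudaMalloc(&buf, 1 << 20);         // exporter owns the physical memory
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, buf); // export a handle to that allocation
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);       // keep buf alive until rank 1 is done
        cudaFree(buf);
    } else {
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        void *mapped = nullptr;
        // Maps rank 0's buffer into this process; no new device memory is
        // allocated here, although per-process accounting tools may still
        // attribute the mapping to this rank.
        cudaIpcOpenMemHandle(&mapped, handle, cudaIpcMemLazyEnablePeerAccess);
        // ... mapped could now be used as the source/target of a cudaMemcpy ...
        cudaIpcCloseMemHandle(mapped);     // unmap once the transfer is done
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}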
I agree that this shouldn't be an issue in principle; however, when I check with nvidia-smi, the actual memory usage grows by an amount similar to the active cuIpc handles, and, oddly enough, disabling IPC removes the leak.
To be precise, the leak was initially observed using nvidia-smi; it just happens to be correlated in amount and location with the cuIpc calls. Disabling IPC in Open MPI resolves the issue, so while I agree that it shouldn't be the case, for some odd reason cuIpc seems to be related to the issue here.
It is also important to note that the memory used for the communication is systematically freed after the communication step; therefore, the memory exposed through cuIpc should be freed and the handles released.
I managed to get somewhat of a reproducer (in SYCL, though that is transparent with respect to CUDA). Here is the end of the output: the per-process memory reported by nvidia-smi is clearly unchanged, yet the device memory usage maxes out and triggers a CUDA_OUT_OF_MEMORY error.
Wed Oct 9 15:39:14 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:01:00.0 Off | 0 |
| N/A 55C P0 90W / 275W | 27496MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB Off | 00000000:47:00.0 Off | 0 |
| N/A 54C P0 90W / 275W | 32518MiB / 40960MiB | 22% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB Off | 00000000:81:00.0 Off | 0 |
| N/A 53C P0 68W / 275W | 37868MiB / 40960MiB | 37% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA DGX Display Off | 00000000:C1:00.0 Off | N/A |
| 35% 46C P8 N/A / 50W | 24MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-40GB Off | 00000000:C2:00.0 Off | 0 |
| N/A 54C P0 88W / 275W | 24636MiB / 40960MiB | 39% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1222329 C ./a.out 414MiB |
| 1 N/A N/A 1222331 C ./a.out 414MiB |
| 2 N/A N/A 1222336 C ./a.out 414MiB |
| 3 N/A N/A 4572 G /usr/lib/xorg/Xorg 20MiB |
| 4 N/A N/A 1222341 C ./a.out 414MiB |
+---------------------------------------------------------------------------------------+
-- alloc
-- alloc
-- alloc
[AdaptiveCpp Error] from /local/tdavidcl/Shamrock/build_acpp_cuda/.env/acpp-git/src/runtime/cuda/cuda_allocator.cpp:31 @ allocate(): cuda_allocator: cudaMalloc() failed (error code = CUDA:2)
Here is the source code:
// ./.env/acpp-installdir/bin/acpp -O3 -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -isystem /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -lmpi test.cpp && mpirun -n 4 ./a.out

#include <sycl/sycl.hpp>
#include <mpi.h>
#include <random>
#include <stdio.h>

#define WSIZE 4

std::vector<sycl::device> get_sycl_device_list() {
    std::vector<sycl::device> devs;
    const auto &Platforms = sycl::platform::get_platforms();
    for (const auto &Platform : Platforms) {
        const auto &Devices = Platform.get_devices();
        for (const auto &Device : Devices) {
            devs.push_back(Device);
        }
    }
    return devs;
}

sycl::queue get_queue(int local_rank) {
    auto d = get_sycl_device_list();
    return sycl::queue{d[local_rank + 1]};
}

void print_nvsmi(int world_rank) {
    if (world_rank == 0) {
        system("nvidia-smi");
    }
}

void wait() { system("sleep 0.1"); }

void run_the_test(int world_rank, int world_size, sycl::queue &q) {
    std::mt19937 gen(111);
    std::uniform_int_distribution<> distrib(1, 500000000);

    for (int tcount = 0; tcount < 1000; tcount++) {
        std::vector<std::array<int, 3>> comm_map = {
            // sender receiver size
            {0, 1, distrib(gen)},
            {0, 2, distrib(gen)},
            {0, 3, distrib(gen)},
            {1, 0, distrib(gen)},
            {1, 2, distrib(gen)},
            {1, 3, distrib(gen)},
            {2, 0, distrib(gen)},
            {2, 1, distrib(gen)},
            {2, 3, distrib(gen)},
            {3, 1, distrib(gen)},
            {3, 2, distrib(gen)},
            {3, 3, distrib(gen)},
        };

        std::vector<char *> send_msgs;
        printf(" -- alloc\n");
        for (int i = 0; i < comm_map.size(); i++) {
            int len = comm_map[i][2];
            send_msgs.push_back(sycl::malloc_device<char>(len, q));
        }

        std::vector<MPI_Request> rqs;
        printf(" -- comm\n");
        for (int i = 0; i < comm_map.size(); i++) {
            int tag = i;
            int sender = comm_map[i][0];
            int receiver = comm_map[i][1];
            int len = comm_map[i][2];
            char *ptr = send_msgs[i];
            if (sender == world_rank) {
                rqs.push_back(MPI_Request{});
                int rq_index = rqs.size() - 1;
                auto &rq = rqs[rq_index];
                MPI_Isend(ptr, len, MPI_BYTE, receiver, tag, MPI_COMM_WORLD, &rq);
            }
        }
        for (int i = 0; i < comm_map.size(); i++) {
            int tag = i;
            int sender = comm_map[i][0];
            int receiver = comm_map[i][1];
            int len = comm_map[i][2];
            char *ptr = send_msgs[i];
            if (receiver == world_rank) {
                rqs.push_back(MPI_Request{});
                int rq_index = rqs.size() - 1;
                auto &rq = rqs[rq_index];
                MPI_Irecv(ptr, len, MPI_BYTE, sender, tag, MPI_COMM_WORLD, &rq);
            }
        }

        std::vector<MPI_Status> st_lst(rqs.size());
        MPI_Waitall(rqs.size(), rqs.data(), st_lst.data());

        printf(" -- free\n");
        for (int i = 0; i < comm_map.size(); i++) {
            sycl::free(send_msgs[i], q);
        }

        wait();
        print_nvsmi(world_rank);
        wait();
    }
}

int main(void) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    if (WSIZE != world_size) {
        printf("the world size should be 4");
        return 1;
    }

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf(
        "Hello world from processor %s, rank %d out of %d processors\n",
        processor_name,
        world_rank,
        world_size);

    sycl::queue q = get_queue(world_rank);
    run_the_test(world_rank, world_size, q);

    // Finalize the MPI environment.
    MPI_Finalize();
}
It is important to note here that using a fixed size for the communications fixes the issue (the memory usage saturates around +600 MiB after a few loops), whereas using a random size results in continuous growth until the crash.
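Building on that observation, a possible application-side mitigation (untested, and only a sketch) would be to round allocation sizes up to a small set of fixed buckets, so that the allocator, and the IPC cache behind it, keeps reusing the same allocations instead of seeing an unbounded stream of distinct sizes; the round_up_bucket helper below is purely illustrative and not part of the reproducer:

// Hypothetical mitigation sketch: round allocation sizes up to power-of-two
// buckets (with a 1 MiB floor) so that repeated allocations come back with
// recurring sizes and base addresses.
#include <cstddef>

inline std::size_t round_up_bucket(std::size_t n) {
    std::size_t bucket = std::size_t{1} << 20; // 1 MiB minimum bucket
    while (bucket < n)
        bucket <<= 1;
    return bucket;
}

// In the reproducer the allocation would then become, for example:
//   send_msgs.push_back(sycl::malloc_device<char>(round_up_bucket(len), q));
// while the MPI_Isend/MPI_Irecv calls keep using the original `len`.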
@tdavidcl thanks for reporting.
I'm seeing the exact same problem with my own application, which uses the same kind of communication pattern (MPI_Isend/MPI_Irecv/MPI_Waitall with CUDA-aware GPU-to-GPU communications): GPU memory slowly grows until a CUDA_OUT_OF_MEMORY error.
Using NVIDIA's compute-sanitizer with the options --check-device-heap yes --show-backtrace yes --leak-check full, we can get the full call stack, e.g.:
========= Leaked 31,850,624 bytes at 0x14a930000000
========= Saved host backtrace up to driver entry point at allocation time
========= Host Frame: [0x292640]
========= in /lib64/libcuda.so.1
========= Host Frame:uct_cuda_ipc_map_memhandle in cuda_ipc/cuda_ipc_cache.c:278 [0xd1fd]
========= in /ccc/products/ucx-1.15.0/system/default/lib/ucx/libuct_cuda.so.0
========= Host Frame:uct_cuda_ipc_ep_get_zcopy in cuda_ipc/cuda_ipc_ep.c:146 [0xbaeb]
========= in /ccc/products/ucx-1.15.0/system/default/lib/ucx/libuct_cuda.so.0
========= Host Frame:ucp_rndv_progress_rma_zcopy_common in rndv/rndv.c:586 [0x2bd5ed]
========= in
========= Host Frame:ucp_rndv_req_send_rma_get in rndv/rndv.c:954 [0x2be4bb]
========= in
========= Host Frame:ucp_rndv_receive in rndv/rndv.c:1684 [0x2c09ac]
========= in
========= Host Frame:ucp_tag_rndv_process_rts in tag/tag_rndv.c:46 [0x2d4e6c]
========= in
========= Host Frame:uct_mm_iface_progress in sm/mm/base/mm_iface.c:379 [0x184bb]
========= in /ccc/products/ucx-1.15.0/system/default/lib/libuct.so.0
========= Host Frame:ucp_worker_progress in core/ucp_worker.c:2889 [0xffffffffbb1ff709]
========= in
========= Host Frame:opal_progress in runtime/opal_progress.c:231 [0x2b3cb]
========= in /ccc/products/openmpi-4.1.6.15/gcc--11.1.0/default/lib/libopen-pal.so.40
========= Host Frame:ompi_request_default_wait_all in request/req_wait.c:234 [0x53e94]
========= in /ccc/products/openmpi-4.1.6.15/gcc--11.1.0/default/lib/libmpi.so.40
========= Host Frame:PMPI_Waitall in /tmp/openmpi/4.1.6.15/GCCcore-11.1.0cuda-12.2/openmpi-4.1.6.15.0/ompi/mpi/c/profile/pwaitall.c:80 [0x9553e]
========= in /ccc/products/openmpi-4.1.6.15/gcc--11.1.0/default/lib/libmpi.so.40
========= Host Frame:kanop::MeshGhostsExchanger<3, double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::do_mpi_send_recv_inplace(kanop::DataArrayGhostedBlock<3, double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >) in /ccc/dsku/nfs-server/user/cont001/ocre/kestenerp/WORKDIR/install/github/kanop/src/shared/MeshGhostsExchanger.h:746 [0x1ac912]
Disabling CUDA IPC, as suggested above, makes the leak go away.
Any update on this issue? It does affect production quite significantly...
I suggest trying the HPC-X MPI bundled with the NVIDIA compilers; I don't see the memory issue when using that one.