
Memory leak with ROCM-aware OpenMPI with UCX 1.17.0

[Open] StaticObserver opened this issue 8 months ago · 10 comments

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.5 and v5.0.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Source code from the official webpage https://www.open-mpi.org/

configured with

 ./configure --prefix=/work/home/packages/openmpi/5.0.5 --with-ucx=/work/home/packages/ucx/1.17.0 --with-rocm=/public/software/compiler/dtk-23.10 --with-devel-headers --enable-mpi-fortran=no --enable-mca-no-build=btl-uct --enable-mpi1-compatibility

Please describe the system on which you are running

  • Operating system/version: Rocky Linux 8.8 (Green Obsidian)
  • Computer hardware: AMD GPU architecture gfx906
  • Network type: not sure, InfiniBand?

Details of the problem

Memory leaks whenever there is communication between GPU cards; memory is used up very quickly. I wrote a minimal test program to reproduce the issue.

#include <mpi.h>
#include <Kokkos_Core.hpp>
#include <iostream>
#include <vector>
#include <string>

// To simplify, use the same size for sending and receiving
// Here, assume real_t = double, but could also be float, etc.
using real_t = double;

int main(int argc, char* argv[])
{
    // Initialize MPI
    MPI_Init(&argc, &argv);
    // Initialize Kokkos
    Kokkos::initialize(argc, argv);

    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // For simplicity, assume only rank 0 and rank 1 communicate with each other
        if (size < 2) {
            if (rank == 0) {
                std::cerr << "Please run with at least 2 MPI ranks.\n";
            }
            Kokkos::finalize();
            MPI_Finalize();
            return 0;
        }

        // If you want to control the number of test iterations and array size from the command line,
        // you can read the parameters here.
        // Default iteration count (ITER) and array size (N)
        int ITER = 1000;      // Number of communication loops
        int N    = 1024 * 64; // Array size (number of elements). Increase if you want to observe memory behavior

        if (argc > 1) {
            ITER = std::stoi(argv[1]);
        }
        if (argc > 2) {
            N = std::stoi(argv[2]);
        }

        // Print a hint message
        if (rank == 0) {
            std::cout << "Running test with ITER = " << ITER
                      << ", array size = " << N << "\n"
                      << "Monitor memory usage in another terminal, etc.\n"
                      << std::endl;
        }

        // To distinguish sending destinations and receiving sources:
        // rank 0 -> sends to rank 1, receives from rank 1
        // rank 1 -> sends to rank 0, receives from rank 0
        // Other ranks do not perform actual communication
        int sendRank = (rank == 0) ? 1 : 0;
        int recvRank = (rank == 0) ? 1 : 0;

        // Test loop
        for (int iter = 0; iter < ITER; ++iter) {
            // Allocate a new send buffer (sendBuf) on the GPU.
            // Kokkos::View allocates in the default device memory space
            // (HIP on this ROCm build, or CUDA if Kokkos_ENABLE_CUDA is enabled);
            // for simplicity we do not consider the Layout here.
            Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace>
                sendBuf("sendBuf", N);

            // If we need to receive, allocate a receive buffer (recvBuf)
            Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace>
                recvBuf("recvBuf", N);

            // First, do a simple initialization for sendBuf (parallel loop)
            Kokkos::parallel_for("init_sendBuf", N, KOKKOS_LAMBDA(const int i){
              sendBuf(i) = static_cast<real_t>(rank + i * 0.001);
            });
            // Fence so the parallel_for has finished writing sendBuf before MPI reads it
            Kokkos::fence();

            // MPI communication: rank 0 and rank 1 send and receive data from each other
            // For simplicity, here we use MPI_Sendrecv
            if (rank == 0 || rank == 1) {
                MPI_Sendrecv(
                    sendBuf.data(), // Send data pointer
                    N,              // Number of elements to send
                    MPI_DOUBLE,     // Data type
                    sendRank,       // Destination process
                    1234,           // Send tag
                    recvBuf.data(), // Receive data pointer
                    N,
                    MPI_DOUBLE,
                    recvRank,       // Source process
                    1234,           // Receive tag
                    MPI_COMM_WORLD, // Communicator
                    MPI_STATUS_IGNORE
                );
            }

            // (Optional check) verify successful reception
            // Only rank 0 or rank 1 need to check
            if ((iter % 100 == 0) && (rank == 0 || rank == 1)) {
                // Copy recvBuf to host to check
                auto recvHost = Kokkos::create_mirror_view(recvBuf);
                Kokkos::deep_copy(recvHost, recvBuf);

                // Print some debugging information
                // In practice, if the iteration count is large, it's better not to print frequently
                if (iter % 200 == 0) {
                      std::cout << "[Rank " << rank << "] Iter " << iter << ", recvBuf(0) = " << recvHost(0) << "\n";
                }
            }

            // At the end of the loop, sendBuf and recvBuf are no longer used
            // They will be freed when the braces end (due to Kokkos::View's RAII)
            // However, whether the MPI/GPU driver immediately reclaims the IPC handle requires monitoring
        } // end for(ITER)

        if (rank == 0) {
            std::cout << "Test finished. Check if GPU memory usage grew abnormally.\n";
        }
    }

    Kokkos::finalize();
    MPI_Finalize();

    return 0;
}


I compiled the program with CMake; a CMakeLists.txt could be:

cmake_minimum_required(VERSION 3.10)
project(TestIPCIssue LANGUAGES CXX)

find_package(MPI REQUIRED)

find_package(Kokkos REQUIRED)

add_executable(mpi-test mpi-test.cpp)

target_link_libraries(mpi-test Kokkos::kokkos MPI::MPI_CXX)
target_include_directories(mpi-test PUBLIC ${MPI_CXX_INCLUDE_PATH})

One must have Kokkos installed. If one uses the same buffers (by moving them out of the for loop) instead of creating new buffers before every communication, the issue disappears. I think this is related to issues #12971 and #12849.
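
For reference, this is the variant where the issue disappears: a minimal sketch of the relevant part of the test loop, with the Views hoisted out of the loop and everything else unchanged.

// Buffers allocated once, outside the loop, and reused for every iteration,
// so only one pair of device allocations (and IPC handles) is ever involved
Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace> sendBuf("sendBuf", N);
Kokkos::View<real_t*, Kokkos::DefaultExecutionSpace> recvBuf("recvBuf", N);

for (int iter = 0; iter < ITER; ++iter) {
    Kokkos::parallel_for("init_sendBuf", N, KOKKOS_LAMBDA(const int i){
        sendBuf(i) = static_cast<real_t>(rank + i * 0.001);
    });
    Kokkos::fence();

    if (rank == 0 || rank == 1) {
        MPI_Sendrecv(sendBuf.data(), N, MPI_DOUBLE, sendRank, 1234,
                     recvBuf.data(), N, MPI_DOUBLE, recvRank, 1234,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}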

StaticObserver avatar Mar 31 '25 02:03 StaticObserver

@StaticObserver can you please post the mpirun command line that you used? You mention in the title/description that you see the issue without UCX. Open MPI 5.0.x without UCX does not have a direct intra-node GPU-to-GPU communication path for ROCm devices; it performs a copy through host memory. Hence, in this case it is unlikely to be the same issue as in the other two tickets, where the culprit is most likely the caching of IPC handles.

edgargabriel avatar Mar 31 '25 12:03 edgargabriel

@StaticObserver can you please post the mpirun command line that you used? You mention in the title/description that you see the issue without UCX. Open MPI 5.0.x without UCX does not have a direct intra-node GPU-to-GPU communication path for ROCm devices; it performs a copy through host memory. Hence, in this case it is unlikely to be the same issue as in the other two tickets, where the culprit is most likely the caching of IPC handles.

Sorry, I made a mistake: I did use UCX 1.17.0.

The command was very simple

mpirun -np 2 ./build/mpi-test

StaticObserver avatar Mar 31 '25 16:03 StaticObserver

@StaticObserver thank you for the clarification. At this point, here are the options that we have:

  • UCX master has a new flag that allows you to disable the caching of IPC handles (adding -x UCX_ROCM_IPC_CACHE_IPC_HANDLES=n to your mpirun command line; see the example command after this list). This will be part of the upcoming UCX 1.19 release later this spring. It will probably take care of the memory issue that you are observing, at the cost of a 10x slowdown in the OSU benchmarks. (Depending on how often a buffer is reused for communication operations in your code, the slowdown might not be quite as bad.)
  • You mentioned that if you move the memory allocation out of the loop, the problem goes away. If so, is that an option that could be applied to your code, so that you can take advantage of the IPC handle caching without seeing the memory issues? Basically, this would resolve the issue in your code for now instead of in Open MPI / UCX, since there is no quick and easy fix that we can apply otherwise.
  • We do have a high-level redesign for dealing with IPC memory registration worked out, but due to timing constraints the implementation will probably not happen before later this year. Hence, the earliest UCX version that could contain that code is UCX 1.20 (which I don't think has a release date set yet).
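
For example, with an Open MPI build against UCX master, the command line from above would become:

mpirun -x UCX_ROCM_IPC_CACHE_IPC_HANDLES=n -np 2 ./build/mpi-test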

edgargabriel avatar Mar 31 '25 20:03 edgargabriel

@StaticObserver I can confirm that this is exactly the same as the issue I had (on CUDA) in #12849, especially the fact that moving the buffers out of the loop fixes the issue. When the buffers are inside the loop, a new IPC handle has to be opened for every new buffer, and they are never released. As such, the number of IPC handles in flight grows until you get an out-of-memory error.

This is especially annoying, as in our case the buffer size changes constantly (hence the allocation), which forces us to disable IPC entirely...

tdavidcl avatar Apr 01 '25 19:04 tdavidcl

@StaticObserver thank you for the clarification. At this point, here are the options that we have:

  • UCX master has a new flag that allows you to disable the caching of IPC handles (adding -x UCX_ROCM_IPC_CACHE_IPC_HANDLES=n to your mpirun command line). This will be part of the upcoming UCX 1.19 release later this spring. It will probably take care of the memory issue that you are observing, at the cost of a 10x slowdown in the OSU benchmarks. (Depending on how often a buffer is reused for communication operations in your code, the slowdown might not be quite as bad.)
  • You mentioned that if you move the memory allocation out of the loop, the problem goes away. If so, is that an option that could be applied to your code, so that you can take advantage of the IPC handle caching without seeing the memory issues? Basically, this would resolve the issue in your code for now instead of in Open MPI / UCX, since there is no quick and easy fix that we can apply otherwise.
  • We do have a high-level redesign for dealing with IPC memory registration worked out, but due to timing constraints the implementation will probably not happen before later this year. Hence, the earliest UCX version that could contain that code is UCX 1.20 (which I don't think has a release date set yet).

Thanks for your quick response. I guess I'll have to figure out a way to bypass the IPC handles.

StaticObserver avatar Apr 02 '25 05:04 StaticObserver

@StaticObserver I can confirm that this is exactly the same as the issue I had (on CUDA) in #12849, especially the fact that moving the buffers out of the loop fixes the issue. When the buffers are inside the loop, a new IPC handle has to be opened for every new buffer, and they are never released. As such, the number of IPC handles in flight grows until you get an out-of-memory error.

This is especially annoying, as in our case the buffer size changes constantly (hence the allocation), which forces us to disable IPC entirely...

In our case the buffers also change constantly. What's worse, in the ROCm case there is no switch to turn off the IPC caching...

StaticObserver avatar Apr 02 '25 05:04 StaticObserver

Indeed, in our case we fall back to host-to-host communication because of that.

tdavidcl avatar Apr 02 '25 11:04 tdavidcl

In our case the buffers also change constantly. What's worse, in the ROCm case there is no switch to turn off the IPC caching...

You can compile Open MPI against UCX master; in that case you have a switch to turn off the caching of IPC handles, as mentioned above.

If you are looking for another workaround, you can also disable the rocm_ipc mechanism, in which case data transfers will always go through a host bounce buffer. This can be done with:

mpirun -x UCX_TLS=rocm_copy,sm,ib,tcp,self --mca pml_ucx_tls any -np x ./your_executable

In both cases, however, you will see a performance degradation.

edgargabriel avatar Apr 02 '25 13:04 edgargabriel

I do agree that this fix involves a performance degradation. Still, shouldn't it become the default, given that the current state of things is a leak in perfectly fine code?

Also, in general I don't really understand why freeing an allocation does not trigger a release of the associated IPC handle; this forces the number of handles to creep up until the crash. Shouldn't at least either a hook on free operations or a cap on the total IPC handle size (limiting the sum of the memory sizes behind the handles instead of the number of handles) be implemented?

tdavidcl avatar Apr 04 '25 03:04 tdavidcl

We are still investigating the best path forward. The truth is that this concept seemed to work for many years without issues (and still works for many applications); it looks like it is really a new generation of applications that is hitting this problem, since it suddenly exploded onto the scene and comes from many different directions. And just to clarify: the issue is not the handle itself, which is just a few bytes. The issue is the reference counting done by the runtime on how often a buffer has been mapped into other processes' address spaces; the memory is only actually released once that reference count goes to zero.

In UCX we do intercept the hipFree operation. The challenge is that a process, when it intercepts a free operation, would have to notify every process that has mapped that memory into its own address space via an ipc_open operation (and keeps it mapped for performance reasons) to please unmap the memory region. De facto, you are turning the memory free operation into a non-local operation.
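
To illustrate the shape of the problem, here is a deliberately simplified sketch (this is not UCX code; the names are made up for illustration). The cache of opened mappings lives on the importing side, which is why a hipFree() on the exporting side cannot, by itself, make the device memory go away.

#include <cstddef>
#include <cstdint>
#include <unordered_map>

// NOT UCX code: a toy cache of remote buffers this process has already
// mapped into its own address space, keyed by the exporter's base address.
struct IpcMapping {
    void*       mapped_ptr;  // peer buffer mapped into *our* address space
    std::size_t length;
};

class IpcHandleCache {
public:
    // The first transfer against a remote buffer opens (and caches) a mapping;
    // later transfers just reuse it. This reuse is the performance win.
    IpcMapping& lookup_or_open(std::uintptr_t remote_base, std::size_t length) {
        auto it = cache_.find(remote_base);
        if (it == cache_.end()) {
            // Real code would call hipIpcOpenMemHandle() here.
            it = cache_.emplace(remote_base, IpcMapping{nullptr, length}).first;
        }
        return it->second;
    }

    // Only the importer can run this. When the exporter frees the buffer,
    // it would have to ask every importer to call close(), because the
    // memory is not actually released until all mappings are gone.
    void close(std::uintptr_t remote_base) {
        // Real code would call hipIpcCloseMemHandle() first.
        cache_.erase(remote_base);
    }

private:
    std::unordered_map<std::uintptr_t, IpcMapping> cache_;
};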

edgargabriel avatar Apr 07 '25 12:04 edgargabriel