
Pathological reallocation in memory pools

Open msimberg opened this issue 1 year ago • 0 comments

Describe the bug

With certain allocation patterns and sizes, QuickPool never coalesces into a backing allocation large enough to hold the working set, leading to unnecessary reallocations on every iteration.

To Reproduce

Compile the following file against any Umpire version from 2024.02.01 to current develop (older versions are likely also affected, but I have not tested them):

#include <umpire/ResourceManager.hpp>
#include <umpire/strategy/QuickPool.hpp>

#include <iostream>
#include <cstddef>

int main() {
    auto alloc = umpire::ResourceManager::getInstance().getAllocator("DEVICE");
    auto pooled_alloc =
        umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
            "DEVICE_pool", alloc,
            16,  // initial pool size: matters only in that allocations must be larger than it, otherwise no additional backing allocations are needed
            16,  // next block size: not important for reproducing
            16); // alignment: essential; if all allocation sizes are multiples of it, the bug does not occur

    const std::size_t niter = 5;
    for (std::size_t iter = 0; iter < niter; ++iter) {
        std::cerr << "iteration " << iter << '\n';
        auto* p1 = pooled_alloc.allocate(1024 + 8);
        auto* p2 = pooled_alloc.allocate(1024 + 8); // need at least two allocations that are not multiples of the alignment
        pooled_alloc.deallocate(p2);
        pooled_alloc.deallocate(p1);
    }
}

To print the backing allocations, I ran the program through gdb with the following gdb script:

start

break cudaMalloc
commands
continue
end

break cudaFree
commands
continue
end

continue

and then

gdb -batch -x gdbscript umpire_test

produces:

...
iteration 0
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iteration 1
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iteration 2
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iteration 3
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iteration 4
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
...

Sizes are not visible above, but in the first iteration the pool allocates twice: once for the initial buffer and once more to accommodate the second user allocation. It then coalesces, freeing the two backing allocations above and allocating one larger block. However, that block does not take the overallocation from alignment into account and is too small for the next iteration. From then on, each iteration the pool allocates a new backing block for the second user allocation, frees both backing allocations, coalesces into a block that is still too small, and repeats.

If one of the user-requested allocation sizes is a multiple of the 16-byte alignment, the reproducer correctly allocates only on the first iteration and then never again.

I suppose using Umpire's logging should produce the same results; I'm just not familiar enough with it.

I used DEVICE memory only to make it easier to hook gdb into cudaMalloc and to avoid printing unrelated allocations. The same behaviour should occur with any backing memory resource.

I expect this affects DynamicPoolList as well.

Expected behavior

I expect that repeating the same allocation and deallocation pattern multiple times (regardless of initial size, next size, alignment, and user-requested allocation sizes), while letting the pool empty between iterations so it can coalesce (with the default coalescing heuristic), should trigger backing allocations only during the first iteration and in the coalescing step after it. After that, all requests should fit in the pool without any further backing allocations.

Compilers & Libraries (please complete the following information):

  • Compiler & version: GCC 12.3.0
  • CUDA version (if applicable): CUDA 12.2.1 (though not needed to reproduce)

Additional context

If I understand the pools correctly, this behaviour arises because the actual size and the high water mark don't include the overallocation: they sum only the sizes requested by the user, not the sizes eventually allocated after alignment. As a result, cases like the above can occur where the pool never grows big enough to fit all required allocations.

msimberg · Aug 29 '24 14:08