Pathological reallocation in memory pools
Describe the bug
With certain allocation patterns and sizes, QuickPool never coalesces into a backing allocation large enough to hold the working set, leading to unnecessary reallocations on every iteration.
To Reproduce
Compile the following file against anything from 2024.02.01 to the current develop branch (older versions are likely also affected, but I have not tested them):
#include <umpire/ResourceManager.hpp>
#include <umpire/strategy/QuickPool.hpp>

#include <cstddef>
#include <iostream>

int main() {
  auto alloc = umpire::ResourceManager::getInstance().getAllocator("DEVICE");
  auto pooled_alloc =
      umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
          "DEVICE_pool", alloc,
          16,  // initial pool size: only matters in that allocations must be larger than it, otherwise no additional backing allocations are needed
          16,  // next size: not important for reproducing
          16); // alignment: important; if all allocation sizes are multiples of the alignment, everything works

  const std::size_t niter = 5;
  for (std::size_t iter = 0; iter < niter; ++iter) {
    std::cerr << "iteration " << iter << '\n';
    auto* p1 = pooled_alloc.allocate(1024 + 8);
    auto* p2 = pooled_alloc.allocate(1024 + 8); // need at least two allocations whose sizes are not multiples of the alignment
    pooled_alloc.deallocate(p2);
    pooled_alloc.deallocate(p1);
  }
}
To print the underlying allocations, I ran this through gdb with the following script:
start
break cudaMalloc
commands
continue
end
break cudaFree
commands
continue
end
continue
and then
gdb -batch -x gdbscript umpire_test
produces:
...
iteration 0
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iteration 1
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iteration 2
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iteration 3
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
iteration 4
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 2, 0x0000fffff59fac64 in cudaMalloc () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
Thread 1 "miniapp_umpire_" hit Breakpoint 3, 0x0000fffff59fb2d0 in cudaFree () from /capstor/scratch/cscs/simbergm/src/DLA-Future/build/spack/src/libDLAF.so
...
The sizes are not visible above, but in the first iteration the pool allocates twice: once for the initial buffer and once more to accommodate the second user allocation. It then coalesces, freeing the two backing allocations and allocating one larger block. However, this block does not take the alignment overallocation into account and is too small for the next iteration. From then on, the pool always allocates a new backing block for the second user allocation, frees both backing allocations, attempts to coalesce, and the cycle repeats.
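For concreteness, here is the size arithmetic I believe is at play; the numbers assume that each allocation is padded up to the 16-byte pool alignment and that the coalesced block is sized from the sum of the requested (unaligned) bytes, as speculated under "Additional context" below:

#include <cstddef>
#include <iostream>

int main() {
  constexpr std::size_t alignment = 16;
  constexpr std::size_t requested = 1024 + 8; // 1032, not a multiple of 16
  // Assumed behaviour: each request is rounded up to the pool alignment.
  constexpr std::size_t aligned = ((requested + alignment - 1) / alignment) * alignment; // 1040

  std::cout << "bytes needed to hold both live allocations: " << 2 * aligned << '\n';   // 2080
  std::cout << "coalesced block if sized from requests:     " << 2 * requested << '\n'; // 2064, too small
}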
If one of the user-requested allocations is 16-byte aligned, the reproducer correctly allocates only in the first iteration and then never again.
I suppose Umpire's own logging would show the same thing; I'm just not familiar enough with it.
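As a gdb-free alternative, the allocator's own counters should show the same churn. Here is a minimal sketch, assuming umpire::Allocator::getActualSize() reports the total backing memory currently held by the pool:

#include <umpire/ResourceManager.hpp>
#include <umpire/strategy/QuickPool.hpp>

#include <cstddef>
#include <iostream>

int main() {
  auto alloc = umpire::ResourceManager::getInstance().getAllocator("DEVICE");
  auto pooled_alloc =
      umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
          "DEVICE_pool", alloc, 16, 16, 16);

  for (std::size_t iter = 0; iter < 5; ++iter) {
    std::cerr << "iteration " << iter << '\n';
    auto* p1 = pooled_alloc.allocate(1024 + 8);
    std::cerr << "  after first allocate:  " << pooled_alloc.getActualSize() << '\n';
    auto* p2 = pooled_alloc.allocate(1024 + 8);
    // With the bug, this value keeps growing past the coalesced size in every
    // iteration; once fixed, it should stop changing after the first iteration.
    std::cerr << "  after second allocate: " << pooled_alloc.getActualSize() << '\n';
    pooled_alloc.deallocate(p2);
    pooled_alloc.deallocate(p1);
    std::cerr << "  after deallocations:   " << pooled_alloc.getActualSize() << '\n';
  }
}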
I used DEVICE only to make it easy to hook gdb into cudaMalloc and to avoid picking up unrelated allocations. The same behaviour should occur with any backing memory resource.
I expect this affects DynamicPoolList as well.
Expected behavior
I expect that repeating the same allocation and deallocation pattern multiple times (regardless of the initial size, next size, alignment, and user-requested allocation sizes), while letting the pool empty between iterations so it can coalesce (with the default coalescing strategy), would lead to backing allocations only during the first iteration and during the coalescing that follows it. After that, all allocations should fit in the pool without any further backing allocations.
Compilers & Libraries (please complete the following information):
- Compiler & version: GCC 12.3.0
- CUDA version (if applicable): CUDA 12.2.1 (not relevant to reproduce though)
Additional context
If I understand the pools correctly, the reason for this behaviour is that the actual size and the high water mark don't include the overallocation: they sum up the sizes requested by the user, not the sizes that are eventually allocated, so in cases like the above the pool never grows big enough to fit all required allocations.
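A quick way to test this hypothesis might be to allocate a single unaligned block and see whether the pool reports the requested or the aligned size (hypothetical check; it assumes getCurrentSize() returns whatever the pool counts towards its size):

#include <umpire/ResourceManager.hpp>
#include <umpire/strategy/QuickPool.hpp>

#include <cstddef>
#include <iostream>

int main() {
  auto alloc = umpire::ResourceManager::getInstance().getAllocator("DEVICE");
  auto pooled_alloc =
      umpire::ResourceManager::getInstance().makeAllocator<umpire::strategy::QuickPool>(
          "DEVICE_pool", alloc, 16, 16, 16);

  auto* p = pooled_alloc.allocate(1024 + 8); // 1032 requested, 1040 once aligned to 16 bytes
  // If this prints 1032 rather than 1040, the pool only counts requested bytes,
  // which would explain why the coalesced block ends up too small.
  std::cerr << "current size: " << pooled_alloc.getCurrentSize() << '\n';
  pooled_alloc.deallocate(p);
}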