Memory Pool size setting for cudaMallocAsync
A patch to allow setting the pool size for cudaMallocAsync
Implementation
The first time cudaMallocAsync is called (in void* impl_alloc_common() in Kokkos_CudaSpace.cpp), we check the environment variable KOKKOS_CUDA_MEMPOOL_SIZE. We over-allocate this by 64 bytes (an arbitrary amount, just enough to push past the requested size), then set properties on the device's default mempool so that it retains KOKKOS_CUDA_MEMPOOL_SIZE bytes of memory after an async free. Subsequent allocations are then faster (by one to two orders of magnitude) for sufficiently large chunks of memory that fit in the pool.
Efficacy
A benchmark test has been placed in kokkos/benchmarks/async_test which can be built using the 'Makefile' setup.
To execute it, export KOKKOS_CUDA_MEMPOOL_SIZE and run async_alloc.cuda. The utility ranges through allocations from 8B to 16GB and collects timings for allocating (and freeing) a Kokkos::View. The -d flag to async_alloc.cuda can be used to cycle downwards instead, i.e. from 16GB to 8B.
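A typical benchmark session might look like this (the 4200000000-byte value is just an example matching the 4.2GB pool used in the results; the existence check keeps the sketch portable to machines without the binary):

```shell
# Request a ~4.2GB pool (value in bytes); 0 or unset leaves the default behaviour.
export KOKKOS_CUDA_MEMPOOL_SIZE=4200000000

# Run the benchmark if it has been built (see kokkos/benchmarks/async_test).
if [ -x ./async_alloc.cuda ]; then
  ./async_alloc.cuda      # upward sweep: 8B .. 16GB
  ./async_alloc.cuda -d   # downward sweep: 16GB .. 8B
fi
```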
The attached PDF shows the benchmark times, sweeping up from 0 to 8GB allocation sizes with various mempool settings. The gains from the async allocator appear from allocation sizes of 512KB upwards:
- at about 16MiB, the allocator with an unspecified (0) pool size becomes as expensive as cudaMalloc, and beyond that becomes worse
- using a pool maintains an advantage of one to two orders of magnitude, depending on the allocation size
- after 4GB, allocation efficiency with the 4.2GB pool starts to deteriorate as we run out of pool space
This data is from an Ada L40S GPU. Benchmarks on other GPU architectures are currently a work in progress. AsyncAllocUp.pdf