
[PERF] Measure impact of async allocator priming the memory pool


Today, RMM's CUDA async allocator "primes" the memory pool by allocating and deallocating a large chunk of memory. https://github.com/rapidsai/rmm/blob/462d2baaf97ee8bcbe61ff9f302ac55d2349f5c9/cpp/include/rmm/mr/device/cuda_async_memory_resource.hpp#L146-L150
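
For context, this priming amounts to roughly the following (a sketch using plain CUDA runtime calls, not the actual RMM implementation; the free/2 default size comes up further down in this thread):

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>

// Sketch only (error checking omitted): allocate roughly half of the currently
// free device memory through the async allocator on the default stream, then
// immediately free it again.
void prime_async_pool_sketch()
{
  std::size_t free_bytes{};
  std::size_t total_bytes{};
  cudaMemGetInfo(&free_bytes, &total_bytes);

  cudaStream_t default_stream{};  // the default stream, as in the current code
  void* ptr{};
  cudaMallocAsync(&ptr, free_bytes / 2, default_stream);
  cudaFreeAsync(ptr, default_stream);
  cudaStreamSynchronize(default_stream);
}
```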

This is meant to help with later sub-allocations and reduce fragmentation. However, there are some downsides:

  1. It temporarily consumes a large amount of memory, which is not ideal when running many parallel processes that each need only small allocations.
  2. It uses the default stream, which breaks cudf's stream detection logic: https://github.com/rapidsai/cudf/pull/18603#issuecomment-2847949062
     a. Does the async MR constructor need to take an optional stream on which to do the priming?
  3. Systems with integrated memory may suffer, because large allocations of GPU memory also starve CPU memory.

We should run benchmarks on a range of systems to understand the impact of this "priming" and make it optional. The default (prime or not) may need to be system-dependent. The benchmarks should cover several allocation patterns: a few large allocations, many small allocations, a mixture of allocation sizes, etc.

bdice avatar May 28 '25 22:05 bdice

Could we remove the priming from the ctor and just document that the user/caller may see better performance if they prime it themselves? It seems like we should not assume too much about how the memory resource will be used by the calling application. You bring up several good examples of why arbitrary priming is not appropriate under all conditions. The priming can be done by the calling application where appropriate (and with a proper stream), as sketched below.
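
A minimal sketch of what caller-side priming could look like, assuming the memory resource's allocate/deallocate(bytes, stream) interface (names and exact signatures may differ across RMM versions):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>

#include <cstddef>

// Caller-side priming sketch: the application chooses both the size and the
// stream instead of the constructor doing it implicitly.
void prime_on_my_stream(rmm::mr::cuda_async_memory_resource& mr,
                        std::size_t bytes,
                        rmm::cuda_stream_view stream)
{
  void* ptr = mr.allocate(bytes, stream);
  mr.deallocate(ptr, bytes, stream);
  stream.synchronize();
}
```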

This may mean that casual users (who do not know about priming) will not see a performance benefit right away, but perhaps that is an issue we should raise with the CUDA team -- maybe they should improve the malloc API to do some smart priming. I'm guessing they will want to leave this kind of control up to the calling application as well.

davidwendt avatar May 29 '25 00:05 davidwendt

Perhaps priming becomes a utility that RMM provides. Levels higher in the stack can call it / set a configuration for it.

harrism avatar May 29 '25 03:05 harrism

> I'm guessing they will want to leave this kind of control up to the calling application as well.

I agree with @davidwendt here; the user would have a better idea of how to prime the allocator based on their environment configuration.

@bdice @harrism do you have a specific benchmark or workload in mind? I'm thinking of a benchmark parameterized on min and max allocation size and some sort of allocation-size distribution function, which would output the effective utilization.
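
For illustration, here is one way to parameterize the sizes and one way to read "effective utilization" (taken here as used vs. reserved bytes of a CUDA async memory pool; both the uniform distribution and that definition are assumptions, not anything settled in this thread):

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draw allocation sizes uniformly between min_bytes and max_bytes; any other
// distribution could be swapped in here.
std::vector<std::size_t> make_allocation_sizes(std::size_t count,
                                               std::size_t min_bytes,
                                               std::size_t max_bytes,
                                               std::uint64_t seed = 42)
{
  std::mt19937_64 rng{seed};
  std::uniform_int_distribution<std::size_t> dist{min_bytes, max_bytes};
  std::vector<std::size_t> sizes(count);
  for (auto& s : sizes) { s = dist(rng); }
  return sizes;
}

// One possible definition of "effective utilization": used / reserved bytes of
// an async memory pool (shown for the device's default pool; the same
// attributes apply to any cudaMemPool_t handle).
double effective_utilization(int device = 0)
{
  cudaMemPool_t pool{};
  cudaDeviceGetDefaultMemPool(&pool, device);
  std::uint64_t used{};
  std::uint64_t reserved{};
  cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemCurrent, &used);
  cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reserved);
  return reserved == 0 ? 0.0 : static_cast<double>(used) / static_cast<double>(reserved);
}
```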

lamarrr avatar Jun 17 '25 21:06 lamarrr

re-pinging @bdice

lamarrr avatar Aug 04 '25 19:08 lamarrr

Sorry, I missed your last comment here. I think your proposal sounds good. Something like:

  1. Either prime (or not) the async allocator
  2. Do some allocations -- I like the parameters you suggested above. It's fine to do something simple / regular like 10 allocations of 10% of the memory, 20 allocations of 5% of the memory, etc., then deallocate all and repeat the allocations.
  3. Measure latency to first allocation, throughput of first round of allocations, throughput of second round of allocations
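
A rough sketch of that measurement, assuming the allocate/deallocate(bytes, stream) interface and the optional initial_pool_size constructor argument discussed below (exact signatures vary across RMM versions):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>

#include <chrono>
#include <cstddef>
#include <optional>
#include <vector>

struct round_timings {
  double first_alloc_ms;  // latency to first allocation
  double round1_ms;       // first round of allocations
  double round2_ms;       // second round (after deallocating everything)
};

// Sketch: optionally prime via initial_pool_size, then time the first
// allocation and two allocate-all / free-all rounds over the given sizes.
round_timings run_benchmark(std::optional<std::size_t> initial_pool_size,
                            std::vector<std::size_t> const& sizes,
                            rmm::cuda_stream_view stream)
{
  using clock = std::chrono::steady_clock;
  auto const ms = [](auto d) { return std::chrono::duration<double, std::milli>(d).count(); };

  // Under the change discussed below, passing std::nullopt would skip priming.
  rmm::mr::cuda_async_memory_resource mr{initial_pool_size};

  round_timings out{};
  for (int round = 0; round < 2; ++round) {
    std::vector<void*> ptrs;
    ptrs.reserve(sizes.size());

    auto const start = clock::now();
    for (std::size_t i = 0; i < sizes.size(); ++i) {
      ptrs.push_back(mr.allocate(sizes[i], stream));
      if (round == 0 && i == 0) { out.first_alloc_ms = ms(clock::now() - start); }
    }
    stream.synchronize();
    (round == 0 ? out.round1_ms : out.round2_ms) = ms(clock::now() - start);

    for (std::size_t i = 0; i < sizes.size(); ++i) { mr.deallocate(ptrs[i], sizes[i], stream); }
    stream.synchronize();
  }
  return out;
}
```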

bdice avatar Aug 04 '25 19:08 bdice

initial_pool_size is optional, and yes, the default value of free/2 is probably a bit arbitrary. Could we keep the behavior if and only if initial_pool_size is set, and omit priming if it isn't?

Note that Spark sets initial_pool_size == release_threshold. That way, for us, it locks the memory we want, since it is allocated and not returned to CUDA.

That said, I think we are open to whatever change is decided. We do use the initial_pool_size arg, so we would want to adjust our code accordingly.

abellina avatar Aug 04 '25 20:08 abellina

Yes, we're currently only exploring a change of behavior to skip priming when initial_pool_size is left as the default of std::nullopt.
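
Spelled out, the conditional being explored is roughly the following (decision logic only, sketched with plain CUDA runtime calls rather than RMM internals; not the actual patch):

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>
#include <optional>

// Prime only when the caller explicitly requested an initial pool size; with
// the default std::nullopt, skip priming entirely.
void maybe_prime(std::optional<std::size_t> initial_pool_size, cudaStream_t stream)
{
  if (!initial_pool_size.has_value()) { return; }  // default: no priming

  void* ptr{};
  cudaMallocAsync(&ptr, *initial_pool_size, stream);
  cudaFreeAsync(ptr, stream);
  cudaStreamSynchronize(stream);
}
```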

bdice avatar Aug 04 '25 20:08 bdice

> Sorry, I missed your last comment here. I think your proposal sounds good. Something like:
>
> 1. Either prime (or not) the async allocator
> 2. Do some allocations -- I like the parameters you suggested above. It's fine to do something simple / regular like 10 allocations of 10% of the memory, 20 allocations of 5% of the memory, etc., then deallocate all and repeat the allocations.
> 3. Measure latency to first allocation, throughput of first round of allocations, throughput of second round of allocations

This experiment sounds right. I would suggest rerunning it for a few different initial pool sizes and for allocation sequences that stress slightly different fragmentation/pool-resizing behaviors, to get a sense of how expensive (or not) it is to prime the pool with an allocation that is not commensurate with the application's typical memory usage patterns.
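
Using the run_benchmark sketch from earlier in the thread, such a sweep might look like the following (the pool sizes and allocation patterns are arbitrary placeholders):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

#include <rmm/cuda_stream_view.hpp>

// Sweep a few initial pool sizes and allocation patterns with the
// run_benchmark sketch above; all concrete values are placeholders.
void sweep(rmm::cuda_stream_view stream)
{
  constexpr std::size_t MiB = std::size_t{1} << 20;
  constexpr std::size_t GiB = std::size_t{1} << 30;

  std::vector<std::optional<std::size_t>> pool_sizes{std::nullopt, 1 * GiB, 8 * GiB};
  std::vector<std::vector<std::size_t>> patterns{
    std::vector<std::size_t>(10, 1 * GiB),      // a few large allocations
    std::vector<std::size_t>(10'000, 1 * MiB),  // many small allocations
  };

  for (auto const& pool_size : pool_sizes) {
    for (auto const& sizes : patterns) {
      auto const t = run_benchmark(pool_size, sizes, stream);
      // ... record t.first_alloc_ms, t.round1_ms, t.round2_ms per configuration ...
      (void)t;
    }
  }
}
```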

vyasr avatar Sep 24 '25 19:09 vyasr

In https://github.com/rapidsai/rmm/pull/2051, I ran PDS-H benchmarks as well as microbenchmarks. Priming the async allocator seems to have no beneficial impact on workflow performance and increases startup costs, so I recommend we disable priming by default.

bdice avatar Sep 29 '25 13:09 bdice