[QST] How to reduce the overhead & bottlenecks of `do_allocate` and `do_deallocate` with multistreams?
Hello RMM team,
I'm currently using the RMM device memory pool rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> as described in the README. My program runs with multiple streams and with PTDS (per-thread default streams) enabled.
When profiling, I've noticed that the mutex operations in do_allocate and do_deallocate become a major bottleneck, for example when using 32 streams.
Bottleneck Example
Let me illustrate this with a simple profiling result. I'm using an rmm::device_buffer as the temporary buffer for a cub::reduce_by_key call. The execution time of the CUB kernel (~40μs) is only a tiny fraction of the total time, while do_allocate and do_deallocate each take over 1 ms because concurrent threads hold the mutex during allocation and deallocation. This prevents the CPU from launching kernels quickly enough, and as a result the GPU isn't fully utilized.
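For concreteness, the pattern looks roughly like the sketch below (not my exact benchmark code; the input/output iterators, the reduction operator, and num_items are placeholders):

```cpp
#include <cub/device/device_reduce.cuh>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>

// Placeholder reduction operator.
struct sum_op {
  template <typename T>
  __host__ __device__ T operator()(T const& a, T const& b) const { return a + b; }
};

template <typename KeyIt, typename ValIt, typename KeyOutIt, typename ValOutIt, typename CountIt>
void reduce_by_key_on_stream(KeyIt d_keys_in, ValIt d_values_in, KeyOutIt d_unique_out,
                             ValOutIt d_aggregates_out, CountIt d_num_runs_out, int num_items,
                             rmm::cuda_stream_view stream)
{
  // Phase 1: ask CUB how much temporary storage it needs.
  std::size_t temp_storage_bytes = 0;
  cub::DeviceReduce::ReduceByKey(nullptr, temp_storage_bytes, d_keys_in, d_unique_out, d_values_in,
                                 d_aggregates_out, d_num_runs_out, sum_op{}, num_items,
                                 stream.value());

  // This allocation goes through the pool's do_allocate and has to take the mutex.
  rmm::device_buffer temp_storage(temp_storage_bytes, stream);

  // Phase 2: launch the actual kernel (only ~40us of GPU time in the profile above).
  cub::DeviceReduce::ReduceByKey(temp_storage.data(), temp_storage_bytes, d_keys_in, d_unique_out,
                                 d_values_in, d_aggregates_out, d_num_runs_out, sum_op{}, num_items,
                                 stream.value());

  // temp_storage is destroyed here, which goes through do_deallocate and takes the mutex again.
}
```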
The first pthread_mutex_lock comes from do_allocate:
The last pthread_mutex_lock comes from do_deallocate:
In RMM: stream_ordered_memory_resource
https://github.com/rapidsai/rmm/blob/da3f558357033ee9b1caed70109453d9a35b9630/include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp#L205-L259
I've examined the code in RMM and believe that the critical sections in do_allocate and do_deallocate are quite large. Given that my program uses PTDS, I think some parts of these functions could be moved outside the critical section. However, implementing such a fix isn't straightforward, and I haven't found an obvious solution yet :(
My Questions
My main question is how to work around the bottlenecks caused by do_allocate and do_deallocate. Should I consider switching to another memory resource inside RMM? Or is this overhead an inevitable cost of using a pooled memory system?
Ultimately, I'd like to optimize the CPU-side (mutex) cost of memory allocation so that the program is no longer limited by these bottlenecks.
I've attempted to create a minimal reproducible example in Python; I believe the same issue occurs there as well. However, the stack information obtained from nsys for Python isn't as detailed as in the C++ profiling results above (where I used CUB).
In Python, I'm performing a very straightforward task. For each non-default stream, it repeatedly allocates device buffers and then deallocates them automatically. Since there's no actual computation involved, I'm using NVTX to mark the buffer operation time. Unfortunately, in this setup, Python can't distinguish between the allocation time and the deallocation time. Nevertheless, my main goal here is just to illustrate the problem.
Profiling Results
Single Stream:
32 Streams: With 32 streams, each NVTX range can take over 10 ms, covering the allocation and presumably the deallocation as well.
Code
```python
import rmm
from rmm import pylibrmm
from rmm.pylibrmm.stream import Stream, DEFAULT_STREAM, PER_THREAD_DEFAULT_STREAM
import argparse
import threading
import nvtx

BUFFER_SIZE = 1024
ITER = 1000


def stream_alloc():
    s = Stream()
    assert not s.is_default()  # a DeviceBuffer on a non-default stream does no sync
    for _ in range(ITER):
        rng = nvtx.start_range(message="DBuffer")
        # Allocate a device buffer; rebinding `buf` each iteration deallocates the previous one.
        buf = pylibrmm.device_buffer.DeviceBuffer(size=BUFFER_SIZE, stream=s)
        nvtx.end_range(rng)
    s.synchronize()


def create_device_buffers_in_streams(num_threads):
    threads = []
    for _ in range(num_threads):
        t = threading.Thread(target=stream_alloc)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Create rmm device buffers in multiple streams.")
    parser.add_argument("num_threads", type=int, help="Number of CUDA streams/threads to use.")
    args = parser.parse_args()
    num_threads = args.num_threads

    pool = rmm.mr.PoolMemoryResource(
        rmm.mr.CudaMemoryResource(), initial_pool_size="20GiB", maximum_pool_size="30GiB"
    )
    rmm.mr.set_current_device_resource(pool)
    create_device_buffers_in_streams(num_threads)
```
Thanks for this report! I haven't looked at this in the same detail you have, but your analysis makes sense, and we would certainly be impacted by this as well in situations where we use the pool mr (or any stream-ordered mr in rmm). Unfortunately, I suspect that as you noted implementing a fix for this probably isn't super straightforward. What we probably need is to replace the existing std::map used to maintain the list of blocks with a lock-free, thread-safe data structure that will allow safely handling per-thread block allocation. However, a data structure change alone is probably insufficient given the nature of the block allocation that looks across different streams and events, so we will probably need to use some atomics within the get_block implementation as well to ensure that it safely handles the switch when searching for data on other streams. I think it could be worthwhile, but it wouldn't be a small project to get right. Do you have any familiarity with such implementations?
@vyasr Thank you for your reply. The reason I raised this issue was to make sure I'm using RMM correctly. From your response, I now understand that this stream-ordered device memory allocation is indeed (or at least can be) blocked by a single mutex.
Regarding my contribution: I have thoroughly read the relevant code. Fortunately, there isn't a vast amount of code in the single header file include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp, so the scope of my review wasn't overly large. I grasp the concept of lock-free memory allocation using atomic operations. I'm eager to contribute and realize that I'll need to have discussions and ask questions of those who are well-versed in RMM.
Here are a couple of comments I'd like to add:
- While reading the code, I also wondered whether there's extra room for performance improvement specifically for the case where PTDS is enabled. If PTDS can be assumed, less synchronization or mutex protection might be needed.
- When tracing this issue back to the code, I thought about giving each per-thread default stream its own thread-local mutex and thread-locally managed memory. That way, "the hot path", the most frequent allocation pattern, could be extremely fast: the stream associated with each PTDS would simply get thread-locally managed memory instead of being blocked by the single monolithic mutex in the header today (rough sketch below).
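To make that concrete, here is a rough, hypothetical sketch of the idea (this is not RMM code; allocate_from_shared_pool stands in for the existing pooled allocation, the cache never returns blocks to the shared pool, and cross-thread frees are ignored):

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

struct cached_block {
  void* ptr;
  std::size_t size;
};

class ptds_cached_pool {
 public:
  void* allocate(std::size_t size)
  {
    // Hot path: no global mutex, only this thread's cache is touched.
    auto& cache = local_cache();
    for (auto it = cache.begin(); it != cache.end(); ++it) {
      if (it->size >= size) {
        void* p = it->ptr;
        cache.erase(it);
        return p;
      }
    }
    // Cold path: shared pool, protected by the single mutex as today.
    std::lock_guard<std::mutex> lock(mtx_);
    return allocate_from_shared_pool(size);
  }

  void deallocate(void* p, std::size_t size)
  {
    // Under PTDS the freeing thread is (usually) also the allocating thread, so the block
    // can simply go back into this thread's cache for reuse.
    local_cache().push_back(cached_block{p, size});
  }

 private:
  static std::vector<cached_block>& local_cache()
  {
    thread_local std::vector<cached_block> cache;
    return cache;
  }

  // Placeholder for the real pool; malloc keeps the sketch self-contained.
  void* allocate_from_shared_pool(std::size_t size) { return std::malloc(size); }

  std::mutex mtx_;
};
```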
Feel free to continue discussions on this thread! We are happy to assist you with any information that you might need if you are interested in making contributions here!
The current naming of the pool and stream-ordered mrs is somewhat inaccurate. All of rmm's mrs are stream-ordered by construction. Currently the stream_ordered_memory_resource effectively assumes some sort of pool structure exists, so it is more than just a stream-ordered mr; it is really aimed at capturing some concept of allocating from a pool of blocks. The mutex is needed because different threads are allocating from the same pool of blocks. The fact that streams may be associated with particular threads is something of a red herring here; the important thing is that multiple threads are interacting with the pool data structure.
You propose to use a thread-local mutex, but you also suggest using thread-locally managed memory. I believe the latter would actually make the former superfluous. If you are allocating memory per thread (i.e. if you had a separate pool for each thread) then you would never need to lock anything because each thread would be operating completely independently (you would be relying on the thread safety of underlying cuda allocation operations, but that is fine). My guess is that such an implementation would encounter different performance issues, though:
- You would increase fragmentation since every thread would be blocking off a subset of memory for itself
- You would significantly increase the number of pool resizing operations since each thread would start with a much smaller pool
- If you have heterogeneous memory requirements across threads, contention between the subpools for GPU memory could lead to even more fragmentation and make it more challenging to get the allocations that you need
My guess is that we can do better than that with a smart implementation of get_block.
Hi @vyasr, I performed a general runtime breakdown of my benchmark workload. The setup uses PTDS with a memory pool initialized to 95% of an A100, and the following analysis steps assume this configuration.
- The get_event function only accesses the thread_local vector and returns the same stream_event_pair (though as a copy each time). It creates the event only at the start of each stream, then stores it in the thread_local vector for reuse.
  - get_event can be protected by an individual lock.
- In each stream, get_block first calls get_block_from_other_stream to insert the new entry for this stream into the std::map stream_free_blocks_. After a few calls, stream_free_blocks_ stabilizes in size, and from then on every get_block call finds a block in the same stream's free list in my workload. Once stabilized, accessing stream_free_blocks_ is thread-safe, as each thread-stream pair has its own entry and each thread only does lookups into this std::map.
  - With 32 streams, it stabilized at size 35 (I need to investigate why not 33).
  - TLDR: the size of stream_free_blocks_ grows from 0 to O(num_streams), then stays stable and is only looked up, as long as get_block_from_other_stream is not called again later on. I call this the "hotpath": https://github.com/rapidsai/rmm/blob/a828642553fd72610f1361240b3473bb72a46d51/cpp/include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp#L336-L341
  - ~~[Update]: Still can't get a fully correct lock-free hotpath. Need to check if get_block_from_other_stream is causing the segfault.~~
  - [NewUpdate]: I did an implementation of the hotpath with a read-write lock, where each free_list also has its own mutex (sketch below). So it's not lock-free, but the current performance looks good. I will continue to report.
- So I think a lock-free hashtable isn't 100% ideal for stream_free_blocks_, since it's rarely inserted/updated but frequently looked up. A fix could involve redesigning the critical section to make the "hotpath" lock-free, which I think is still possible with std::map. The "coldpath" is then the call to get_block_from_other_stream, which needs a lock.
- However, do_deallocate calls free_block, which looks up another global std::set. This also needs to be investigated.
  - [Update]: It could be thread-safe, as the free_block call only does a lookup on a std::set.
- Regarding lock-free data structures: stream_free_blocks_ could be implemented as a lock-free hashtable, but there's little need for mixed read/write operations on it.
Again this analysis applies only to PTDS. Please correct me if my understanding is off.
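To illustrate what I mean by hot and cold paths, here is a hypothetical sketch (not the actual change I'm preparing): a reader-writer lock guards the std::map itself, each free list carries its own mutex, and only inserting a new stream entry, where get_block_from_other_stream would run, takes the exclusive lock.

```cpp
#include <cuda_runtime_api.h>

#include <map>
#include <mutex>
#include <shared_mutex>

struct per_stream_free_list {
  std::mutex mtx;  // protects this stream's list of free blocks
  // ... list of free blocks elided ...
};

class hotpath_sketch {
 public:
  per_stream_free_list& get_free_list(cudaStream_t stream)
  {
    {
      // Hot path: the stream already has an entry, so a shared (read) lock is enough for
      // the lookup. std::map never invalidates references on insert, so returning the
      // reference after dropping the lock is fine here (nothing ever erases entries).
      std::shared_lock<std::shared_mutex> read_lock(map_mtx_);
      auto it = stream_free_blocks_.find(stream);
      if (it != stream_free_blocks_.end()) { return it->second; }
    }
    // Cold path: the first request on this stream (and anything that mutates the map,
    // like get_block_from_other_stream) takes the exclusive (write) lock.
    std::unique_lock<std::shared_mutex> write_lock(map_mtx_);
    return stream_free_blocks_[stream];
  }

 private:
  std::shared_mutex map_mtx_;
  std::map<cudaStream_t, per_stream_free_list> stream_free_blocks_;
};
```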
I'll submit a WIP Pull Request shortly. Before that, I've been profiling with more NVTX ranges and more detail to pin down how this issue shows up in cuDF's Parquet reading under multi-stream scenarios. Here's the core insight:
- Inside libcudf, the typical workload flow is as follows: a function first requests memory from the RMM memory pool and must wait until it acquires the mutex (only then does it get the device memory). After that, it launches a kernel on the allocated memory. These steps are repeated for the other kernels.
- This workload closely resembles that of CUB: first determine the required temporary storage size, then allocate the memory, and finally launch the CUB kernel.
A crucial observation (which I will mention again in the PR) is that:
- Assuming there are two CUB kernels to launch: after asynchronously launching the first CUB kernel, the function proceeds to request the next chunk of device memory from the pool and enters a synchronous wait on the mutex.
- As a result, the CPU-side wait on the mutex overlaps with the asynchronous execution of the first CUB kernel on the GPU (see the sketch below).
I only truly grasped this overlap after having a dummy working PR. Prior to that, I had hypothesized that optimizing this issue could yield significant speed improvements.
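Roughly, the sequence described in the bullets above looks like this sketch (kernel_1, kernel_2, and the buffer sizes are placeholder stand-ins for the real libcudf/CUB kernels; the pool is assumed to be the current device resource):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

// Hypothetical stand-ins for the real kernels.
__global__ void kernel_1(char*) {}
__global__ void kernel_2(char*) {}

void two_step_workload(rmm::cuda_stream_view stream)
{
  rmm::device_buffer a(1 << 20, stream);  // CPU blocks here waiting on the pool mutex
  kernel_1<<<1, 256, 0, stream.value()>>>(static_cast<char*>(a.data()));  // async on the GPU

  // While kernel_1 is still running on the GPU, this allocation waits on the same mutex
  // on the CPU, so part of the mutex wait overlaps the kernel's execution.
  rmm::device_buffer b(1 << 20, stream);
  kernel_2<<<1, 256, 0, stream.value()>>>(static_cast<char*>(b.data()));
}
```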
Have you looked at arena_memory_resource? We wrote it mainly for the RAPIDS Spark use case, which sounds similar to what you describe, lots of CPU threads with PTDS.
Hi @rongou, I haven't used arena_memory_resource, at least not directly, so I'm unfamiliar with it. My use case mainly follows the RMM README example, using pool_memory_resource (internally a stream_ordered_memory_resource) together with cudf.
A quick read of the arena_memory_resource code suggests its locking is finer-grained than stream_ordered_memory_resource's, thanks to its two shared mutexes:
https://github.com/rapidsai/rmm/blob/a828642553fd72610f1361240b3473bb72a46d51/cpp/include/rmm/mr/device/arena_memory_resource.hpp#L359-L362
A per-thread arena_memory_resource or pool_memory_resource (optionally with a fallback to a shared upstream) would be ideal for multi-threaded allocations; that's the only way to totally eliminate contention. As long as resources are shared and allocations are frequent, there will always be some contention and synchronization overhead.
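For example, something along these lines (a sketch only; it assumes the arena constructor that takes an upstream pointer, that each thread frees what it allocates, and that the arena outlives its allocations):

```cpp
#include <cstddef>

#include <rmm/cuda_stream.hpp>
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>

rmm::mr::cuda_memory_resource upstream;  // shared upstream for all threads

void worker()
{
  // One arena per thread: allocations served from this arena never contend with other
  // threads; only growing the arena has to go to the shared upstream.
  rmm::mr::arena_memory_resource<rmm::mr::cuda_memory_resource> arena{&upstream};
  rmm::cuda_stream stream;  // this thread's stream

  constexpr std::size_t bytes = 1 << 20;
  void* p = arena.allocate(bytes, stream.view());
  // ... launch kernels on `stream` using p ...
  arena.deallocate(p, bytes, stream.view());
}
```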
Hi @lamarrr, thanks for the feedback! I initially considered this approach and posted it on Apr 17 above, but as @vyasr pointed out, it could cause memory fragmentation within individual threads/streams. I agree with his argument, especially since RMM needs to remain generally applicable.
That's why I lean toward a shared memory pool across all threads. In my PR draft #1912, I differentiated between hot and cold paths to achieve a large speedup using only C++ mutexes; the resulting microsecond-level latency should be fine for my cudf Parquet reading workloads.
Closed as discussed in #2035 Thanks to all who commented and helped!