[FEA] A pool memory resource backed by virtual memory
A common issue with the current rmm::pool_memory_resource is fragmentation. Is it possible to provide a pool memory resource that is backed by a virtual address space to hide fragmentation?
@madsbk is experimenting with this using the RMM Python API here. It would be great if this could eventually be upstreamed to C++.
I will be happy to port https://github.com/rapidsai/dask-cuda/pull/998/ to C++ when we get it to work. Currently, @pentschev and I are working on UCX support.
It's something we have discussed. However, I wonder if just using the MR we have that uses cudaMallocAsync wouldn't solve the same problem. I believe cudaMallocAsync is backed by the same virtual memory APIs and should do a pretty good job of minimizing fragmentation.
https://github.com/rapidsai/rmm/blob/branch-22.10/include/rmm/mr/device/cuda_async_memory_resource.hpp
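For reference, a minimal sketch of using that resource from C++ (assuming a recent RMM branch; pool settings left at their defaults):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main()
{
  // cudaMallocAsync-backed resource; the driver manages the underlying
  // virtual/physical memory and handles fragmentation internally.
  rmm::mr::cuda_async_memory_resource mr{};
  rmm::mr::set_current_device_resource(&mr);

  // Allocations now go through cudaMallocAsync under the hood.
  rmm::device_buffer buf(1 << 20, rmm::cuda_stream_default);
  return 0;
}
```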
We have considered cudaMallocAsync, but the problem is UCX support, which requires more low-level control of the physical and virtual memory mapping.
As far as we can tell, UCX requires:
- Once used in communication, physical memory should never be freed.
- Once used in communication, mapped virtual addresses should never be unmapped.
We are currently trying an approach where our RMM resource maintains a pool of physical memory blocks that are mapped to virtual addresses at mr.allocate(), such that the user sees one contiguous memory allocation. To support UCX, we split a user allocation back into its underlying physical memory blocks and translate UCX operations into a series of operations on those blocks.
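Not our actual implementation, but to illustrate the mechanism: the CUDA virtual memory management APIs let you reserve one contiguous virtual range and map separately created physical chunks into it. A rough sketch (error handling, context setup, and the cuMemGetAllocationGranularity alignment requirement omitted; the chunk layout here is an assumption):

```cpp
#include <cuda.h>
#include <cstddef>
#include <vector>

// Sketch: back one contiguous virtual allocation with several physical chunks.
CUdeviceptr map_blocks(std::size_t num_chunks, std::size_t chunk_size, int device)
{
  CUmemAllocationProp prop{};
  prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id   = device;

  // Reserve a contiguous virtual address range large enough for all chunks.
  CUdeviceptr va{};
  cuMemAddressReserve(&va, num_chunks * chunk_size, 0, 0, 0);

  std::vector<CUmemGenericAllocationHandle> handles(num_chunks);
  for (std::size_t i = 0; i < num_chunks; ++i) {
    // Create a physical memory chunk and map it at the right offset.
    cuMemCreate(&handles[i], chunk_size, &prop, 0);
    cuMemMap(va + i * chunk_size, chunk_size, 0, handles[i], 0);
  }

  // Make the whole range accessible from this device.
  CUmemAccessDesc access{};
  access.location = prop.location;
  access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(va, num_chunks * chunk_size, &access, 1);

  return va;  // the user sees a single contiguous allocation
}
```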
On the Spark side we use bounce buffers for UCX communication. Generally RMM does not play nicely with UCX, so we allocate several bounce buffers using regular cudaMalloc that we can use to send/receive data. Having several of them allows us to be filling one while another is being sent, and so on. I know it is not zero copy, but the performance impact is relatively small. @abellina might be able to comment more about how we tuned it all and the sizes, but it turned out that the size needed was not that big.
We normally use 4MB buffers, and we set aside up to ~500MB for these buffers (except for T4s, where we normally set aside ~200MiB). The main benefit of bounce buffers is that they allow us to work around BAR issues (we don't have to worry about 256MB BAR spaces, since we hardcode how much BAR we will ever register), and they also allow us to have a regular cudaMallocAsync pool for the rest of our app (we can copy from GPU memory allocated in the pool to a bounce buffer). Additionally, memory registration (ibv_reg_mr, and opening of IPC mem handles) happens once, early during startup, not at task time.
Another benefit is that we can use really fast D2D copies to pack these buffers with what would have been many small calls to send/recv. All of this requires a metadata layer that defines what is in a buffer, so we do have to send that metadata message ahead of our actual message.
The main drawback is that we now have a hard upper limit on the number of bytes in flight, plus the extra D2D copies and the complexity of managing all of this.
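A toy sketch of that packing pattern, assuming 4MB buffers and with a placeholder send_over_ucx() standing in for the real UCX send (not an actual API):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

constexpr std::size_t kBounceBufSize = 4 << 20;  // 4MB, as described above
constexpr int kNumBounceBufs         = 4;        // made-up count

// Placeholder for the real UCX send; an assumption, not an actual API.
void send_over_ucx(void*, std::size_t, cudaStream_t) {}

int main()
{
  // Bounce buffers come from plain cudaMalloc, outside the RMM pool,
  // so they can be registered once at startup.
  std::vector<void*> bounce(kNumBounceBufs);
  for (auto& b : bounce) cudaMalloc(&b, kBounceBufSize);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Stand-in for pool-allocated results that need to be transferred.
  void* src;
  cudaMalloc(&src, kBounceBufSize);

  // Pack with a fast D2D copy, then hand the bounce buffer to UCX while
  // the next buffer in the ring is being filled.
  cudaMemcpyAsync(bounce[0], src, kBounceBufSize, cudaMemcpyDeviceToDevice, stream);
  send_over_ucx(bounce[0], kBounceBufSize, stream);

  cudaStreamSynchronize(stream);
  for (auto b : bounce) cudaFree(b);
  cudaFree(src);
  cudaStreamDestroy(stream);
  return 0;
}
```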
Thanks @revans2 and @abellina for the comments. That was indeed one of our ideas: just use a bounce buffer for UCX and the remaining memory for our regular pool. It is something we could explore more, and it is still on @madsbk's and my TODO list.
As a more complex, but potentially more performant, solution, Mads and I were prototyping a pool that utilizes small memory blocks backed by physical memory that we can distribute to the user application upon request, either by delivering a piece of one block if the allocation is smaller, or by combining multiple blocks for larger allocations. In this way it seems possible to not use a bounce buffer at all, given that we could pre-register all of the allocations (except for devices with small BAR sizes) and, for now, let the application (e.g., UCX-Py) deal with a buffer that spans multiple blocks by transferring it in multiple steps.
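To make the last part concrete, a hypothetical sketch of how a transfer over an allocation spanning multiple physical blocks could be split into per-block operations (Block, Allocation, and transfer_block are made-up names, not actual RMM or UCX-Py APIs):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one user allocation: the contiguous virtual
// range the user sees, plus the physical blocks that back it.
struct Block { std::uintptr_t ptr; std::size_t size; };
struct Allocation { std::uintptr_t base; std::vector<Block> blocks; };

// Placeholder for transferring a single pre-registered block over UCX.
void transfer_block(std::uintptr_t /*ptr*/, std::size_t /*size*/) { /* UCX call here */ }

// Transfer a sub-range of the allocation by walking its underlying blocks.
void transfer(const Allocation& alloc, std::size_t offset, std::size_t length)
{
  std::size_t pos = 0;  // offset of the current block within the allocation
  for (const auto& b : alloc.blocks) {
    std::size_t begin = std::max(pos, offset);
    std::size_t end   = std::min(pos + b.size, offset + length);
    if (begin < end) { transfer_block(b.ptr + (begin - pos), end - begin); }
    pos += b.size;
  }
}
```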
I am not familiar with the APIs to get physical memory, so I can't comment on that at this point, but I would think that if you can allocate from a different pool for data that is destined for UCX from the beginning, that is clearly better. It removes the D2D copy, and potentially brings more benefit with the physical addresses (?) (cheaper to register?).
In Spark, we use UCX much later, after the original results are produced. We cache a bunch of GPU buffers that will eventually be transferred. This means we want to get access to the whole GPU, because if everything fits, it's really great; if it doesn't fit, we have to spill to host memory or disk. Regardless, the UCX transfer happens after all of the writes have completed, so at that point we would need to send from random addresses on the GPU (assuming it all fit), or copy to the bounce buffer and then send.