rmm
[FEA] Python interface for `arena_memory_resource`
Is your feature request related to a problem? Please describe.
Hi, I'm trying to use rmm in a multi-threaded application where each thread prefetches some data using a CUDA stream taken from a pool. The data is fetched in a child thread and used in the main thread on the same CUDA stream, then released in the main thread after use. During profiling, I found that with the pool memory resource, memory allocation and deallocation are not performed in parallel. I later found arena_memory_resource in the C++ API and would like to try it out.
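The access pattern described above can be sketched as follows. This is a toy CPU-side model only: plain Python objects stand in for device buffers and CUDA streams, and all names are illustrative, not rmm's API.

```python
import concurrent.futures

# Toy model of the pattern above: child threads prefetch data, the main
# thread consumes it and then releases it. Plain Python objects stand in
# for device buffers and CUDA streams; all names are illustrative.

def prefetch(stream_id, n):
    # In the real application this would enqueue async copies on a CUDA
    # stream taken from a pool.
    return stream_id, list(range(n))

results = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(prefetch, s, 8) for s in range(4)]
    for fut in concurrent.futures.as_completed(futures):
        stream_id, data = fut.result()  # consumed in the main thread
        results[stream_id] = sum(data)
        del data                        # released after use in the main thread
```

In this shape, the allocation happens in the child thread and the deallocation in the main thread, which is why lock contention inside the memory resource shows up as serialized malloc/free in the profile.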
Describe the solution you'd like
This feature request is really a two-part question. First, does arena_memory_resource help with such use cases? If so, is there any plan to expose it in the Python interface?
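For context on why an arena design can help here: as I understand it, each thread gets its own arena carved out of a shared pool, so most allocations are served thread-locally and only an occasional refill needs the global lock. A toy CPU-side sketch of that idea (all names and sizes are illustrative, not rmm's implementation):

```python
import threading

class ToyArenaPool:
    """Toy model of the arena idea: each thread carves a chunk (an "arena")
    out of the shared pool under a lock, then serves small allocations from
    that arena without touching the global lock again."""

    def __init__(self, pool_size, arena_size):
        self._lock = threading.Lock()
        self._free = pool_size           # bytes left in the shared pool
        self._arena_size = arena_size
        self._local = threading.local()  # per-thread arena state

    def _refill(self):
        # Global lock is taken only when a thread's arena runs out.
        with self._lock:
            if self._free < self._arena_size:
                raise MemoryError("pool exhausted")
            self._free -= self._arena_size
        # Any leftover in the old arena is simply discarded in this toy.
        self._local.remaining = self._arena_size

    def allocate(self, nbytes):
        if getattr(self._local, "remaining", 0) < nbytes:
            self._refill()
        self._local.remaining -= nbytes  # lock-free fast path
        return nbytes

pool = ToyArenaPool(pool_size=1 << 20, arena_size=1 << 12)

def worker(out, i):
    # 100 small allocations, only a couple of which hit the global lock
    out[i] = sum(pool.allocate(64) for _ in range(100))

out = [0] * 4
threads = [threading.Thread(target=worker, args=(out, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the sketch is the fast path: once a thread owns an arena, its allocations don't contend with other threads, which is the property that matters for the multi-threaded prefetch pattern above.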
Describe alternatives you've considered
I tried the pool memory resource and the CUDA async memory resource; the performance is similar. From Nsight Systems, the pool memory resource appears to manage memory under a lock, preventing parallel malloc and free. The cudaEvent usage in rmm also seems to involve locks, but I'm not sure what its effect on performance is.
Additional context
Feel free to ping me if you need the profiling results from Nsight Systems.
We use arena_memory_resource for Spark in Java/Scala, so we didn't have a need for the Python wrapper. I probably don't have time to work on this in the near future. @trivialfis feel free to contribute. :)
I might not need the feature right now. As @jrhemstad suggested, the issue in my code is caused by pageable host memory. I switched to pinned memory, but its allocation cost is now the bottleneck.
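Since pinned allocation is the expensive call, one common workaround is to allocate a fixed set of pinned buffers up front and recycle them rather than freeing them. A toy sketch of that caching pattern, with `bytearray` standing in for a real pinned allocation (e.g. cudaHostAlloc); the class and its names are hypothetical:

```python
import queue

class PinnedBufferCache:
    """Toy sketch of amortizing an expensive allocation (e.g. pinned host
    memory) by recycling buffers instead of freeing them. bytearray stands
    in for a real pinned allocation."""

    def __init__(self, nbytes, count):
        self._free = queue.SimpleQueue()
        self._nbytes = nbytes
        self.allocations = 0
        for _ in range(count):           # pay the allocation cost up front
            self._free.put(self._alloc())

    def _alloc(self):
        self.allocations += 1            # the expensive call happens here
        return bytearray(self._nbytes)

    def acquire(self):
        try:
            return self._free.get_nowait()
        except queue.Empty:
            return self._alloc()         # grow only when the cache is empty

    def release(self, buf):
        self._free.put(buf)              # recycle instead of freeing

cache = PinnedBufferCache(nbytes=4096, count=2)
for _ in range(100):                     # 100 transfers reuse the same buffers
    buf = cache.acquire()
    cache.release(buf)
```

With the buffers reused this way, the allocation cost is paid once at startup instead of on every transfer.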
Two other points:

- I believe arena_memory_resource uses separate read and write locks, which may enable more concurrency between host threads. We can try something similar in pool_memory_resource.
- Just discovered today that multi-stream cycling through buffers can result in oversynchronization in stream_ordered_memory_resource. I think this can be improved by using an LRU cache or something similar to choose which stream to "steal" blocks from. This may also benefit multi-threaded use cases where each thread has its own stream (per-thread default stream).
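The LRU idea in the second point could look roughly like this (a toy sketch, not rmm code): keep per-stream free lists ordered by recency of use, and when stealing, pick the least-recently-used stream, on the theory that its outstanding work is most likely already finished, making the required synchronization cheap.

```python
from collections import OrderedDict

class StreamFreeLists:
    """Toy sketch of the LRU idea: per-stream free lists kept in an
    OrderedDict ordered by recency of use; steal from the least-recently-
    used stream, whose work is most likely already complete."""

    def __init__(self):
        self._lists = OrderedDict()      # stream id -> list of free blocks

    def deallocate(self, stream, block):
        self._lists.setdefault(stream, []).append(block)
        self._lists.move_to_end(stream)  # this stream is now most recent

    def allocate(self, stream):
        blocks = self._lists.get(stream)
        if blocks:                       # fast path: reuse own block, no sync
            self._lists.move_to_end(stream)
            return blocks.pop(), None
        for victim, blocks in self._lists.items():  # iterates in LRU order
            if blocks:
                # Stealing requires synchronizing with `victim`; the LRU
                # victim has been idle longest, so the sync should be cheap.
                return blocks.pop(), victim
        raise MemoryError("no free blocks")

fl = StreamFreeLists()
fl.deallocate("s0", "b0")
fl.deallocate("s1", "b1")
block, victim = fl.allocate("s2")        # steals from s0, the LRU stream
```

In the real resource the "sync" would be a cudaEvent wait on the victim stream; the sketch only captures the victim-selection policy.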
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.