
[FEA] Consider creating a pinned host memory resource that can leverage all of the allocation strategies

Open felipeblazing opened this issue 5 years ago • 18 comments

Is your feature request related to a problem? Please describe. I wish I could allocate a pool of pinned host memory for allocations that will often be short-lived. An example of when we would use this is sending data between two nodes over the TCP protocol: we have to copy the memory to host in order to send it over the wire, and a pinned memory buffer would make this more efficient.

Describe the solution you'd like Allow the memory resources to either specify particular low-level allocators, or provide some other way of doing this just for the specific case of pinned memory.

Describe alternatives you've considered We currently have a fixed-size pinned memory pool that we use for this situation.

felipeblazing avatar Oct 29 '20 03:10 felipeblazing

Since pinned host memory is accessible as device memory, I think we should create a pinned_memory_resource that is a device_memory_resource. Then it can be used as upstream of any other resource. The only sticking point I can think of is naming, since we already have a pinned_memory_resource that is a host_memory_resource.
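A minimal sketch of what that might look like (my illustration, not code from this thread), assuming RMM's `device_memory_resource` interface with `do_allocate`/`do_deallocate` overrides that take a `cuda_stream_view`:

```cpp
// Hypothetical sketch: pinned host allocation exposed through the
// device_memory_resource interface so it can serve as the upstream of other
// resources (e.g. pools). Depending on the RMM version, additional virtuals
// (supports_streams, supports_get_mem_info, do_get_mem_info) may also need
// to be overridden.
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <new>

class pinned_host_device_memory_resource final : public rmm::mr::device_memory_resource {
 private:
  // Note: allocation is not truly stream-ordered here; cudaMallocHost is synchronous.
  void* do_allocate(std::size_t bytes, rmm::cuda_stream_view) override
  {
    void* ptr{nullptr};
    if (cudaMallocHost(&ptr, bytes) != cudaSuccess) { throw std::bad_alloc{}; }
    return ptr;
  }

  void do_deallocate(void* ptr, std::size_t, rmm::cuda_stream_view) override
  {
    cudaFreeHost(ptr);
  }

  bool do_is_equal(rmm::mr::device_memory_resource const& other) const noexcept override
  {
    return dynamic_cast<pinned_host_device_memory_resource const*>(&other) != nullptr;
  }
};
```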

harrism avatar Oct 29 '20 03:10 harrism

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Feb 16 '21 17:02 github-actions[bot]

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions[bot] avatar Feb 16 '21 17:02 github-actions[bot]

This is still relevant, but now depends on https://github.com/NVIDIA/libcudacxx/pull/105

harrism avatar Feb 16 '21 23:02 harrism

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Mar 18 '21 23:03 github-actions[bot]

Maybe relevant to cupy/cupy#4892?

leofang avatar Mar 19 '21 00:03 leofang

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Apr 18 '21 01:04 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Nov 18 '21 18:11 github-actions[bot]

I'll take a crack at this since we also need it for Spark.

rongou avatar Jan 19 '22 01:01 rongou

If you proceed ahead of cuda::memory_resource, this will just create more work for us to refactor later.

harrism avatar Jan 19 '22 01:01 harrism

Is there an ETA on cuda::memory_resource?

rongou avatar Jan 19 '22 02:01 rongou

I thought we were close before Christmas. But we keep running into more design issues.

harrism avatar Jan 19 '22 04:01 harrism

Admittedly there may not be a clear answer to this, but are there ways for others interested in this to help pull cuda::memory_resource across the finish line?

jakirkham avatar Jan 19 '22 05:01 jakirkham

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Feb 18 '22 06:02 github-actions[bot]

Chatted with @jrhemstad offline, one option is to move the current pinned_memory_resource from host to device, then we can wrap it with the existing pools.

@abellina

rongou avatar Mar 10 '22 19:03 rongou

Chatted with @jrhemstad offline, one option is to move the current pinned_memory_resource from host to device, then we can wrap it with the existing pools.

No need to move it, just make a new one that inherits from device_memory_resource instead of host_memory_resource. I think we could even make the existing one inherit from both, since the functions have different signatures in host vs. device.
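For illustration, a device-side pinned resource like the hypothetical `pinned_host_device_memory_resource` sketched earlier could then be wrapped by the existing pool. This is a sketch only; the exact `pool_memory_resource` constructor arguments vary between RMM releases:

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

void example()
{
  // Hypothetical pinned resource from the earlier sketch.
  pinned_host_device_memory_resource pinned_mr;

  // Reuse RMM's existing pool logic on top of pinned host memory.
  // Initial/maximum pool size arguments differ between RMM versions.
  rmm::mr::pool_memory_resource<pinned_host_device_memory_resource> pinned_pool{
    &pinned_mr, 1ull << 30};  // 1 GiB initial pool

  // Allocations are now suballocated from pinned host memory.
  void* buf = pinned_pool.allocate(1 << 20, rmm::cuda_stream_default);
  pinned_pool.deallocate(buf, 1 << 20, rmm::cuda_stream_default);
}
```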

jrhemstad avatar Mar 10 '22 19:03 jrhemstad

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Apr 09 '22 21:04 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Jul 08 '22 21:07 github-actions[bot]

Chatted with @jrhemstad offline, one option is to move the current pinned_memory_resource from host to device, then we can wrap it with the existing pools.

No need to move it, just make a new one that inherits from device_memory_resource instead of host_memory_resource. I think we could even make the existing one inherit from both, since the functions have different signatures in host vs. device.

I have been playing with this for a test in cuDF (https://github.com/rapidsai/cudf/issues/14314). I think what @jrhemstad said here about the APIs being different between host and device allocations is important. Actually, why would we want allocate(size, stream) and deallocate(size, stream), and why would we want to maintain the stream-ordered behavior of pool_memory_resource for the host?

Just for reference, what I did was hack out the streams and keep a single free_list for the whole pool. I then exposed just an allocate(size) that performed the same pool logic (allocating from blocks in the list, splitting, merging, etc.), but with the single pool.
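For illustration only (my sketch of the approach described above, not the actual prototype): a toy, non-stream-ordered pinned pool with a single free list, first-fit allocation, and block splitting. Merging of adjacent free blocks, alignment, pool growth, and thread safety are omitted for brevity.

```cpp
#include <cuda_runtime_api.h>

#include <cstddef>
#include <list>
#include <new>
#include <unordered_map>

class simple_pinned_pool {
 public:
  explicit simple_pinned_pool(std::size_t pool_size)
  {
    if (cudaMallocHost(&base_, pool_size) != cudaSuccess) { throw std::bad_alloc{}; }
    free_list_.push_back({static_cast<char*>(base_), pool_size});
  }

  ~simple_pinned_pool() { cudaFreeHost(base_); }

  void* allocate(std::size_t bytes)
  {
    // First-fit search over the single free list.
    for (auto it = free_list_.begin(); it != free_list_.end(); ++it) {
      if (it->size >= bytes) {
        char* ptr = it->ptr;
        if (it->size == bytes) {
          free_list_.erase(it);  // exact fit: consume the block
        } else {
          it->ptr += bytes;      // split: shrink the block in place
          it->size -= bytes;
        }
        allocated_[ptr] = bytes;
        return ptr;
      }
    }
    throw std::bad_alloc{};  // no block large enough
  }

  void deallocate(void* ptr)
  {
    auto it = allocated_.find(static_cast<char*>(ptr));
    if (it == allocated_.end()) { return; }
    // Return the block to the free list (a real pool would merge neighbors here).
    free_list_.push_back({it->first, it->second});
    allocated_.erase(it);
  }

 private:
  struct block {
    char* ptr;
    std::size_t size;
  };

  void* base_{nullptr};
  std::list<block> free_list_;
  std::unordered_map<char*, std::size_t> allocated_;
};
```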

I do think we want to invest time on this given the results I am seeing in https://github.com/rapidsai/cudf/issues/14314, so I'd be curious to know if anyone has cycles to take a look at this.

abellina avatar Oct 23 '23 18:10 abellina

Actually, why would we want allocate(size, stream) and deallocate(size, stream), and why would we want to maintain the stream-ordered behavior of pool_memory_resource for the host?

Because pinned host memory is accessed by device copies and kernels, which are stream ordered. Any non-stream-ordered allocation would have to synchronize, losing a lot of the benefits of stream-ordered allocation. I think it would be a mistake not to provide a stream-ordered pinned host memory pool.
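To illustrate why (a hypothetical usage pattern of mine, assuming a stream-ordered pinned resource exposed through `device_memory_resource`; the function and names here are not from RMM):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>

// With a stream-ordered pinned pool, the staging buffer's lifetime is tied to the
// same stream as the copies that use it, so no cudaStreamSynchronize is needed
// between freeing the buffer and the pool reusing that memory for later work.
void stage_to_host(rmm::mr::device_memory_resource* pinned_mr,
                   void const* device_src,
                   std::size_t bytes,
                   rmm::cuda_stream_view stream)
{
  void* staging = pinned_mr->allocate(bytes, stream);
  cudaMemcpyAsync(staging, device_src, bytes, cudaMemcpyDeviceToHost, stream.value());
  // ... enqueue whatever consumes `staging` in stream order (e.g. a host callback) ...
  pinned_mr->deallocate(staging, bytes, stream);  // stream-ordered free, no sync
}
```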

so I'd be curious to know if anyone has cycles to take a look at this.

I am planning to work on this because I am also finding it is needed to improve spilling performance for Python. Current plan is to do it once we adopt cuda::memory_resource, because it turns the kind of memory allocated by the MR into a (template) parameter, rather than having separate host_mr and device_mr types. This enables us to use any machinery built for allocating one kind of memory (e.g. device memory) for any other memory kind. cuda::memory_resource also separates synchronous and stream-ordered allocation APIs.
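For a rough sense of the direction (a sketch of mine based on libcudacxx's experimental cuda::mr interfaces as I understand them; treat the names, concepts, and enabling macro as assumptions rather than the final API), the memory kind becomes a queryable property of a single resource type rather than a separate class hierarchy:

```cpp
// Sketch only: assumes libcudacxx's experimental <cuda/memory_resource>
// (e.g. built with LIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE), where
// accessibility is advertised via get_property friend overloads instead of
// separate host/device base classes.
#include <cuda/memory_resource>
#include <cuda_runtime_api.h>

#include <cstddef>
#include <new>

struct pinned_resource {
  void* allocate(std::size_t bytes, std::size_t /*alignment*/)
  {
    void* ptr{nullptr};
    if (cudaMallocHost(&ptr, bytes) != cudaSuccess) { throw std::bad_alloc{}; }
    return ptr;
  }

  void deallocate(void* ptr, std::size_t, std::size_t) noexcept { cudaFreeHost(ptr); }

  bool operator==(pinned_resource const&) const { return true; }
  bool operator!=(pinned_resource const&) const { return false; }

  // Pinned host memory is accessible from both host and device, so one
  // resource can advertise both properties.
  friend void get_property(pinned_resource const&, cuda::mr::host_accessible) noexcept {}
  friend void get_property(pinned_resource const&, cuda::mr::device_accessible) noexcept {}
};

// Generic code can then require only the property it needs, e.g.:
// static_assert(cuda::mr::resource_with<pinned_resource, cuda::mr::device_accessible>);
```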

That said, we should also look into timing and cost (including opportunity cost) and consider:

  1. whether a temporary stop-gap like described above is worth committing to RMM
  2. whether we should just wait for cuda::memory_resource to be merged and then build it "the right way"
  3. whether we should provide a stop-gap in a non-release branch of RMM and then do (2).

harrism avatar Oct 24 '23 22:10 harrism