[FEA] Support stream semantics in rmm::mr::pool_memory_resource with rmm::mr::pinned_memory_resource as an upstream
Is your feature request related to a problem? Please describe.
{
// mr here is rmm::mr::pool_memory_resource<rmm::mr::pinned_memory_resource>
rmm::device_uvector<int> tmps(1024, stream, mr);
thrust::for_each(rmm::exec_policy_nosync(stream), tmps.begin(), tmps.end(),
[]__device__(auto val) { ... });
}
Here, users may expect tmps to follow stream semantics, but it is deallocated as soon as the variable goes out of scope, possibly before the thrust::for_each call has finished executing on the stream.
This behavior can be surprising, and the fix requires an explicit stream synchronization, which may add unnecessary performance overhead.
Describe the solution you'd like
rmm::mr::pool_memory_resource should follow stream semantics even when the upstream memory allocator is (host) pinned_memory_resource.
Describe alternatives you've considered
Explicitly calling cudaStreamSynchronize().
rmm will be moving to use cudaMallocFromPoolAsync for different memory types, which would resolve this problem. I'm not sure if we have support for everything that we need to do this in CUDA 12, but we certainly do in CUDA 13.
https://github.com/rapidsai/rmm/issues/2054#issuecomment-3339881396 describes a possible solution. It seems like we might already have what we need to do this in CUDA 12.