Implement `any_resource_ref`
This implements the NVIDIA-internal proposal for an `any_resource_ref` abstraction that facilitates handling memory allocations on both the host and the device side.

To ease adoption, the proposal is backported to C++14/C++17 through a concept emulation designed by @ericniebler. Unfortunately, C++11 is too far a stretch due to missing support for constexpr variables.
Currently only the plain `any_resource_ref` resource is implemented, as `async_resource_ref` is more or less a straightforward extension.
Things still to do:

- [x] Implement `async_resource_ref`
- [x] Implement the `equality_comparable` concept
- [ ] Write some documentation
fyi @miscco we'll likely want to steal some things from https://github.com/NVIDIA/libcudacxx/pull/158, like `stream_view` (though we'd probably rename it to `stream_ref`) and maybe some of the exception stuff. It also has quite a few tests that could be helpful for showing what kind of tests we want here.
NOTE to reviewers:
I have pushed the concept emulation into its own branch and targeted this PR at that branch. Consequently, we can review only the `memory_resource` part of the PR. Before merging we would need to change the target branch back to `main`.
@griwes I pushed a small change to the interface of `{async_}resource_ref` after your review. We now also accept `allocate(size)` and `allocate_async(size, stream_ref)`, and default to `alignof(max_align_t)` for the alignment in that case.
I know we have discussed this in the past, but the default alignment warrants one last look. `cudaMalloc`'s default guaranteed alignment is 256B, whereas `alignof(max_align_t) == 16`. `alignof(max_align_t)` would be more consistent with `std::pmr::memory_resource`, whereas 256B would be more consistent with CUDA.
People often rely on the over-alignment of `cudaMalloc` allocations for a variety of reasons, mostly performance. Granted, anyone can always explicitly request a 256B alignment, but people may find it annoying to have to be explicit about it every time when `cudaMalloc`/`cudaMallocManaged`/etc. give it to them implicitly.
Defaulting to 256 would be disastrous for both CPU-side resources (the malloc one would overallocate by 255 bytes every time!) and for pool resources for the GPU (because now you can't subdivide a single block of GPU memory into chunks smaller than 256 bytes). 16 is the only sane default if you ask me.
Whatever default we choose, it's going to be annoying to have to spell something else out for some use cases. What we could do to reduce this a tiny little bit, at least for `resource_ref`, is put the alignment in a template parameter of the type and figure out aliases with/without it defaulted, but IDK if that's really worth it.
Yeah, every default we choose will be the wrong one.

That said, from my perspective there are two pretty fine solutions for users:

- Provide the desired alignment explicitly.
- Ignore the passed alignment in their user-provided allocation function. If they want 256 anyhow, they can do so just fine.
@wmaxey Is this the right way to test host-only functionality, or do I need to put the tests into the `heterogeneous` subfolder?