CUDA <memory_resource> Extension Proposal
PR Tracker
- [ ] https://github.com/NVIDIA/libcudacxx/pull/105 Implement memory_resource, stream_ordered_memory_resource, stream_view, and memory_kind
- [ ] TODO Implement derived resource implementations
- [ ] TODO Design/Implement allocator for cuda::memory_resource
- [ ] TODO Design/Implement allocator/allocator_traits for cuda::stream_ordered_memory_resource
Motivation
Performance-sensitive applications that make frequent dynamic memory allocations often find allocation to be a significant overhead. CUDA developers are even more acutely aware of these costs because cudaMalloc/cudaFree are relatively more expensive than standard malloc/free. As a result, developers devise custom, high-performance memory allocators that are as optimized as the applications they serve. However, what works well for one application will not always satisfy another, which leads to a proliferation of custom allocator implementations. Interoperation among these applications is difficult without an interface that enables sharing a common allocator.
In Standard C++, Allocator has traditionally provided this common interface. C++17 introduced <memory_resource> and the std::pmr::memory_resource abstract class, which defines a minimal interface for (de)allocating raw bytes and sits below Allocator. This optionally polymorphic interface provides a standard way to define, expose, and share custom memory allocation.
However, the std::pmr::memory_resource interface is insufficient to capture the unique features of the CUDA C++ programming model. For example, Standard C++ recognizes only a single, universally accessible memory space, whereas CUDA C++ applications may access at least four different kinds of dynamically allocated memory. Furthermore, CUDA's stream-based asynchronous execution model was extended in CUDA 11.2 with the addition of cudaMallocAsync and cudaFreeAsync [1], which provide stream-ordered memory allocation and deallocation. Therefore, there is a need for a common allocator interface similar to std::pmr::memory_resource that accounts for the unique features of CUDA C++.
[1]: Note that cudaMallocAsync obviates neither the need for custom, CUDA-aware allocators, nor the need for a common allocation interface. There will never be one allocator that satisfies all users. Furthermore, a common interface allows composing and layering utilities like logging, limiting, leak checking, and tracking.
Description
We propose extending <memory_resource> to provide a common memory allocation interface that meets the needs of CUDA C++ programmers.
We chose <memory_resource> as the basis for a CUDA-specific allocator interface for several reasons:
- <memory_resource> is the direction taken by Standard C++ for custom, stateful allocators. An allocator interface with a common look and feel will ease working between Standard and CUDA C++. For more information on <memory_resource>, see here and here.
- The RAPIDS Memory Management library has had three years of success using its rmm::device_memory_resource interface based on std::pmr::memory_resource.
- Likewise, Thrust has had similar success with its thrust::mr::memory_resource interface.
Given the direction of Standard C++ and the success of two widely used CUDA libraries with a similar interface, <memory_resource> is the logical choice.
This proposal includes the addition of the following to libcu++:
cuda::memory_kind
A scoped enumeration demarcating the different kinds of dynamically allocated CUDA memory. This is intended to be similar to the existing thread_scope enum.
enum class memory_kind {
device, ///< Device memory accessible only from device
unified, ///< Unified memory accessible from both host and device
pinned, ///< Page-locked system memory accessible from both host and device
host ///< System memory only accessible from host code
};
cuda::stream_view
Similar to std::span or std::string_view, cuda::stream_view is a strongly typed, non-owning view type for cudaStream_t. This type provides a more type-safe C++ wrapper around cudaStream_t and serves as the input argument type for any libcu++ API that takes a CUDA stream.
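A minimal sketch of what such a wrapper might look like (the get() accessor and exact constructors are illustrative assumptions; the interface in the implementing PR may differ):
#include <cuda_runtime_api.h>

class stream_view {
public:
  constexpr stream_view() noexcept = default;                      // views the default stream
  constexpr stream_view(cudaStream_t s) noexcept : stream_{s} {}   // implicit: any cudaStream_t is viewable
  constexpr cudaStream_t get() const noexcept { return stream_; }  // access the underlying handle
private:
  cudaStream_t stream_{};  // non-owning: the viewed stream is neither created nor destroyed
};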
cuda::memory_resource
The cuda::memory_resource class template is the abstract base class interface akin to std::pmr::memory_resource, with two main differences:
- The Kind template parameter determines the memory_kind allocated by the resource.
- The Context template parameter determines the "execution context" in which memory allocated by the resource can be accessed without synchronization. By default, the Context is the any_context tag type, which indicates that storage may be accessed immediately on any thread or CUDA stream without synchronization.
/**
* @brief Tag type for the default context of `memory_resource`.
*
* Default context in which storage may be used immediately on any thread or any
* CUDA stream without synchronization.
*/
struct any_context{};
template <memory_kind Kind, typename Context = any_context>
class memory_resource{
public:
void* allocate(size_t n, size_t alignment){ return do_allocate(n, alignment); }
void deallocate(void * p, size_t n, size_t alignment){ return do_deallocate(p, n, alignment); }
Context get_context(){ return do_get_context(); }
private:
virtual void* do_allocate(size_t n, size_t alignment) = 0;
virtual void do_deallocate(void* p, size_t n, size_t alignment) = 0;
virtual Context do_get_context() = 0;
};
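To illustrate the interface, a host-memory resource built on operator new/delete might be derived as follows (a minimal sketch under the default any_context; the actual cuda::new_delete_resource listed later may differ):
#include <cstddef>
#include <new>

class new_delete_resource final : public memory_resource<memory_kind::host> {
private:
  void* do_allocate(size_t n, size_t alignment) override {
    return ::operator new(n, std::align_val_t{alignment});  // aligned host allocation
  }
  void do_deallocate(void* p, size_t n, size_t alignment) override {
    ::operator delete(p, n, std::align_val_t{alignment});   // matching sized, aligned deallocation
  }
  any_context do_get_context() override { return {}; }      // accessible anywhere without synchronization
};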
The purpose of the Context template parameter is to allow for more generic allocation semantics. For example, consider a "stream-bound" memory resource where allocated memory may only be accessed without synchronization on a particular stream bound at construction:
struct stream_context{
cuda::stream_view s;
};
template <memory_kind Kind>
class stream_bound_memory_resource : public cuda::memory_resource<Kind, stream_context>{
public:
stream_bound_memory_resource(cuda::stream_view s) : s_{s} {}
private:
void* do_allocate(size_t n, size_t alignment) override { /* always allocate on s_ */ }
void do_deallocate(void* p, size_t n, size_t alignment) override { /* always deallocate on s_ */ }
stream_context do_get_context() override { return s_; }
stream_context s_;
};
cuda::pmr_adaptor
cuda::memory_resource is similar to std::pmr::memory_resource, but they do not share a common inheritance hierarchy. Therefore, an object that derives from cuda::memory_resource cannot be used polymorphically as a std::pmr::memory_resource; i.e., a cuda::memory_resource-derived type cannot be passed to a function that expects a std::pmr::memory_resource pointer or reference. However, there may be situations where one wishes to use a cuda::memory_resource-derived type as if it were a std::pmr::memory_resource-derived type. The cuda::pmr_adaptor class provides this functionality by inheriting from std::pmr::memory_resource and adapting an appropriate cuda::memory_resource.
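A minimal sketch of how such an adaptor could look, assuming the cuda::memory_resource interface shown above (the template parameter, ownership semantics, and equality behavior are illustrative assumptions):
#include <cstddef>
#include <memory_resource>

template <typename Resource>  // e.g. a cuda::memory_resource<memory_kind::host>-derived type
class pmr_adaptor final : public std::pmr::memory_resource {
public:
  explicit pmr_adaptor(Resource* r) noexcept : resource_{r} {}
private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    return resource_->allocate(bytes, alignment);  // forward to the adapted resource
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    resource_->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(std::pmr::memory_resource const& other) const noexcept override {
    return this == &other;                         // conservative: equal only to itself
  }
  Resource* resource_;  // non-owning
};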
cuda::stream_ordered_memory_resource
The cuda::stream_ordered_memory_resource class template is the abstract base class interface for stream-ordered memory allocation. This is similar to cuda::memory_resource, but allocate_async and deallocate_async both take a stream argument and follow stream-ordered memory allocation semantics as defined by cudaMallocAsync.
template <memory_kind Kind>
class stream_ordered_memory_resource : public memory_resource<Kind /* default context */>
{
public:
static constexpr size_t default_alignment = alignof(max_align_t);
// Two overloads exist so that callers can still implicitly use the `default_alignment` when passing a stream
void* allocate_async(size_t n, cuda::stream_view s){ return do_allocate_async(n, default_alignment, s); }
void* allocate_async(size_t n, size_t alignment, cuda::stream_view s){ return do_allocate_async(n, alignment, s); }
void deallocate_async(void* p, size_t n, cuda::stream_view s){ return do_deallocate_async(p, n, default_alignment, s); }
void deallocate_async(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate_async(p, n, alignment, s); }
private:
virtual void* do_allocate_async(size_t n, size_t alignment, cuda::stream_view s) = 0;
virtual void do_deallocate_async(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
};
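For example, the overloads above let callers rely on the default alignment or specify one explicitly (a usage sketch; mr and s are assumed to be supplied by the caller):
void example(stream_ordered_memory_resource<memory_kind::device>* mr, cuda::stream_view s) {
  void* a = mr->allocate_async(256, s);      // uses default_alignment
  void* b = mr->allocate_async(256, 64, s);  // explicit 64-byte alignment
  mr->deallocate_async(b, 256, 64, s);       // deallocation is also stream-ordered on s
  mr->deallocate_async(a, 256, s);
}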
Concrete Resource Implementations:
Just as <memory_resource> provides concrete, derived implementations of std::pmr::memory_resource, libcu++ will provide the following:
- cuda::new_delete_resource : public cuda::memory_resource<memory_kind::host>
  - Uses ::operator new() / ::operator delete() for allocating host memory
- cuda::device_resource : public cuda::memory_resource<memory_kind::device>
  - Uses cudaMalloc/cudaFree for allocating device memory
- cuda::unified_resource : public cuda::memory_resource<memory_kind::unified>
  - Uses cudaMallocManaged/cudaFree for unified memory
- cuda::pinned_resource : public cuda::memory_resource<memory_kind::pinned>
  - Uses cudaMallocHost/cudaFreeHost for page-locked host memory
- cuda::async_device_resource : public cuda::stream_ordered_memory_resource<memory_kind::device>
  - Uses cudaMallocAsync/cudaFreeAsync for device memory
Other resource implementations may be added as deemed appropriate.
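As an illustration, a sketch of how the stream-ordered device resource could be built on cudaMallocAsync/cudaFreeAsync follows (assuming stream_view exposes its underlying cudaStream_t, e.g. via a get() accessor as in the sketch earlier; error handling and the real implementation will differ):
#include <cuda_runtime_api.h>
#include <new>

class async_device_resource final
  : public stream_ordered_memory_resource<memory_kind::device> {
private:
  void* do_allocate_async(size_t n, size_t /*alignment*/, cuda::stream_view s) override {
    void* p = nullptr;
    if (cudaMallocAsync(&p, n, s.get()) != cudaSuccess)  // stream-ordered allocation on s
      throw std::bad_alloc{};
    return p;
  }
  void do_deallocate_async(void* p, size_t /*n*/, size_t /*alignment*/, cuda::stream_view s) override {
    cudaFreeAsync(p, s.get());                           // stream-ordered deallocation on s
  }
};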
cuda:: Namespace Policy
The current policy of libcu++ is that everything in the cuda:: namespace must be heterogeneous, i.e., __host__ __device__. The facilities described above in <cuda/memory_resource> are intended to be host-only at this time. Therefore, we propose to modify the policy to allow host-only constructs in cuda::. Device-only constructs will still be disallowed in cuda::; any device-only construct would go into cuda::device::.
Future Work
Future work will include the design of allocators similar to std::pmr::polymorphic_allocator to work with cuda::memory_resource and cuda::stream_ordered_memory_resource.
Likewise, containers that work with cuda::memory_resource and cuda::stream_ordered_memory_resource will be future work.
There's a usability issue with the current design of memory_kind and templating memory_resource on memory_kind.
Let's say I have a function that expects a resource that can allocate memory accessible from the device. For the sake of argument, let's say I don't care whether it's managed, regular device memory, or pinned; all I care about is that the memory is accessible from the device.
I might write my function like:
void foo(cuda::memory_resource<memory_kind::device>* mr);
But let's say I had a managed memory resource like:
struct managed_memory_resource : public cuda::memory_resource<memory_kind::unified>;
I wouldn't be able to pass a pointer to an instance of managed_memory_resource to foo, even though it satisfies my requirement of allocating device-accessible memory.
Furthermore, it wouldn't be appropriate to have managed_memory_resource derive from both memory_resource<memory_kind::device> and memory_resource<memory_kind::unified>. That would be lying about what kind of memory it allocates, which can be important if the properties of the memory differ, e.g., you can't use IPC with memory_kind::unified today.
The final alternative might be to make it a template (or 3 explicit overloads):
template <memory_kind K>
enable_if_t<K == memory_kind::device or K == memory_kind::unified or K == memory_kind::pinned, void> foo(cuda::memory_resource<K>* mr);
This works, but if I want foo to be an API for a non-header-only library that builds a binary, then I'd have to explicitly instantiate foo for all 3 memory_kinds, which triples the size of my binary.
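Concretely, the library's binary would need to ship something like the following explicit instantiations (illustrative only; foo is the hypothetical function template above):
template void foo<memory_kind::device>(cuda::memory_resource<memory_kind::device>*);
template void foo<memory_kind::unified>(cuda::memory_resource<memory_kind::unified>*);
template void foo<memory_kind::pinned>(cuda::memory_resource<memory_kind::pinned>*);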
The current design is too rigid. As @harrism put it, the author of a memory resource wants to be able to specify the memory_kind it allocates, but a user of a memory resource would like the option to specify the required accessibility without over-specifying the memory_kind.
An alternative option is to add something I call resource_view, which deals in capabilities, in addition to memory_resource, which deals in kinds. Adding a common base class to all resources (with a common execution context), making that base private, and exposing it through the (friend) resource_view class would create a mechanism that is an alternative to inheritance. The compile-time part of this would be a templated constructor of the resource view.
enum class memory_caps : unsigned {
host_accessible = 0x01,
device_accessible = 0x02,
cross_device_accessible = 0x04,
oversubscribable = 0x08,
resident = 0x10,
has_host_storage = 0x20,
has_device_storage = 0x40,
};
constexpr memory_caps operator|(memory_caps a, memory_caps b) {
return (memory_caps)((unsigned)a|(unsigned)b);
}
constexpr memory_caps operator&(memory_caps a, memory_caps b) {
return (memory_caps)((unsigned)a&(unsigned)b);
}
enum class memory_kind {
host,
device,
pinned,
managed
};
template <memory_kind kind>
struct memory_kind_caps;
struct base_resource {
virtual void do_my_job() = 0;
};
template <memory_caps>
struct resource_view;
template <memory_kind _kind>
struct memory_resource : private base_resource {
static constexpr memory_kind kind = _kind;
void my_job() { do_my_job(); }
template <memory_caps view_caps>
friend class resource_view;
};
template <typename resource>
struct memory_resource_traits {
static constexpr memory_caps caps = memory_kind_caps<resource::kind>::value;
static constexpr memory_kind kind = resource::kind;
};
template <memory_caps caps>
struct resource_view {
template <memory_kind kind, typename = std::enable_if_t<(caps & memory_resource_traits<memory_resource<kind>>::caps) == caps>>
resource_view(memory_resource<kind> *resource) {
this->resource = resource;
}
template <memory_caps other_caps,
          typename = std::enable_if_t<(other_caps & caps) == caps>>
resource_view(resource_view<other_caps> other) {
this->resource = other.resource;
}
void my_job() {
resource->do_my_job();
}
private:
template <memory_caps> friend struct resource_view; // let the converting constructor access `resource` across instantiations
base_resource *resource;
};
template <>
struct memory_kind_caps<memory_kind::host> : std::integral_constant<
memory_caps,
memory_caps::host_accessible |
memory_caps::oversubscribable |
memory_caps::has_host_storage> {};
template <>
struct memory_kind_caps<memory_kind::pinned> : std::integral_constant<
memory_caps,
memory_caps::host_accessible |
memory_caps::device_accessible |
memory_caps::resident |
memory_caps::has_host_storage> {};
template <>
struct memory_kind_caps<memory_kind::device> : std::integral_constant<
memory_caps,
memory_caps::device_accessible |
memory_caps::resident |
memory_caps::has_device_storage>
{};
template <>
struct memory_kind_caps<memory_kind::managed> : std::integral_constant<
memory_caps,
memory_caps::host_accessible |
memory_caps::device_accessible |
memory_caps::oversubscribable |
memory_caps::has_host_storage |
memory_caps::has_device_storage>
{};
This structure allows client code that cares about a specific combination of capabilities to declare this by expecting a resource_view<caps> in its interface. This is important, as it simplifies binary distribution of library code using this interface. Also, since the capabilities are mere flags, adding new ones in the future does not break API or ABI.
Usage example:
#include <iostream>
struct managed_resource : memory_resource<memory_kind::managed> {
private:
void do_my_job() override { std::cout << "Managed" << std::endl; }
};
struct device_resource : memory_resource<memory_kind::device> {
private:
void do_my_job() override { std::cout << "Device" << std::endl; }
};
void foo(resource_view<memory_caps::host_accessible | memory_caps::has_host_storage> view) {
view.my_job();
}
void goo(resource_view<memory_caps::device_accessible | memory_caps::oversubscribable> view) {
view.my_job();
}
void boo(resource_view<memory_caps::device_accessible> view) {
view.my_job();
}
void loo(resource_view<memory_caps::device_accessible | memory_caps::oversubscribable> view) {
boo(view);
}
int main() {
managed_resource mr;
device_resource dr;
mr.my_job();
foo(&mr);
goo(&mr);
loo(&mr);
//foo(&dr); // compilation error
//goo(&dr); // compilation error
return 0;
}
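For reference, with the converting constructor reading other.resource as fixed above, this example prints "Managed" four times: once from mr.my_job() directly and once from each of foo, goo, and loo.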
FYI, this has become incredibly out of date with the current design. I will be working on updating the design document.