CUDA <memory_resource> Extension Proposal
PR Tracker
- [ ] https://github.com/NVIDIA/libcudacxx/pull/105 Implement memory_resource, stream_ordered_memory_resource, stream_view, and memory_kind
- [ ] TODO Implement derived resource implementations
- [ ] TODO Design/Implement allocator for cuda::memory_resource
- [ ] TODO Design/Implement allocator/allocator_traits for cuda::stream_ordered_memory_resource
Motivation
Performance-sensitive applications that make frequent dynamic memory allocations often find allocation to be a significant overhead. CUDA developers are even more acutely aware of these costs because cudaMalloc/cudaFree are relatively more expensive than standard malloc/free. As a result, developers devise custom, high-performance memory allocators that are as optimized as the applications they serve. However, what works well for one application will not always satisfy another, which leads to a proliferation of custom allocator implementations. Interoperation among these applications is difficult without an interface that enables sharing a common allocator.
In Standard C++, Allocator has traditionally provided this common interface. C++17 introduced <memory_resource> and the std::pmr::memory_resource abstract class, which defines a minimal interface for (de)allocating raw bytes and sits below Allocator. This optionally polymorphic interface provides a standard way to define, expose, and share custom memory allocation.
However, the std::pmr::memory_resource interface is insufficient to capture the unique features of the CUDA C++ programming model. For example, Standard C++ recognizes only a single, universally accessible memory space, whereas CUDA C++ applications may access at least four different kinds of dynamically allocated memory. Furthermore, CUDA's stream-based asynchronous execution model was extended in CUDA 11.2 with the addition of cudaMallocAsync and cudaFreeAsync [1], which provide stream-ordered memory allocation and deallocation. Therefore, there is a need for a common allocator interface similar to std::pmr::memory_resource that accounts for the unique features of CUDA C++.
[1]: Note that cudaMallocAsync obviates neither the need for custom, CUDA-aware allocators, nor the need for a common allocation interface. There will never be one allocator that satisfies all users. Furthermore, a common interface allows composing and layering utilities like logging, limiting, leak checking, and tracking.
Description
We propose extending <memory_resource> to provide a common memory allocation interface that meets the needs of CUDA C++ programmers.
We chose <memory_resource> as the basis for a CUDA-specific allocator interface for several reasons:
- <memory_resource> is the direction taken by Standard C++ for custom, stateful allocators. An allocator interface with a common look and feel will ease working between Standard and CUDA C++. For more information on <memory_resource>, see here and here.
- The RAPIDS Memory Management library has had three years of success using its rmm::device_memory_resource interface based on std::pmr::memory_resource.
- Likewise, Thrust has had similar success with its thrust::mr::memory_resource interface.
Given the direction of Standard C++ and the success of two widely used CUDA libraries with a similar interface, <memory_resource> is the logical choice.
This proposal includes the addition of the following to libcu++:
cuda::memory_kind
A scoped enumeration demarcating the different kinds of dynamically allocated CUDA memory. This is intended to be similar to the existing thread_scope enum.
enum class memory_kind {
device, ///< Device memory accessible only from device
unified, ///< Unified memory accessible from both host and device
pinned, ///< Page-locked system memory accessible from both host and device
host ///< System memory only accessible from host code
};
cuda::stream_view
Similar to std::span or std::string_view, cuda::stream_view is a strongly typed, non-owning view type for cudaStream_t. This type provides a more type-safe C++ wrapper around cudaStream_t and serves as the input argument type for any libcu++ API that takes a CUDA stream.
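A minimal sketch of what such a wrapper might look like (the get() accessor and exact constructors are illustrative assumptions; the interface in the implementing PR may differ):
#include <cuda_runtime_api.h>

class stream_view {
public:
  constexpr stream_view() noexcept = default;                      // views the default stream
  constexpr stream_view(cudaStream_t s) noexcept : stream_{s} {}   // implicit: any cudaStream_t is viewable
  constexpr cudaStream_t get() const noexcept { return stream_; }  // access the underlying handle
private:
  cudaStream_t stream_{};  // non-owning: the viewed stream is neither created nor destroyed
};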
cuda::memory_resource
The cuda::memory_resource class template is the abstract base class interface akin to std::pmr::memory_resource, with two main differences:
- The Kind template parameter determines the memory_kind allocated by the resource.
- The Context template parameter determines the "execution context" in which memory allocated by the resource can be accessed without synchronization. By default, the Context is the any_context tag type, which indicates that storage may be accessed immediately on any thread or CUDA stream without synchronization.
/**
* @brief Tag type for the default context of `memory_resource`.
*
* Default context in which storage may be used immediately on any thread or any
* CUDA stream without synchronization.
*/
struct any_context{};
template <memory_kind Kind, typename Context = any_context>
class memory_resource{
public:
void* allocate(size_t n, size_t alignment){ return do_allocate(n, alignment); }
void deallocate(void * p, size_t n, size_t alignment){ return do_deallocate(p, n, alignment); }
Context get_context(){ return do_get_context(); }
private:
virtual void* do_allocate(size_t n, size_t alignment) = 0;
virtual void do_deallocate(void* p, size_t n, size_t alignment) = 0;
virtual Context do_get_context() = 0;
};
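To illustrate the interface, a host-memory resource built on operator new/delete might be derived as follows (a minimal sketch under the default any_context; the actual cuda::new_delete_resource listed later may differ):
#include <cstddef>
#include <new>

class new_delete_resource final : public memory_resource<memory_kind::host> {
private:
  void* do_allocate(size_t n, size_t alignment) override {
    return ::operator new(n, std::align_val_t{alignment});  // aligned host allocation
  }
  void do_deallocate(void* p, size_t n, size_t alignment) override {
    ::operator delete(p, n, std::align_val_t{alignment});   // matching sized, aligned deallocation
  }
  any_context do_get_context() override { return {}; }      // accessible anywhere without synchronization
};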
The purpose of the Context template parameter is to allow for more generic allocation semantics. For example, consider a "stream-bound" memory resource where allocated memory may only be accessed without synchronization on a particular stream bound at construction:
struct stream_context{
cuda::stream_view s;
};
template <memory_kind Kind>
class stream_bound_memory_resource : public cuda::memory_resource<Kind, stream_context>{
public:
stream_bound_memory_resource(cuda::stream_view s) : s_{s} {}
private:
void* do_allocate(size_t n, size_t alignment) override { /* always allocate on s_ */ }
void do_deallocate(void* p, size_t n, size_t alignment) override { /* always deallocate on s_ */ }
stream_context do_get_context() override { return s_; }
stream_context s_;
};
cuda::pmr_adaptor
cuda::memory_resource is similar to std::pmr::memory_resource, but they do not share a common inheritance hierarchy. Therefore, an object that derives from cuda::memory_resource cannot be used polymorphically as a std::pmr::memory_resource; i.e., a cuda::memory_resource-derived type cannot be passed to a function that expects a std::pmr::memory_resource pointer or reference. However, there may be situations where one wishes to use a cuda::memory_resource-derived type as if it were a std::pmr::memory_resource-derived type. The cuda::pmr_adaptor class provides this functionality by inheriting from std::pmr::memory_resource and adapting an appropriate cuda::memory_resource.
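A minimal sketch of how such an adaptor could look, assuming the cuda::memory_resource interface shown above (the template parameter, ownership semantics, and equality behavior are illustrative assumptions):
#include <cstddef>
#include <memory_resource>

template <typename Resource>  // e.g. a cuda::memory_resource<memory_kind::host>-derived type
class pmr_adaptor final : public std::pmr::memory_resource {
public:
  explicit pmr_adaptor(Resource* r) noexcept : resource_{r} {}
private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    return resource_->allocate(bytes, alignment);  // forward to the adapted resource
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    resource_->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(std::pmr::memory_resource const& other) const noexcept override {
    return this == &other;                         // conservative: equal only to itself
  }
  Resource* resource_;  // non-owning
};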
cuda::stream_ordered_memory_resource
The cuda::stream_ordered_memory_resource class template is the abstract base class interface for stream-ordered memory allocation. This is similar to cuda::memory_resource, but allocate_async and deallocate_async both take a stream argument and follow stream-ordered memory allocation semantics as defined by cudaMallocAsync.
template <memory_kind Kind>
class stream_ordered_memory_resource : public memory_resource<Kind /* default context */>
{
public:
static constexpr size_t default_alignment = alignof(max_align_t);
// Two overloads exist so that callers can still implicitly use the `default_alignment` when passing a stream
void* allocate_async(size_t n, cuda::stream_view s){ return do_allocate_async(n, default_alignment, s); }
void* allocate_async(size_t n, size_t alignment, cuda::stream_view s){ return do_allocate_async(n, alignment, s); }
void deallocate_async(void* p, size_t n, cuda::stream_view s){ return do_deallocate_async(p, n, default_alignment, s); }
void deallocate_async(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate_async(p, n, alignment, s); }
private:
virtual void* do_allocate_async(size_t n, size_t alignment, cuda::stream_view s) = 0;
virtual void do_deallocate_async(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
};
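For example, the overloads above let callers rely on the default alignment or specify one explicitly (a usage sketch; mr and s are assumed to be supplied by the caller):
void example(stream_ordered_memory_resource<memory_kind::device>* mr, cuda::stream_view s) {
  void* a = mr->allocate_async(256, s);      // uses default_alignment
  void* b = mr->allocate_async(256, 64, s);  // explicit 64-byte alignment
  mr->deallocate_async(b, 256, 64, s);       // deallocation is also stream-ordered on s
  mr->deallocate_async(a, 256, s);
}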
Concrete Resource Implementations:
Just as <memory_resource> provides concrete, derived implementations of std::pmr::memory_resource, libcu++ will provide the following:
- cuda::new_delete_resource : public cuda::memory_resource<memory_kind::host>
  - Uses ::operator new() / ::operator delete() for allocating host memory
- cuda::device_resource : public cuda::memory_resource<memory_kind::device>
  - Uses cudaMalloc/cudaFree for allocating device memory
- cuda::unified_resource : public cuda::memory_resource<memory_kind::unified>
  - Uses cudaMallocManaged/cudaFree for unified memory
- cuda::pinned_resource : public cuda::memory_resource<memory_kind::pinned>
  - Uses cudaMallocHost/cudaFreeHost for page-locked host memory
- cuda::async_device_resource : public cuda::stream_ordered_memory_resource<memory_kind::device>
  - Uses cudaMallocAsync/cudaFreeAsync for device memory
Other resource implementations may be added as deemed appropriate.
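As an illustration, a sketch of how the stream-ordered device resource could be built on cudaMallocAsync/cudaFreeAsync follows (assuming stream_view exposes its underlying cudaStream_t, e.g. via a get() accessor as in the sketch earlier; error handling and the real implementation will differ):
#include <cuda_runtime_api.h>
#include <new>

class async_device_resource final
  : public stream_ordered_memory_resource<memory_kind::device> {
private:
  void* do_allocate_async(size_t n, size_t /*alignment*/, cuda::stream_view s) override {
    void* p = nullptr;
    if (cudaMallocAsync(&p, n, s.get()) != cudaSuccess)  // stream-ordered allocation on s
      throw std::bad_alloc{};
    return p;
  }
  void do_deallocate_async(void* p, size_t /*n*/, size_t /*alignment*/, cuda::stream_view s) override {
    cudaFreeAsync(p, s.get());                           // stream-ordered deallocation on s
  }
};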
cuda:: Namespace Policy
The current policy of libcu++ is that everything in the cuda:: namespace must be heterogeneous, i.e., __host__ __device__. The facilities described above in <cuda/memory_resource> are intended to be host-only at this time. Therefore, we propose to modify the policy to allow host-only constructs in cuda::. Device-only constructs will still be disallowed in cuda::; any device-only construct would go into cuda::device::.
Future Work
Future work will include the design of allocators similar to std::pmr::polymorphic_allocator to work with cuda::memory_resource and cuda::stream_ordered_memory_resource.
Likewise, containers that work with cuda::memory_resource and cuda::stream_ordered_memory_resource will be future work.
There's a usability issue with the current design of memory_kind and templating memory_resource on memory_kind.
Let's say I have a function that expects a resource that can allocate memory accessible from the device. For the sake of argument, let's say I don't care whether it's managed, regular device memory, or pinned; all I care about is that the memory is accessible from the device.
I might write my function like:
void foo(cuda::memory_resource<memory_kind::device>* mr);
But let's say I had a managed memory resource like:
struct managed_memory_resource : public cuda::memory_resource<memory_kind::unified>;
I wouldn't be able to pass a pointer to an instance of managed_memory_resource to foo, even though it satisfies my requirement of allocating device-accessible memory.
Furthermore, it wouldn't be appropriate to have managed_memory_resource derive from both memory_resource<memory_kind::device> and memory_resource<memory_kind::unified>. That would be lying about what kind of memory it allocates, which can be important if the properties of the memory differ, e.g., you can't use IPC with memory_kind::unified today.
The final alternative might be to make it a template (or 3 explicit overloads):
template <memory_kind K>
enable_if_t<K == memory_kind::device or K == memory_kind::unified or K == memory_kind::pinned, void> foo(cuda::memory_resource<K>* mr);
This works, but if I want foo to be an API for a non-header-only library that builds a binary, then I'd have to explicitly instantiate foo for all 3 memory_kinds, which triples the size of my binary.
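Concretely, the library's binary would need to ship something like the following explicit instantiations (illustrative only; foo is the hypothetical function template above):
template void foo<memory_kind::device>(cuda::memory_resource<memory_kind::device>*);
template void foo<memory_kind::unified>(cuda::memory_resource<memory_kind::unified>*);
template void foo<memory_kind::pinned>(cuda::memory_resource<memory_kind::pinned>*);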
The current design is too rigid. As @harrism put it, the author of a memory resource wants to be able to specify the memory_kind it allocates, but a user of a memory resource would like the option to specify the required accessibility without over-specifying the memory_kind.
An alternative option is to add something I call resource_view, which deals in capabilities, in addition to memory_resource, which deals in kinds. Adding a common base class to all resources (with a common execution context), making that base private, and exposing it through the (friend) resource_view class would create a mechanism that is an alternative to inheritance. The compile-time part of this would be a templated constructor of the resource view.
enum class memory_caps : unsigned {
host_accessible = 0x01,
device_accessible = 0x02,
cross_device_accessible = 0x04,
oversubscribable = 0x08,
resident = 0x10,
has_host_storage = 0x20,
has_device_storage = 0x40,
};
constexpr memory_caps operator|(memory_caps a, memory_caps b) {
return (memory_caps)((unsigned)a|(unsigned)b);
}
constexpr memory_caps operator&(memory_caps a, memory_caps b) {
return (memory_caps)((unsigned)a&(unsigned)b);
}
enum class memory_kind {
host,
device,
pinned,
managed
};
template <memory_kind kind>
struct memory_kind_caps;
struct base_resource {
virtual void do_my_job() = 0;
};
template <memory_caps>
struct resource_view;
template <memory_kind _kind>
struct memory_resource : private base_resource {
static constexpr memory_kind kind = _kind;
void my_job() { do_my_job(); }
template <memory_caps view_caps>
friend class resource_view;
};
template <typename resource>
struct memory_resource_traits {
static constexpr memory_caps caps = memory_kind_caps<resource::kind>::value;
static constexpr memory_kind kind = resource::kind;
};
template <memory_caps caps>
struct resource_view {
template <memory_kind kind, typename = std::enable_if_t<(caps & memory_resource_traits<memory_resource<kind>>::caps) == caps>>
resource_view(memory_resource<kind> *resource) {
this->resource = resource;
}
template <memory_caps other_caps,
          typename = std::enable_if_t<(other_caps & caps) == caps>>
resource_view(resource_view<other_caps> other) {
this->resource = other.resource;
}
void my_job() {
resource->do_my_job();
}
private:
template <memory_caps> friend struct resource_view; // let the converting constructor access `resource` across instantiations
base_resource *resource;
};
template <>
struct memory_kind_caps<memory_kind::host> : std::integral_constant<
memory_caps,
memory_caps::host_accessible |
memory_caps::oversubscribable |
memory_caps::has_host_storage> {};
template <>
struct memory_kind_caps<memory_kind::pinned> : std::integral_constant<
memory_caps,
memory_caps::host_accessible |
memory_caps::device_accessible |
memory_caps::resident |
memory_caps::has_host_storage> {};
template <>
struct memory_kind_caps<memory_kind::device> : std::integral_constant<
memory_caps,
memory_caps::device_accessible |
memory_caps::resident |
memory_caps::has_device_storage>
{};
template <>
struct memory_kind_caps<memory_kind::managed> : std::integral_constant<
memory_caps,
memory_caps::host_accessible |
memory_caps::device_accessible |
memory_caps::oversubscribable |
memory_caps::has_host_storage |
memory_caps::has_device_storage>
{};
This structure allows client code that cares about a specific combination of capabilities to declare this by expecting a resource_view<caps> in its interface. This is important, as it simplifies binary distribution of library code using this interface. Also, since the capabilities are mere flags, adding new ones in the future does not break API or ABI.
Usage example:
#include <iostream>
struct managed_resource : memory_resource<memory_kind::managed> {
private:
void do_my_job() override { std::cout << "Managed" << std::endl; }
};
struct device_resource : memory_resource<memory_kind::device> {
private:
void do_my_job() override { std::cout << "Device" << std::endl; }
};
void foo(resource_view<memory_caps::host_accessible | memory_caps::has_host_storage> view) {
view.my_job();
}
void goo(resource_view<memory_caps::device_accessible | memory_caps::oversubscribable> view) {
view.my_job();
}
void boo(resource_view<memory_caps::device_accessible> view) {
view.my_job();
}
void loo(resource_view<memory_caps::device_accessible | memory_caps::oversubscribable> view) {
boo(view);
}
int main() {
managed_resource mr;
device_resource dr;
mr.my_job();
foo(&mr);
goo(&mr);
loo(&mr);
//foo(&dr); // compilation error
//goo(&dr); // compilation error
return 0;
}
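For reference, with the converting constructor reading other.resource as fixed above, this example prints "Managed" four times: once from mr.my_job() directly and once from each of foo, goo, and loo.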
FYI, this has become incredibly out of date with the current design. I will be working on updating the design document.