Option to skip event recording on stream-ordered deallocation
On this line we always record an event during a deallocation: https://github.com/rapidsai/rmm/blob/9d6c0f179578b42ed9627027879ec412745f85c3/cpp/include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp#L254
From my NVTX profiling on an RTX 4000 Ada, this event record looks like it accounts for roughly one-third to one-half of the cost of a deallocation when using a pool memory resource.
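A rough sketch of that kind of NVTX instrumentation (illustrative only; the sizes, iteration count, and pool configuration below are placeholders rather than my actual benchmark):

```cpp
#include <nvtx3/nvtx3.hpp>

#include <rmm/cuda_stream.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main() {
  rmm::mr::cuda_memory_resource base{};
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool{&base, 1u << 28};
  rmm::cuda_stream stream;

  for (int i = 0; i < 1000; ++i) {
    void* ptr{};
    {
      nvtx3::scoped_range r{"allocate"};
      ptr = pool.allocate(1 << 20, stream);
    }
    {
      // In Nsight Systems this range covers the whole stream-ordered free,
      // including the event record on the linked line.
      nvtx3::scoped_range r{"deallocate"};
      pool.deallocate(ptr, 1 << 20, stream);
    }
  }
  stream.synchronize();
  return 0;
}
```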
Based on the note and scanning the code, it seems this is useful to allow other streams to access memory freed on a given stream, or perhaps for the PTDS case mentioned?
Due to the extra cost, in code with lots of allocations/deallocations one is incentivized to try to make these event recordings happen while the GPU is saturated. This hides the extra cost, but can make the code less natural/expressive. For instance, if we have a few short kernels that use temporary buffers, we might want to defer releasing those buffers until some longer kernels have been launched. And we will want to release all our memory before syncing the stream, which might be before those objects would naturally go out of scope, or might not be possible if memory is used conditionally via code executed on the host (after the sync). For example, my use case is a Monte Carlo algorithm that expresses this kind of conditional behavior.
I'm not sure if you already have a plan for this TODO, but if not, would it be possible to have a toggle that defers these event recordings until they are needed? Presumably this would make sharing memory between streams less efficient, but allocating/deallocating within a stream more efficient. Another option could be an API that takes a pre-recorded event on the associated stream; that would allow frees of multiple memory buffers to incur only a single event recording. This seems pretty clunky in comparison, though, and might also incentivize the user to arrange deallocations unnaturally.
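Very roughly, the kind of deferral I mean, sketched from the user side (illustrative only, not an existing RMM feature; the name `deferred_free_adaptor` is made up): the adaptor queues frees and forwards them to the upstream in one batch at a chosen point, which moves the upstream's per-free event records in time, although it doesn't collapse them into a single event.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

#include <rmm/aligned.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/resource_ref.hpp>

// Illustrative only: queue deallocations and hand them to the upstream later,
// so the upstream's per-free bookkeeping (event record included) happens when
// flush() is called rather than when each buffer goes out of scope.
class deferred_free_adaptor final : public rmm::mr::device_memory_resource {
 public:
  explicit deferred_free_adaptor(rmm::device_async_resource_ref upstream)
    : upstream_{upstream} {}

  // Forward every queued free to the upstream, stream-ordered on `stream`.
  // Must be called before this adaptor is destroyed, or the memory leaks.
  void flush(rmm::cuda_stream_view stream) {
    std::lock_guard lock{mtx_};
    for (auto const& p : pending_) {
      upstream_.deallocate_async(p.ptr, p.bytes, rmm::CUDA_ALLOCATION_ALIGNMENT, stream);
    }
    pending_.clear();
  }

 private:
  struct pending_free { void* ptr; std::size_t bytes; };

  void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override {
    return upstream_.allocate_async(bytes, rmm::CUDA_ALLOCATION_ALIGNMENT, stream);
  }

  // Do not free yet; remember the block so flush() can release it later.
  void do_deallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view) override {
    std::lock_guard lock{mtx_};
    pending_.push_back({ptr, bytes});
  }

  rmm::device_async_resource_ref upstream_;
  std::mutex mtx_;
  std::vector<pending_free> pending_;
};
```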
For context, I'd like to switch to RMM over our legacy in-house memory pool, due to its versatility, stream-ordered semantics, and better alloc/dealloc speed. It would make the case more compelling if the performance difference between our pool and RMM's was clearer and required less careful arrangement of the deallocations.
I agonized over the cost of these event recordings for a long time when I originally wrote this code and later fixed race conditions that these events solve. I am no longer working on RMM, but I would advise against adding too much complexity here. My suggestions:
- Use `cuda_async_memory_resource`, which uses CUDA's built-in pool allocator (see the sketch after this list). This was written after RMM's pool, and is sometimes slower than RMM's pool, but it handles fragmentation better, and hey, it's the officially supported CUDA API, so I recommend it.
- If you want to batch up deallocations of multiple buffers at once, consider layering memory resources. You could add what's known as a "monotonic memory resource" that pre-allocates a pool from an upstream, suballocates from the pool, but never deletes individual suballocations. This MR's `deallocate()` is a no-op; the memory is only freed when the memory resource itself is deleted. So use the `pool_memory_resource` or `cuda_async_memory_resource` (or anything else, for that matter) as the upstream for your monotonic MRs, and create one monotonic MR for everything you want to batch-delete.
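A minimal usage sketch for the first suggestion; depending on your RMM version you may want to pass an initial pool size or release threshold to `cuda_async_memory_resource`, but the default constructor works:

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main() {
  // Backed by cudaMallocAsync/cudaFreeAsync; the driver owns the pool.
  rmm::mr::cuda_async_memory_resource async_mr{};

  // Either pass it explicitly to containers...
  rmm::cuda_stream stream;
  rmm::device_buffer buf(1 << 20, stream, &async_mr);

  // ...or install it as the default resource for the current device.
  rmm::mr::set_current_device_resource(&async_mr);

  stream.synchronize();
  return 0;
}
```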
In other words, rather than adding behavioral and interface complexity to the RMM components, build up that complexity through RMM's composability.
Mark
To echo Mark's comments: if you go with option (1) and find that for your workloads it is slower than RMM's `pool_memory_resource`, please report an issue so that we can take these workloads to the CUDA memory system folks.
For the monotonic allocator, often also called a bump allocator, maybe we should introduce one in RMM. Here's a simple example that uses a fixed-size bump allocator. So if you have a known bounded size up front for your temporary allocations, you can create a `bump_allocator(upstream_mr, ...)` and just use that for the temporaries. One note of caution: RMM always promises that allocations are 256-byte aligned, so the bounded size you need is `std::accumulate(sizes.begin(), sizes.end(), std::size_t{0}, [](std::size_t acc, std::size_t s) { return acc + rmm::align_up(s, rmm::CUDA_ALLOCATION_ALIGNMENT); });`
```cpp
#include <cstdint>
#include <mutex>
#include <numeric>

#include <rmm/aligned.hpp>
#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/detail/error.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/resource_ref.hpp>
template <typename Upstream>
class bump_allocator final : public rmm::mr::device_memory_resource {
public:
explicit bump_allocator(rmm::device_async_resource_ref upstream,
std::size_t arena_size, rmm::cuda_stream_view stream)
: upstream_{upstream},
arena_size_{rmm::align_up(arena_size, rmm::CUDA_ALLOCATION_ALIGNMENT)} {
std::lock_guard lock{mtx_};
arena_ = get_upstream_resource().allocate_async(arena_size_, 1, stream);
// If you want this to be usable on any stream, need to stream.synchronize()
// here. If you only care about usage on `stream`, then it doesn't matter.
}
explicit bump_allocator(Upstream *upstream, std::size_t arena_size,
rmm::cuda_stream_view stream)
: bump_allocator(rmm::device_async_resource_ref{upstream}, arena_size,
stream) {}
void reset() {
    // Forget about all suballocations so the arena can be reused.
    // Danger: all outstanding work using them must be synchronized first.
std::lock_guard lock{mtx_};
offset_ = 0;
}
void release(rmm::cuda_stream_view stream) {
std::lock_guard<std::mutex> lock{mtx_};
if (arena_ != nullptr) {
// If the allocations were made on any stream need to sync those
// streams before calling release. If streams are created with
// cudaStreamDefault (rather than cudaStreamNonBlocking) then it
// suffices to pass rmm::cuda_stream_default.
get_upstream_resource().deallocate_async(arena_, arena_size_, stream);
arena_ = nullptr;
}
}
bump_allocator() = delete;
bump_allocator(bump_allocator const &) = delete;
bump_allocator(bump_allocator &&) = delete;
bump_allocator &operator=(bump_allocator const &) = delete;
bump_allocator &operator=(bump_allocator &&) = delete;
~bump_allocator() override { release(rmm::cuda_stream_default); }
[[nodiscard]] rmm::device_async_resource_ref
get_upstream_resource() const noexcept {
return upstream_;
}
private:
[[nodiscard]] void *do_allocate(std::size_t bytes,
rmm::cuda_stream_view stream) override {
// RMM always promises to return pointers that have at least CUDA_ALLOCATION_ALIGNMENT.
// Since the arena allocation has this alignment, the simplest
// approach is just to make all suballocations have that alignment
// too.
auto const size = rmm::align_up(bytes, rmm::CUDA_ALLOCATION_ALIGNMENT);
std::lock_guard lock{mtx_};
RMM_EXPECTS(offset_ + size <= arena_size_,
"No space left in bump allocator");
void *ptr = reinterpret_cast<void *>(
reinterpret_cast<std::uintptr_t>(arena_) + offset_);
offset_ += size;
return ptr;
}
  // No-op: individual frees are ignored; memory is only reclaimed by reset()/release().
  void do_deallocate(void *, std::size_t, rmm::cuda_stream_view) override {}
private:
rmm::device_async_resource_ref upstream_;
std::size_t arena_size_;
mutable std::mutex mtx_;
void *arena_{nullptr};
std::ptrdiff_t offset_{0};
};
int main(void) {
auto base = rmm::mr::cuda_memory_resource{};
auto bumper = bump_allocator(&base, 1024, rmm::cuda_stream_default);
rmm::cuda_stream_default.synchronize();
  // The arena was synchronized above, so suballocations can be used on any stream.
  rmm::cuda_stream stream;
  {
    rmm::device_buffer buf(256, stream, bumper);
    rmm::device_buffer buf2(256, stream, bumper);
    rmm::device_buffer buf3(256, stream, bumper);
    rmm::device_buffer buf4(255, stream, bumper);
    // This will fail because the alignment requirement means buf4
    // actually pulled 256 bytes from upstream.
    rmm::device_buffer buf5(1, stream, bumper);
}
rmm::cuda_stream_default.synchronize();
return 0;
}
```
Thank you to both Mark and Lawrence for the quick replies.
Option 1 is indeed a bit slower than our existing in-house memory pool, so I moved on from that idea. What information would be needed to report an issue? I'm not sure how much I can provide.
I have been playing around with a monotonic buffer resource based on the stdlib version, but with forced 256-byte alignment and stream awareness. It took a little while to work out the kinks, mainly with the alignment, but I have it working now. It seems to provide the desired benefit over the deallocation cost of the standard pool, although it is tricky to use in a mixed-stream and mixed-scope environment. Currently I'm just limiting it to the most significant stream and falling back to the pool otherwise. The main drawback is the all-at-once deallocation: one must be careful not to use it with any buffers that outlive the monotonic resource. We have a few lazily created/updated persistent buffers along with the more well-scoped ones. This means that we can't simply set the default memory resource, but will instead need to explicitly request memory resources for different containers (or maybe we can get away with smaller scopes plus the default memory resource). It does seem workable, and perhaps it can be improved further with the right design on our end.
The implementation you provided looks nice and simple, but it is nice to have an exponentially increasing series of allocations instead of a single one so you don't need a strict max size up front. If the upstream is a pool resource then the ~log(n) allocations aren't much of a problem.
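Roughly, the growing variant I have in mind looks like the following sketch (not our actual implementation, and not an existing RMM class): the same bump strategy as the example above, but `do_allocate` grabs a new block from the upstream whenever the current one is exhausted, doubling the block size each time, so no strict maximum is needed up front. The same stream/synchronization caveats as in the fixed-size example apply.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

#include <rmm/aligned.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/resource_ref.hpp>

class growing_bump_allocator final : public rmm::mr::device_memory_resource {
 public:
  growing_bump_allocator(rmm::device_async_resource_ref upstream,
                         std::size_t initial_block_size,
                         rmm::cuda_stream_view stream)
    : upstream_{upstream},
      next_block_size_{rmm::align_up(initial_block_size, rmm::CUDA_ALLOCATION_ALIGNMENT)},
      stream_{stream} {}

  ~growing_bump_allocator() override { release(stream_); }

  // Return every block to the upstream. As with the fixed-size example, all
  // work using the suballocations must be synchronized/stream-ordered first.
  void release(rmm::cuda_stream_view stream) {
    std::lock_guard lock{mtx_};
    for (auto const& blk : blocks_) {
      upstream_.deallocate_async(blk.ptr, blk.size, rmm::CUDA_ALLOCATION_ALIGNMENT, stream);
    }
    blocks_.clear();
    offset_ = 0;
  }

 private:
  struct block { void* ptr; std::size_t size; };

  void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override {
    auto const size = rmm::align_up(bytes, rmm::CUDA_ALLOCATION_ALIGNMENT);
    std::lock_guard lock{mtx_};
    if (blocks_.empty() || offset_ + size > blocks_.back().size) {
      // Exhausted (or first use): fetch a geometrically larger block. Any
      // leftover space in the previous block is simply abandoned.
      auto const block_size = std::max(size, next_block_size_);
      void* new_block = upstream_.allocate_async(block_size, rmm::CUDA_ALLOCATION_ALIGNMENT, stream);
      blocks_.push_back({new_block, block_size});
      next_block_size_ = block_size * 2;
      offset_ = 0;
    }
    void* ptr = static_cast<char*>(blocks_.back().ptr) + offset_;
    offset_ += size;
    return ptr;
  }

  // Individual frees are ignored; memory comes back only via release().
  void do_deallocate(void*, std::size_t, rmm::cuda_stream_view) override {}

  rmm::device_async_resource_ref upstream_;
  std::size_t next_block_size_;
  rmm::cuda_stream_view stream_;
  std::mutex mtx_;
  std::vector<block> blocks_;
  std::size_t offset_{0};
};
```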
> Option 1 is indeed a bit slower than our existing in-house memory pool, so I moved on from that idea. What information would be needed to report an issue? I'm not sure how much I can provide.
Apologies for the late followup here.
One thing would be to provide a log of the allocation pattern in RMM (you can do this by wrapping your memory resource in a LoggingResourceAdaptor). This can then be used to replay the allocation patterns, which might be helpful.
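For example, a minimal sketch of that wrapping in C++ (`logging_resource_adaptor` is the C++ counterpart of the Python LoggingResourceAdaptor; the filename and upstream here are placeholders):

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/logging_resource_adaptor.hpp>

int main() {
  rmm::mr::cuda_async_memory_resource base{};

  // Every allocate/deallocate that flows through `logger` is appended to the
  // CSV log (action, pointer, size, stream), which can later be replayed.
  rmm::mr::logging_resource_adaptor<rmm::mr::cuda_async_memory_resource> logger{
      &base, "rmm_log.csv"};

  rmm::cuda_stream stream;
  rmm::device_buffer buf(1 << 20, stream, &logger);
  stream.synchronize();
  return 0;
}
```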
I can certainly believe that if you know more about your allocation and usage pattern that a specific allocator might nonetheless perform better than the general purpose driver implementation.
In terms of management of memory resources, my strong recommendation is to always be explicit about the memory resource you are using (rather than relying on rmm::set/get_current_device_resource).
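For example, something along these lines, where every container is handed its resource explicitly (the resource type here is just an example):

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/resource_ref.hpp>

// The memory resource is threaded through explicitly, so each allocation's
// owner is visible at the call site instead of hidden in global state.
void compute(rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr) {
  rmm::device_uvector<float> scratch(1024, stream, mr);
  rmm::device_buffer workspace(1 << 20, stream, mr);
  // ... launch kernels that use scratch/workspace on `stream` ...
}

int main() {
  rmm::mr::cuda_async_memory_resource mr{};
  rmm::cuda_stream stream;
  compute(stream, mr);
  stream.synchronize();
  return 0;
}
```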
> The implementation you provided looks nice and simple, but it is nice to have an exponentially increasing series of allocations instead of a single one so you don't need a strict max size up front. If the upstream is a pool resource then the ~log(n) allocations aren't much of a problem.
You might have luck with the `arena_memory_resource`, which splits allocations into per-stream arenas (so no event record/wait on deallocation) and offers a synchronising step to "defragment" after a phase of allocations.
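For example (the exact constructor arguments, such as an initial arena size, may differ between RMM versions):

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>

int main() {
  rmm::mr::cuda_memory_resource base{};

  // One global arena, carved into per-stream arenas on demand; frees on a
  // stream go back to that stream's arena without an event record.
  rmm::mr::arena_memory_resource<rmm::mr::cuda_memory_resource> arena{&base};

  rmm::cuda_stream stream_a;
  rmm::cuda_stream stream_b;
  rmm::device_buffer a(1 << 20, stream_a, &arena);
  rmm::device_buffer b(1 << 20, stream_b, &arena);

  stream_a.synchronize();
  stream_b.synchronize();
  return 0;
}
```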