[BUG] Fragmentation with cuda_async_memory_resource
**Describe the bug**
rmm::mr::cuda_async_memory_resource is showing signs of pretty severe fragmentation after allocating / deallocating ~300GB on an 80GB GPU. I was able to perform the same series of allocations, all on a single thread, using the synchronous cuda_memory_resource without running out of memory. With cuda_async_memory_resource it runs out of memory when ~30GB of the pool is still available but it cannot allocate ~110MB.
cudaMemPoolGetAttribute ( memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used ); gives a usage of 53514136688.
cudaMemPoolGetAttribute ( memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used ); shows a reservation of 84490059776.
A tracking_resource_adaptor.get_allocated_bytes() also shows 53514136688.
I am failing to allocate 132870728 bytes.
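For reference, here is a minimal standalone sketch (assuming device 0 and its default memory pool) of the attribute queries quoted above, printing used vs. reserved bytes and the reserved-but-unused difference:

```cpp
#include <cstdint>
#include <iostream>
#include <cuda_runtime.h>

int main() {
  cudaMemPool_t pool{};
  cudaDeviceGetDefaultMemPool(&pool, 0);

  // Both attributes are reported as 64-bit byte counts.
  std::uint64_t used = 0, reserved = 0;
  cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemCurrent, &used);
  cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reserved);

  std::cout << "used bytes:          " << used << "\n"
            << "reserved bytes:      " << reserved << "\n"
            << "reserved but unused: " << (reserved - used) << std::endl;
  return 0;
}
```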
**Steps/Code to reproduce bug**
rmm_test.zip. There is now a simpler test, rmm_simple_test.cpp, which just performs allocations and deallocations.
I have attached a test which is similar to rmm_replay except that when it runs out of memory it first synchronizes and tries again. If it still fails, it queues the allocation and the frees for the corresponding allocations until we free up 8x the size of the last failed allocation, at which point it resumes by pushing the queued allocations to the front of the allocation list and picking back up. If the amount of queued allocations grows above a certain threshold, currently set at 20GB via kMaxAllocationBytesQueue, it just fails and stops trying. The test consists of a single cpp file, a cmake file, and the rmm_log file which I am replaying to generate the error.
The binary is launched using
./rmm_test {path to log file} {cuda_async | bin_small_async_large}
When it fails to allocate it will output the measurements I showed above in the bug description section.
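For clarity, here is a simplified, hypothetical sketch of the retry/queue policy described above. The kMaxAllocationBytesQueue constant and the 8x heuristic come from the description; the real test in rmm_test.zip differs in details (for example it also defers the frees belonging to deferred allocations, which is elided here):

```cpp
#include <cstddef>
#include <deque>
#include <cuda_runtime.h>

struct Request { std::size_t size; bool allocate; };

bool Replay(std::deque<Request> requests) {
  constexpr std::size_t kMaxAllocationBytesQueue = 20ull << 30;  // ~20GB bail-out threshold
  std::deque<Request> deferred;        // allocations put aside after a failed retry
  std::size_t deferred_bytes   = 0;    // total bytes currently deferred
  std::size_t freed_since_fail = 0;    // bytes freed since the last failure
  std::size_t last_fail_size   = 0;    // size of the allocation that last failed

  while (!requests.empty()) {
    Request req = requests.front();
    requests.pop_front();

    if (req.allocate) {
      void* ptr = nullptr;
      cudaError_t err = cudaMallocAsync(&ptr, req.size, 0);
      if (err != cudaSuccess) {
        cudaStreamSynchronize(0);                  // let pending frees complete, then retry once
        err = cudaMallocAsync(&ptr, req.size, 0);
      }
      if (err != cudaSuccess) {
        last_fail_size   = req.size;
        freed_since_fail = 0;
        deferred.push_back(req);                   // put this allocation aside for later
        deferred_bytes += req.size;
        if (deferred_bytes > kMaxAllocationBytesQueue) { return false; }  // give up
        continue;
      }
      // ... record ptr so the matching free can find it ...
    } else {
      // ... cudaFreeAsync the matching pointer ...
      freed_since_fail += req.size;
    }

    // Once 8x the last failed size has been freed, resume the deferred
    // allocations by pushing them back to the front of the request list.
    if (!deferred.empty() && freed_since_fail >= 8 * last_fail_size) {
      while (!deferred.empty()) {
        requests.push_front(deferred.back());
        deferred.pop_back();
      }
      deferred_bytes = 0;
    }
  }
  return true;
}

int main() { return Replay(std::deque<Request>{}) ? 0 : 1; }
```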
**Expected behavior**
I would expect the allocator to be able to handle fragmentation, particularly when so much of the pool is still free, and to handle that fragmentation under the hood. From this blog we can read:
> If a memory allocation request made using cudaMallocAsync can’t be serviced due to fragmentation of the corresponding memory pool, the CUDA driver defragments the pool by remapping unused memory in the pool to a contiguous portion of the GPU’s virtual address space. Remapping existing pool memory instead of allocating new memory from the OS also helps keep the application’s memory footprint low.

Because of this, I would not expect fragmentation issues with my current use case.
**Environment details**
- Environment location: Bare Metal
- Method of RMM install: conda
**Additional context**
There are two variants I tried using cuda_async_memory_resource: one bins all allocations smaller than 4194304 bytes into a pool_memory_resource<cuda_memory_resource> and uses a cuda_async_memory_resource for larger allocations, and one just uses cuda_async_memory_resource for all allocations.
I figured keeping the smaller allocations out of the async pool would reduce the incidence of fragmentation, but that doesn't seem to have been enough. I am trying to figure out if there is anything I can do at run time to make sure the pool gets defragmented.
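For reference, a rough sketch of the "bin small, async large" variant described above, built with RMM's binning_memory_resource. Constructor signatures vary between RMM versions and the pool size here is only illustrative:

```cpp
#include <rmm/mr/device/binning_memory_resource.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main() {
  // Small allocations come from a pool built on top of cudaMalloc.
  rmm::mr::cuda_memory_resource cuda_mr;
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> small_pool{&cuda_mr, 1ull << 30};

  // Everything larger falls through to the cudaMallocAsync-backed resource.
  rmm::mr::cuda_async_memory_resource async_mr;

  rmm::mr::binning_memory_resource<rmm::mr::cuda_async_memory_resource> binned_mr{&async_mr};
  binned_mr.add_bin(4194304, &small_pool);  // route allocations up to 4194304 bytes to the pool

  rmm::mr::set_current_device_resource(&binned_mr);
  // ... replay the workload ...
  return 0;
}
```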
> when it runs out of memory it first synchronizes and tries again, if it still fails it queues the allocation and the frees for those corresponding allocations until we free up 8x the last allocation failure. At which point it resumes by pushing the queued allocations to the front of the allocation list and picking back up. If the amount of allocations that gets queued grows above a certain threshold, currently set at 20GB using kMaxAllocationBytesQueue, then it just fails and stops trying.
Can you reproduce this behavior without all the extra logic? The extra logic makes it difficult to isolate where the problem is.
It would be ideal if you could provide a simple sequence of allocations/frees that succeed with cudaMalloc but fail with cudaMallocAsync.
Ok, I felt like we needed to give it a chance to free up memory before continuing. This zip has an extra file and target called rmm_simple_test: rmm_test.zip
@jrhemstad to be clear. Is this test simple enough or do you want to see this behaviour WITHOUT rmm?
If you wouldn't mind isolating it to remove RMM, that would make it easier to forward to the internal teams to take a look. It could also reveal whether the error is in RMM as opposed to cudaMallocAsync itself.
Here is a new group of tests: rmm_test.zip
The one without RMM is called async_malloc_test.cpp. It also fails using cudaMallocAsync directly.
For reference, can you let me know what CUDA/driver version you're using?
Filed internal bug# 4072461
I'll report back when we hear something from the driver team.
nvidia-smi
Fri Apr 14 17:58:50 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:27:00.0 Off | 0 |
| N/A 34C P0 66W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
mamba list | grep cuda
cuda-python 11.8.1 py310h01a121a_2 conda-forge
cudatoolkit 11.8.0 h37601d7_11 conda-forge
The toolkit in conda does not match the driver. I can try compiling outside of the conda environment if that might change something.
It looks like from the log that allocation is happening from multiple threads. To be clear, this reproduces on a single thread, right?
The log was made from multiple threads. This reproducer utilizes a single thread to perform all of the same allocations.
Thanks @felipeblazing. And to double check, synchronizing the stream and retrying the allocation still fails, right?
Yes.
@felipeblazing do you get different results with pool_memory_resource or arena_memory_resource?
The driver team took a look into this and found that by synchronizing the stream and retrying the allocation, it succeeded. Here is the modified version of the repro they tested:
while(!allocations.empty()){

  auto allocation = std::move(allocations.front());
  allocations.pop();

  if(allocation.allocate && success){

    allocation_ptr_map[allocation.allocation_ptr] = nullptr;
    bool tried = false;
retry:
    cudaError_t result;
    if(run_async){
      if(alloc_from_pool){
        result = cudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, memPool, 0);
      }else{
        result = cudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);
      }

    }else{
      result = cudaMalloc(&allocation_ptr_map[allocation.allocation_ptr], allocation.size);
    }
    if(cudaSuccess != result){
      std::cout<<"failed to allocate "<<allocation.size<<" on row "<<count<<std::endl;

      cudaStreamSynchronize(0);

      size_t size_used;
      cudaMemPoolGetAttribute ( memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used );
      std::cout<<"Pool Reservation Size before trim= "<<size_used<<std::endl;

      cudaMemPoolGetAttribute ( memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used );
      std::cout<<"Mem used is= "<<size_used<<std::endl;

      // auto error = cudaMemPoolTrimTo ( memPool,size_used );
      // cudaStreamSynchronize(0);
      // if(error != cudaSuccess){
      //   std::cout<<"failed to trim"<<std::endl;

      // }else{

      //   cudaMemPoolGetAttribute ( memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used );
      //   std::cout<<"Pool Reservation Size after trim= "<<size_used<<std::endl;
      // }

      if (tried) {
        return false;
      } else {
        tried = true;
        goto retry;
      }

    }
> @felipeblazing do you get different results with pool_memory_resource or arena_memory_resource?
Arena memory resource does worse than pool (it makes the least progress through the allocation list). Pool and cuda_async_memory_resource are not too far off from one another.
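For anyone reproducing that comparison, a hedged sketch of how the two alternative resources can be constructed (sizes illustrative; only build the one being tested):

```cpp
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main() {
  rmm::mr::cuda_memory_resource upstream;

  // Option 1: coalescing suballocator pool grown on demand from cudaMalloc'd blocks.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&upstream, 1ull << 30};
  rmm::mr::set_current_device_resource(&pool_mr);

  // Option 2 (instead of the above): arena allocator, which carves per-thread
  // arenas out of one global arena.
  // rmm::mr::arena_memory_resource<rmm::mr::cuda_memory_resource> arena_mr{&upstream};
  // rmm::mr::set_current_device_resource(&arena_mr);

  // ... replay the allocation log with whichever resource is current ...
  return 0;
}
```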
> The driver team took a look into this and found that by synchronizing the stream and retrying the allocation, it succeeded.
I will try that. One thing I should have mentioned is that I only tried retrying and syncing with cuda_async_memory_resource, not with cudaMallocAsync directly, so that might explain the difference.
~Confirmed that it works if using cudaMallocAsync / cudaMallocFromPoolAsync directly and synchronizing the stream when an allocation fails. Here's the reproducer code I used (tweaked from the snippet you provided):~
#include <memory>
#include <thread>
#include <queue>
#include <fstream>
#include <map>
#include <string>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <sstream>
struct allocation_free{
std::size_t size;
std::size_t thread;
void * allocation_ptr;
bool allocate;
};
std::queue<allocation_free> ParseAllocations(const std::string & file_path){
std::fstream file;
std::queue<allocation_free> allocations;
file.open(file_path.c_str(),std::ios::in);
if (file.is_open()){
std::string line_str;
std::getline(file, line_str);
while(std::getline(file, line_str)){
allocation_free alloc;
std::string element;
std::istringstream line(line_str);
std::getline(line, element, ',');
alloc.thread = std::stoull(element.c_str());
std::getline(line, element, ',');
std::getline(line, element, ',');
//std::cout<<element<<std::endl;
if(element == "allocate"){
alloc.allocate = true;
}else if(element == "free"){
alloc.allocate = false;
}else{
//ignoring failures
continue;
}
//std::cout<<element<<std::endl;
std::getline(line, element, ',');
alloc.allocation_ptr = (void *) std::stoull(element,nullptr,16);
std::getline(line, element, ',');
alloc.size=std::stoull(element);
allocations.push(alloc);
}
file.close();
}
return allocations;
}
std::queue<allocation_free> allocations;
cudaMemPool_t memPool = nullptr;
cudaError_t trycudaMallocFromPoolAsync(void** ptr, size_t size, cudaMemPool_t memPool, cudaStream_t stream) {
cudaError_t result;
result = cudaMallocFromPoolAsync(ptr, size, memPool, stream);
if (cudaSuccess != result) {
cudaStreamSynchronize(stream);
result = cudaMallocFromPoolAsync(ptr, size, memPool, stream);
}
return result;
}
cudaError_t trycudaMallocAsync(void** ptr, size_t size, cudaStream_t stream) {
cudaError_t result;
result = cudaMallocAsync(ptr, size, stream);
if (cudaSuccess != result) {
cudaStreamSynchronize(stream);
result = cudaMallocAsync(ptr, size, stream);
}
return result;
}
cudaError_t trycudaFreeAsync(void* devPtr, cudaStream_t stream) {
cudaError_t result;
result = cudaFreeAsync(devPtr, stream);
if (cudaSuccess != result) {
cudaStreamSynchronize(stream);
result = cudaFreeAsync(devPtr, stream);
}
return result;
}
bool RunAllocations(std::queue<allocation_free> allocations, bool run_async, bool alloc_from_pool) {
std::map<void *, void *> allocation_ptr_map;
std::size_t count = 0;
bool success = true;
size_t bytes_allocated = 0;
size_t bytes_deallocated = 0;
while(!allocations.empty()) {
auto allocation = std::move(allocations.front());
allocations.pop();
if(allocation.allocate && success) {
allocation_ptr_map[allocation.allocation_ptr] = nullptr;
cudaError_t result;
if(run_async) {
if(alloc_from_pool) {
// result = cudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size,memPool, 0);
result = trycudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size,memPool, 0);
} else {
// result = cudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);
result = trycudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);
}
} else {
result = cudaMalloc(&allocation_ptr_map[allocation.allocation_ptr], allocation.size);
}
if(cudaSuccess != result) {
std::cout<<"failed to allocate "<<allocation.size<<" on row "<<count<<std::endl;
success = false;
size_t size_used;
cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used);
std::cout<<"Pool Reservation Size before trim= "<<size_used<<std::endl;
cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used);
std::cout<<"Mem used is= "<<size_used<<std::endl;
cudaStreamSynchronize(0);
auto error = cudaMemPoolTrimTo(memPool,size_used);
cudaStreamSynchronize(0);
if(error != cudaSuccess) {
std::cout<<"failed to trim"<<std::endl;
} else {
cudaMemPoolGetAttribute (memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used);
std::cout<<"Pool Reservation Size after trim= "<<size_used<<std::endl;
}
}
bytes_allocated += allocation.size;
} else if(!allocation.allocate) {
cudaError_t result;
if(run_async) {
// result = cudaFreeAsync(allocation_ptr_map[allocation.allocation_ptr], 0);
result = trycudaFreeAsync(allocation_ptr_map[allocation.allocation_ptr], 0);
} else {
result = cudaFree(allocation_ptr_map[allocation.allocation_ptr]);
}
if(cudaSuccess != result) {
success = false;
std::cout<<"failed to free memory"<<std::endl;
}
bytes_deallocated += allocation.size;
allocation_ptr_map.erase(allocation.allocation_ptr);
}
count++;
}
auto status = cudaStreamSynchronize(0);
if(status != cudaError::cudaSuccess) {
throw std::runtime_error("Error in cuda");
}
std::cout<<"Bytes Allocated "<<bytes_allocated<<" Bytes Deallocated "<<bytes_deallocated<<std::endl;
return success;
}
int main(int argc, char* argv[]) {
std::vector<std::string> command_line_arguments;
if (argc > 1) {
command_line_arguments.assign(argv + 1, argv + argc);
}
if(argc != 2) {
std::cout<<"./binary file_path"<<std::endl;
return 1;
}
std::string file_path = command_line_arguments[0];
allocations = ParseAllocations(file_path);
// //run synchronous
// RunAllocations(allocations,false,false);
//setup pool
cudaDeviceGetDefaultMemPool(&memPool, 0);
uint64_t threshold = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &threshold);
RunAllocations(allocations,true,false);
RunAllocations(allocations,true,true);
return 0;
}
~Will work on a reproducer for RMM to see if that can be handled as well 😄~
EDIT: This actually fails and I misread the output 🤦
Also, the internal folks wanted to pass along that the TrimTo call is likely unnecessary:
> A process 'A' only needs to call TrimTo in conditions like these: a) another driver, like Vulkan or DirectX, in process 'A' is also trying to allocate memory on the same GPU and failing. Then calling TrimTo might help because CUDA would release unused memory back to the OS, which the other driver could pick up. b) another process 'B', potentially using CUDA (or just about any driver), is trying to allocate memory on the same GPU and failing. Here again calling TrimTo in process 'A' helps for the same reasons as mentioned above. Note here though that this scenario assumes the two processes know about each other and hence can coordinate. It doesn't make much sense for an isolated process - that has no idea what other processes are using the same GPU - to call TrimTo. It might just be hurting its own performance by unnecessarily calling TrimTo.
Also note that in a real use case where you're allocating on multiple threads all using PTDS, you may need to synchronize every thread's PTDS when an allocation fails in order to ensure the maximum amount of memory is available.
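One heavy-handed way to follow that advice (a sketch, assuming a full device synchronize is acceptable on failure): cudaDeviceSynchronize() waits for outstanding work on every stream, including each thread's per-thread default stream, before a final retry. The helper name below is made up for illustration and is not part of RMM or CUDA.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

cudaError_t allocWithFullSyncRetry(void** ptr, std::size_t bytes, cudaStream_t stream) {
  cudaError_t err = cudaMallocAsync(ptr, bytes, stream);
  if (err != cudaSuccess) {
    cudaStreamSynchronize(stream);   // first let this stream's pending frees land
    err = cudaMallocAsync(ptr, bytes, stream);
  }
  if (err != cudaSuccess) {
    cudaDeviceSynchronize();         // then drain every stream, including all threads' PTDS
    err = cudaMallocAsync(ptr, bytes, stream);
  }
  return err;
}

int main() {
  void* p = nullptr;
  if (allocWithFullSyncRetry(&p, 1 << 20, 0) == cudaSuccess) { cudaFreeAsync(p, 0); }
  cudaStreamSynchronize(0);
  return 0;
}
```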
Here's an updated reproducer that allows controlling whether to run using cudaMalloc, cudaMallocAsync, cudaMallocFromPoolAsync, or rmm::mr::cuda_async_view_memory_resource::allocate:
#include <memory>
#include <thread>
#include <queue>
#include <fstream>
#include <map>
#include <string>
#include <string_view>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <sstream>
#ifdef RMM_SUPPORT
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/cuda_async_view_memory_resource.hpp>
#include <rmm/cuda_stream_view.hpp>
#endif
struct allocation_free{
std::size_t size;
std::size_t thread;
void * allocation_ptr;
bool allocate;
};
std::queue<allocation_free> ParseAllocations(const std::string & file_path){
std::fstream file;
std::queue<allocation_free> allocations;
file.open(file_path.c_str(),std::ios::in);
if (file.is_open()){
std::string line_str;
std::getline(file, line_str);
while(std::getline(file, line_str)){
allocation_free alloc;
std::string element;
std::istringstream line(line_str);
std::getline(line, element, ',');
alloc.thread = std::stoull(element.c_str());
std::getline(line, element, ',');
std::getline(line, element, ',');
//std::cout<<element<<std::endl;
if(element == "allocate"){
alloc.allocate = true;
}else if(element == "free"){
alloc.allocate = false;
}else{
//ignoring failures
continue;
}
//std::cout<<element<<std::endl;
std::getline(line, element, ',');
alloc.allocation_ptr = (void *) std::stoull(element,nullptr,16);
std::getline(line, element, ',');
alloc.size=std::stoull(element);
allocations.push(alloc);
}
file.close();
}
return allocations;
}
std::queue<allocation_free> allocations;
cudaMemPool_t memPool = nullptr;
#ifdef RMM_SUPPORT
rmm::mr::cuda_async_view_memory_resource memory_resource;
#endif
cudaError_t trycudaMallocFromPoolAsync(void** ptr, size_t size, cudaMemPool_t memPool, cudaStream_t stream) {
cudaError_t result;
result = cudaMallocFromPoolAsync(ptr, size, memPool, stream);
if (cudaSuccess != result) {
cudaStreamSynchronize(stream);
result = cudaMallocFromPoolAsync(ptr, size, memPool, stream);
}
return result;
}
cudaError_t trycudaMallocAsync(void** ptr, size_t size, cudaStream_t stream) {
cudaError_t result;
result = cudaMallocAsync(ptr, size, stream);
if (cudaSuccess != result) {
cudaStreamSynchronize(stream);
result = cudaMallocAsync(ptr, size, stream);
}
return result;
}
cudaError_t trycudaFreeAsync(void* devPtr, cudaStream_t stream) {
cudaError_t result;
result = cudaFreeAsync(devPtr, stream);
if (cudaSuccess != result) {
cudaStreamSynchronize(stream);
result = cudaFreeAsync(devPtr, stream);
}
return result;
}
#ifdef RMM_SUPPORT
void* tryRMMallocate(std::size_t bytes, rmm::cuda_stream_view stream = rmm::cuda_stream_view{}) {
try {
return memory_resource.allocate(bytes, stream);
} catch(...) {
cudaStreamSynchronize(stream);
try {
return memory_resource.allocate(bytes, stream);
} catch(...) {
throw std::runtime_error("RMM Failed to allocate");
}
}
}
void tryRMMdeallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream = rmm::cuda_stream_view{}) {
try {
memory_resource.deallocate(ptr, bytes, stream);
} catch(...) {
cudaStreamSynchronize(stream);
try {
memory_resource.deallocate(ptr, bytes, stream);
} catch(...) {
throw std::runtime_error("RMM Failed to deallocate");
}
}
}
#endif
bool RunAllocations(std::queue<allocation_free> allocations, std::string_view run_type) {
std::map<void *, void *> allocation_ptr_map;
std::size_t count = 0;
bool success = true;
size_t bytes_allocated = 0;
size_t bytes_deallocated = 0;
while(!allocations.empty()) {
auto allocation = std::move(allocations.front());
allocations.pop();
if(allocation.allocate && success) {
allocation_ptr_map[allocation.allocation_ptr] = nullptr;
cudaError_t result = cudaSuccess;  // initialized so an unmatched run_type (or a build without RMM_SUPPORT) does not read an uninitialized value
if (run_type == "async") {
// result = cudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);
result = trycudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);
} else if (run_type == "async_from_pool") {
// result = cudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size,memPool, 0);
result = trycudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size,memPool, 0);
} else if (run_type == "rmm_async") {
#ifdef RMM_SUPPORT
result = cudaSuccess;
try {
allocation_ptr_map[allocation.allocation_ptr] = tryRMMallocate(allocation.size);
} catch(...) {
result = cudaErrorMemoryAllocation;
}
#endif
} else if (run_type == "cuda_malloc") {
result = cudaMalloc(&allocation_ptr_map[allocation.allocation_ptr], allocation.size);
}
if (cudaSuccess != result) {
std::cout << "FAILURE" << std::endl;
std::cout << "failed to allocate "<< allocation.size << " on row " << count << std::endl;
success = false;
if (run_type != "cuda_malloc") {
cudaStreamSynchronize(0);
size_t size_reserved;
cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_reserved);
std::cout << "Pool Reservation Size: " << size_reserved << std::endl;
size_t size_used;
cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used);
std::cout << "Mem Used Size: " << size_used << std::endl;
}
}
bytes_allocated += allocation.size;
} else if (!allocation.allocate) {
cudaError_t result = cudaSuccess;  // initialized for the same reason as in the allocation branch
if (run_type == "async" || run_type == "async_from_pool") {
// result = cudaFreeAsync(allocation_ptr_map[allocation.allocation_ptr], 0);
result = trycudaFreeAsync(allocation_ptr_map[allocation.allocation_ptr], 0);
} else if (run_type == "rmm_async") {
#ifdef RMM_SUPPORT
result = cudaSuccess;
try {
tryRMMdeallocate(allocation_ptr_map[allocation.allocation_ptr], allocation.size);
} catch(...) {
result = cudaErrorMemoryAllocation;
}
#endif
} else if (run_type == "cuda_malloc") {
result = cudaFree(allocation_ptr_map[allocation.allocation_ptr]);
}
if(cudaSuccess != result) {
std::cout << "FAILURE" << std::endl;
std::cout<< "failed to deallocate " << allocation.size << " on row " << count <<std::endl;
success = false;
if (run_type != "cuda_malloc") {
size_t size_reserved;
cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_reserved);
std::cout << "Pool Reservation Size: " << size_reserved << std::endl;
size_t size_used;
cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used);
std::cout << "Mem Used Size: " << size_used << std::endl;
}
}
bytes_deallocated += allocation.size;
allocation_ptr_map.erase(allocation.allocation_ptr);
}
count++;
}
auto status = cudaStreamSynchronize(0);
if(status != cudaError::cudaSuccess) {
throw std::runtime_error("CUDA Error");
}
std::cout << std::endl;
std::cout << "RUN RESULT: ";
if (success) {
std::cout << "SUCCESS";
} else {
std::cout << "FAILURE";
}
std::cout << std::endl;
std::cout << "Bytes Allocated " << bytes_allocated << " Bytes Deallocated " << bytes_deallocated << std::endl;
return success;
}
int main(int argc, char* argv[]) {
std::vector<std::string> command_line_arguments;
if (argc > 1) {
command_line_arguments.assign(argv + 1, argv + argc);
}
if(argc != 3) {
std::cout << "./binary file_path run_type" << std::endl;
return 1;
}
std::string file_path = command_line_arguments[0];
std::string run_type = command_line_arguments[1];
allocations = ParseAllocations(file_path);
if (run_type != "cuda_malloc") {
//setup pool
cudaDeviceGetDefaultMemPool(&memPool, 0);
uint64_t threshold = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &threshold);
if (run_type == "rmm_async") {
#ifdef RMM_SUPPORT
memory_resource = rmm::mr::cuda_async_view_memory_resource(memPool);
#endif
}
}
RunAllocations(allocations, run_type);
return 0;
}
To build with RMM support you need to add the compile time definition of RMM_SUPPORT.
On an 80GB A100 with the previously provided rmm.log file this succeeds with cudaMalloc but fails with any of the other options.
Thanks @kkraus14. I passed your updated repro along to the driver team and asked that they test on the same driver version you're using to make sure it isn't an issue of using different versions.
Thanks @jrhemstad. If there's a different driver version we should try on our side just let us know and we should be able to change it.
Driver team reported they were able to reproduce with the new example and haven't identified the root cause yet.
The updates to the driver last summer definitely fixed the issue that we were seeing. We have run into it again and are able to consistently trigger an allocation failure with run_type set to async_from_pool or async. We have about 20GB of memory free and are failing to allocate a little over 1GB.
./rmm_test rmm_log.txt async_from_pool
FAILURE
failed to allocate 1139337600 on row 2044445
Pool Reservation Size: 84523614208
Mem Used Size: 64541259920
RUN RESULT: FAILURE
Bytes Allocated 4593202635792 Bytes Deallocated 8380758376072
I have attached the file we are using to test. Any thoughts, @jrhemstad? This works just fine if I set the allocator to cuda_malloc.
nvidia-smi output
nvidia-smi
Wed Mar 27 22:06:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:27:00.0 Off | 0 |
| N/A 34C P0 66W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:2A:00.0 Off | 0 |
| N/A 30C P0 64W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:51:00.0 Off | 0 |
| N/A 30C P0 62W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:57:00.0 Off | 0 |
| N/A 32C P0 61W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:9E:00.0 Off | 0 |
| N/A 33C P0 65W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:A4:00.0 Off | 0 |
| N/A 30C P0 65W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:C7:00.0 Off | 0 |
| N/A 30C P0 64W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:CA:00.0 Off | 0 |
| N/A 32C P0 63W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
Hi @felipeblazing did you modify the replay benchmark to accept the async_from_pool option? We don't have that in the current version of the benchmark. Also, it would be nice to have the version that outputs this:
./rmm_test rmm_log.txt async_from_pool
FAILURE
failed to allocate 1139337600 on row 2044445
Pool Reservation Size: 84523614208
Mem Used Size: 64541259920
RUN RESULT: FAILURE
Bytes Allocated 4593202635792 Bytes Deallocated 8380758376072
Contribution welcome!
@harrism it's in Keith's message above. https://github.com/rapidsai/rmm/issues/1245#issuecomment-1514028072
A few things we have noticed: when we call cudaMemPoolGetAttribute and check granularity, we see it is set to 2MB. Some tests we ran seemed to indicate that allocations are getting padded to 2MB. If this is the case then it isn't just a problem of fragmentation but of properly tracking memory consumption. Is that possible?
This does not seem to be the case for small allocations: there, when we track memory consumption on the driver, it does seem to fit multiple allocations inside a cudaMemPool_t into one 2MB buffer.
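A quick way to check the padding question empirically (a sketch assuming device 0's default pool): compare cudaMemPoolAttrUsedMemCurrent before and after a small cudaMallocAsync and see whether the delta matches the requested size or a rounded-up value.

```cpp
#include <cstdint>
#include <iostream>
#include <cuda_runtime.h>

static std::uint64_t usedBytes(cudaMemPool_t pool) {
  std::uint64_t used = 0;
  cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemCurrent, &used);
  return used;
}

int main() {
  cudaMemPool_t pool{};
  cudaDeviceGetDefaultMemPool(&pool, 0);

  void* ptr = nullptr;
  std::uint64_t before = usedBytes(pool);
  cudaMallocAsync(&ptr, 1024, 0);   // 1 KiB request
  cudaStreamSynchronize(0);
  std::uint64_t after = usedBytes(pool);

  std::cout << "requested 1024 bytes, pool 'used' grew by " << (after - before) << " bytes\n";

  cudaFreeAsync(ptr, 0);
  cudaStreamSynchronize(0);
  return 0;
}
```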