
[BUG] Fragmentation with cuda_async_memory_resource

Open felipeblazing opened this issue 1 year ago • 32 comments

**Describe the bug**
rmm::mr::cuda_async_memory_resource is showing signs of pretty severe fragmentation after allocating / deallocating ~300GB on an 80GB GPU. I was able to perform the same series of allocations, all on a single thread, using the synchronous cuda_memory_resource without running out of memory. With cuda_async_memory_resource it runs out of memory while ~30GB of the pool is still available but an allocation of ~110MB fails.

cudaMemPoolGetAttribute ( memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used ); Gives a usage of 53514136688

cudaMemPoolGetAttribute ( memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used ); Shows a reservation of 84490059776

A tracking_resource_adaptor.get_allocated_bytes() shows 53514136688

I am failing to allocate 132870728 bytes
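
For reference, a minimal sketch (not part of the attached test) of how these two pool counters can be queried together; the gap between reserved and used is memory the pool is holding but could not satisfy the failing request with:

#include <cstddef>
#include <cuda_runtime.h>
#include <iostream>

// Sketch: report how much of the async pool is reserved vs. actually used.
// The difference is a rough measure of fragmentation/overhead in the pool.
void report_pool_usage(cudaMemPool_t pool, std::size_t failed_bytes) {
    std::size_t used = 0, reserved = 0;
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemCurrent, &used);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reserved);
    std::cout << "used=" << used << " reserved=" << reserved
              << " slack=" << (reserved - used)
              << " failed request=" << failed_bytes << std::endl;
}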

**Steps/Code to reproduce bug**
rmm_test.zip — there is now a simpler test found in rmm_simple_test.cpp; it just performs allocate and deallocate.

I have attached a test which is similar to rmm_replay, except that when it runs out of memory it first synchronizes and tries again. If it still fails, it queues the allocation, and the frees for the corresponding allocations, until we free up 8x the last allocation failure, at which point it resumes by pushing the queued allocations to the front of the allocation list and picking back up. If the amount of queued allocations grows above a certain threshold, currently set at 20GB via kMaxAllocationBytesQueue, it just fails and stops trying. The test consists of a single cpp file, a cmake file, and the rmm_log file which I am replaying to generate the error.

The binary is launched using

./rmm_test {path to log file} {cuda_async | bin_small_async_large}

When it fails to allocate it will output the measurements I showed above in the bug description section.

**Expected behavior**
I would expect the allocator to be able to handle fragmentation, particularly when so much of the pool is still free. I also expect that it should be able to handle fragmentation under the hood. From this blog we can read:

If a memory allocation request made using cudaMallocAsync can’t be serviced due to fragmentation of the corresponding memory pool, the CUDA driver defragments the pool by remapping unused memory in the pool to a contiguous portion of the GPU’s virtual address space. Remapping existing pool memory instead of allocating new memory from the OS also helps keep the application’s memory footprint low.

Due to this I would not expect fragmentation issues with my current use case.

**Environment details**

  • Environment location: Bare Metal
  • Method of RMM install: conda

**Additional context**
There are two variants I tried using cuda_async_memory_resource: one that bins all allocations smaller than 4194304 bytes into a pool_memory_resource<cuda_memory_resource> and sends larger allocations to a cuda_async_memory_resource, and one that just uses cuda_async_memory_resource for all allocations.

I figured removing the smaller allocations would reduce the incidence of fragmentation. This doesn't seem to have been enough. I am trying to figure out if there is anything I can do at run time to make sure the pool gets defragmented.
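
For reference, a minimal sketch of the binned variant described above, assuming RMM's binning_memory_resource API; the exact constructor arguments (e.g. whether pool_memory_resource requires an initial size) vary between RMM versions:

#include <rmm/mr/device/binning_memory_resource.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

// Sketch of the two-resource setup described above: allocations up to 4 MiB go
// to a pool_memory_resource over cudaMalloc, everything larger falls through to
// cuda_async_memory_resource. Arguments are illustrative and may need adjusting
// for the RMM version in use.
int main() {
  rmm::mr::cuda_memory_resource cuda_mr;
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> small_mr{&cuda_mr};
  rmm::mr::cuda_async_memory_resource large_mr;

  rmm::mr::binning_memory_resource<rmm::mr::cuda_async_memory_resource> binned_mr{&large_mr};
  binned_mr.add_bin(4194304, &small_mr);  // route allocations up to 4 MiB to the pool

  rmm::mr::set_current_device_resource(&binned_mr);
  return 0;
}

With this composition, small allocations never touch the async pool, so only the large allocations can fragment it.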

felipeblazing avatar Apr 13 '23 14:04 felipeblazing

when it runs out of memory it first synchronizes and tries again, if it still fails it queues the allocation and the frees for those corresponding allocations until we free up 8x the last allocation failure. At which point it resumes by pushing the queued allocations to the front of the allocation list and picking back up. If the amount of allocations that gets queued grows above a certain threshold, currently set at 20GB using kMaxAllocationBytesQueue, then it just fails and stops trying.

Can you reproduce this behavior without all the extra logic? The extra logic makes it difficult to isolate where the problem is.

It would be ideal if you could provide a simple sequence of allocations/frees that succeed with cudaMalloc but fail with cudaMallocAsync.

jrhemstad avatar Apr 13 '23 17:04 jrhemstad

OK, I felt like we needed to give it a chance to free up memory before continuing. This zip has an extra file and target called rmm_simple_test: rmm_test.zip

felipeblazing avatar Apr 13 '23 18:04 felipeblazing

@jrhemstad to be clear. Is this test simple enough or do you want to see this behaviour WITHOUT rmm?

felipeblazing avatar Apr 13 '23 19:04 felipeblazing

@jrhemstad to be clear. Is this test simple enough or do you want to see this behaviour WITHOUT rmm?

If you wouldn't mind isolating it to remove RMM that would make it easier to forward to the internal teams to take a look. That could also expose that the error may be in RMM as opposed to cudaMallocAsync itself.

jrhemstad avatar Apr 13 '23 21:04 jrhemstad

Here is a new group of tests. rmm_test.zip

The one without RMM is called async_malloc_test.cpp. This one also fails using cudaMallocAsync directly.

felipeblazing avatar Apr 13 '23 22:04 felipeblazing

For reference, can you let me know what CUDA/driver version you're using?

jrhemstad avatar Apr 14 '23 17:04 jrhemstad

Filed internal bug# 4072461

I'll report back when we hear something from the driver team.

jrhemstad avatar Apr 14 '23 17:04 jrhemstad

nvidia-smi
Fri Apr 14 17:58:50 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:27:00.0 Off |                    0 |
| N/A   34C    P0    66W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |

felipeblazing avatar Apr 14 '23 17:04 felipeblazing

mamba list | grep cuda
cuda-python               11.8.1          py310h01a121a_2    conda-forge
cudatoolkit               11.8.0              h37601d7_11    conda-forge

felipeblazing avatar Apr 14 '23 18:04 felipeblazing

The toolkit in conda does not match the driver. I can try compiling outside of the conda environment if that might change something.
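
For what it's worth, a minimal standalone check of the driver vs. runtime versions the process actually sees (values are encoded as 1000*major + 10*minor):

#include <cuda_runtime.h>
#include <iostream>

// Sketch: print the CUDA driver and runtime versions to confirm whether the
// conda toolkit / driver mismatch is what the process actually links against.
int main() {
  int driver_ver = 0, runtime_ver = 0;
  cudaDriverGetVersion(&driver_ver);
  cudaRuntimeGetVersion(&runtime_ver);
  std::cout << "driver=" << driver_ver << " runtime=" << runtime_ver << std::endl;
  return 0;
}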

felipeblazing avatar Apr 14 '23 18:04 felipeblazing

It looks like from the log that allocation is happening from multiple threads. To be clear, this reproduces on a single thread, right?

jrhemstad avatar Apr 14 '23 18:04 jrhemstad

The log was made from multiple threads. This reproducer utilizes a single thread to perform all of the same allocations.

felipeblazing avatar Apr 14 '23 19:04 felipeblazing

Thanks @felipeblazing. And to double check, synchronizing the stream and retrying the allocation still fails, right?

jrhemstad avatar Apr 17 '23 21:04 jrhemstad

Yes.

felipeblazing avatar Apr 17 '23 23:04 felipeblazing

@felipeblazing do you get different results with pool_memory_resource or arena_memory_resource?

harrism avatar Apr 17 '23 23:04 harrism

The driver team took a look into this and found that by synchronizing the stream and retrying the allocation, it succeeded. Here is the modified version of the repro they tested:

while(!allocations.empty()){

    auto allocation = std::move(allocations.front());
    allocations.pop();

    if(allocation.allocate && success){

        allocation_ptr_map[allocation.allocation_ptr] = nullptr;
        bool tried = false;
    retry:
        cudaError_t result;
        if(run_async){
            if(alloc_from_pool){
                result = cudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, memPool, 0);
            }else{
                result = cudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);
            }

        }else{
            result = cudaMalloc(&allocation_ptr_map[allocation.allocation_ptr], allocation.size);
        }
        if(cudaSuccess != result){
            std::cout<<"failed to allocate "<<allocation.size<<" on row "<<count<<std::endl;

            cudaStreamSynchronize(0);

            size_t size_used;
            cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used);
            std::cout<<"Pool Reservation Size before trim= "<<size_used<<std::endl;

            cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used);
            std::cout<<"Mem used is= "<<size_used<<std::endl;

            // auto error = cudaMemPoolTrimTo(memPool, size_used);
            // cudaStreamSynchronize(0);
            // if(error != cudaSuccess){
            //     std::cout<<"failed to trim"<<std::endl;
            // }else{
            //     cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used);
            //     std::cout<<"Pool Reservation Size after trim= "<<size_used<<std::endl;
            // }

            if (tried) {
                return false;
            } else {
                tried = true;
                goto retry;
            }

        }

jrhemstad avatar Apr 18 '23 00:04 jrhemstad

@felipeblazing do you get different results with pool_memory_resource or arena_memory_resource?

The arena memory resource does worse than the pool (it makes the least progress through the allocation list). Pool and cuda_async_memory_resource are not too far off from one another.

felipeblazing avatar Apr 18 '23 00:04 felipeblazing

The driver team took a look into this and found that by synchronizing the stream and retrying the allocation, it succeeded.


I will try that. One thing I should have mentioned is that I only tried retrying and synchronizing with async_memory_resource, not with cudaMallocAsync directly, so that might explain the difference.

felipeblazing avatar Apr 18 '23 00:04 felipeblazing

~Confirmed that it works if using cudaMallocAsync / cudaMallocFromPoolAsync directly and synchronizing the stream when an allocation fails. Here's the reproducer code I used (tweaked from the snippet you provided):~

#include <memory>
#include <thread>
#include <queue>



#include <fstream>
#include <map>
#include <string>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <sstream>




struct allocation_free{
  std::size_t size;
  std::size_t thread;
  void * allocation_ptr;
  bool allocate;
};

std::queue<allocation_free> ParseAllocations(const std::string & file_path){
  std::fstream file;
  std::queue<allocation_free> allocations;
   file.open(file_path.c_str(),std::ios::in);
   if (file.is_open()){   
      std::string line_str;
      std::getline(file, line_str);
      while(std::getline(file, line_str)){ 
        allocation_free alloc;
        std::string element;
        std::istringstream line(line_str);
        std::getline(line, element, ',');

        alloc.thread = std::stoull(element.c_str());
        std::getline(line, element, ',');
        std::getline(line, element, ',');
        //std::cout<<element<<std::endl;
        if(element == "allocate"){
          alloc.allocate = true;
        }else if(element == "free"){
          alloc.allocate = false;
        }else{
            //ignoring failures
          continue;
        }
        //std::cout<<element<<std::endl;
        std::getline(line, element, ',');
        alloc.allocation_ptr = (void *) std::stoull(element,nullptr,16);
        std::getline(line, element, ',');
        alloc.size=std::stoull(element);
        
        allocations.push(alloc);
      }

      file.close(); 
   }
   return allocations;
}

  std::queue<allocation_free> allocations;



cudaMemPool_t memPool = nullptr;

cudaError_t trycudaMallocFromPoolAsync(void** ptr, size_t size, cudaMemPool_t memPool, cudaStream_t stream) {
    cudaError_t result;
    result = cudaMallocFromPoolAsync(ptr, size, memPool, stream);
    if (cudaSuccess != result) {
        cudaStreamSynchronize(stream);
        result = cudaMallocFromPoolAsync(ptr, size, memPool, stream);
    }
    return result;
}

cudaError_t trycudaMallocAsync(void** ptr, size_t size, cudaStream_t stream) {
    cudaError_t result;
    result = cudaMallocAsync(ptr, size, stream);
    if (cudaSuccess != result) {
        cudaStreamSynchronize(stream);
        result = cudaMallocAsync(ptr, size, stream);
    }
    return result;
}

cudaError_t trycudaFreeAsync(void* devPtr, cudaStream_t stream) {
    cudaError_t result;
    result = cudaFreeAsync(devPtr, stream);
    if (cudaSuccess != result) {
        cudaStreamSynchronize(stream);
        result = cudaFreeAsync(devPtr, stream);
    }

    return result;
}


bool RunAllocations(std::queue<allocation_free> allocations, bool run_async, bool alloc_from_pool) {

  std::map<void *, void *> allocation_ptr_map;
  std::size_t count = 0;
  bool success = true;
  size_t bytes_allocated = 0;
  size_t bytes_deallocated = 0;

  while(!allocations.empty()) {
      
    auto allocation = std::move(allocations.front());
    allocations.pop();
        
    if(allocation.allocate && success) {

        allocation_ptr_map[allocation.allocation_ptr] = nullptr;
        cudaError_t result;
        if(run_async) {
            if(alloc_from_pool) {
                // result = cudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size,memPool, 0);
                result = trycudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size,memPool, 0);
            } else {
                // result = cudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);
                result = trycudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);

            }
 
        } else {
            result = cudaMalloc(&allocation_ptr_map[allocation.allocation_ptr], allocation.size);
        }

        if(cudaSuccess != result) {
            std::cout<<"failed to allocate "<<allocation.size<<" on row "<<count<<std::endl;
            success = false;

            size_t size_used;
            cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used);
            std::cout<<"Pool Reservation Size before trim= "<<size_used<<std::endl;

            cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used);
            std::cout<<"Mem used is= "<<size_used<<std::endl;
            cudaStreamSynchronize(0);
            auto error = cudaMemPoolTrimTo(memPool,size_used);
            cudaStreamSynchronize(0);

            if(error != cudaSuccess) {
                std::cout<<"failed to trim"<<std::endl;
            } else {
                cudaMemPoolGetAttribute (memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_used);
                std::cout<<"Pool Reservation Size after trim= "<<size_used<<std::endl;
            }        
        }

        bytes_allocated += allocation.size;
        
    } else if(!allocation.allocate) {

        cudaError_t result;
        if(run_async) {
            // result = cudaFreeAsync(allocation_ptr_map[allocation.allocation_ptr], 0);
            result = trycudaFreeAsync(allocation_ptr_map[allocation.allocation_ptr], 0);
        } else {
            result = cudaFree(allocation_ptr_map[allocation.allocation_ptr]);
        }
        if(cudaSuccess != result) {
            success = false;
            std::cout<<"failed to free memory"<<std::endl;
        }
        bytes_deallocated += allocation.size;
        allocation_ptr_map.erase(allocation.allocation_ptr);

    }

    count++;   
  }

    auto status = cudaStreamSynchronize(0); 
    if(status != cudaError::cudaSuccess) {
        throw std::runtime_error("Error in cuda");
    }
    std::cout<<"Bytes Allocated "<<bytes_allocated<<" Bytes Deallocated "<<bytes_deallocated<<std::endl;
    return success;

}


int main(int argc, char* argv[]) {

    std::vector<std::string> command_line_arguments;
    if (argc > 1) {
        command_line_arguments.assign(argv + 1, argv + argc);
    }
    if(argc != 2) {
        std::cout<<"./binary file_path"<<std::endl;
        return 1;
    }
    std::string file_path = command_line_arguments[0];

    allocations = ParseAllocations(file_path);

    // //run synchronous
    // RunAllocations(allocations,false,false);

    //setup pool
    cudaDeviceGetDefaultMemPool(&memPool, 0);
    uint64_t threshold = UINT64_MAX;
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &threshold);

    RunAllocations(allocations,true,false);
    RunAllocations(allocations,true,true);

    return 0;   
}

~Will work on a reproducer for RMM to see if that can be handled as well 😄~

EDIT: This actually fails and I misread the output 🤦

kkraus14 avatar Apr 18 '23 21:04 kkraus14

Also, the internal folks wanted to pass along that the TrimTo call is likely unnecessary:

A process 'A' only needs to call TrimTo in conditions like these:

  • Another driver, like Vulkan or DirectX, in process 'A' is also trying to allocate memory on the same GPU and failing. Calling TrimTo might help because CUDA would release unused memory back to the OS, which the other driver could pick up.
  • Another process 'B', potentially using CUDA (or just about any driver), is trying to allocate memory on the same GPU and failing. Here again, calling TrimTo in process 'A' helps for the same reasons as above. Note that this scenario assumes the two processes know about each other and hence can coordinate.

It doesn't make much sense for an isolated process - one that has no idea what other processes are using the same GPU - to call TrimTo. It might just be hurting its own performance by unnecessarily calling TrimTo.
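
For completeness, a minimal sketch of what such a trim would look like on the default pool; per the explanation above it only helps when some other consumer of the GPU needs the released memory:

#include <cstddef>
#include <cuda_runtime.h>

// Sketch: release unused reserved memory from the default async pool back to
// the OS, keeping at most `bytes_to_keep` reserved. Only useful when another
// driver or process needs the freed memory, as explained above.
void trim_default_pool(std::size_t bytes_to_keep) {
    cudaMemPool_t pool = nullptr;
    cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
    cudaDeviceSynchronize();               // make sure pending frees have completed
    cudaMemPoolTrimTo(pool, bytes_to_keep);
}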

jrhemstad avatar Apr 18 '23 21:04 jrhemstad

Also note that in a real use case where you're allocating on multiple threads all using PTDS, you may need to synchronize every thread's PTDS when an allocation fails in order to ensure the maximum amount of memory is available.
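
A blunt way to do that in a retry path, sketched below, is to fall back to a full device synchronization, which waits on every stream, including each thread's per-thread default stream, before retrying the allocation:

#include <cstddef>
#include <cuda_runtime.h>

// Sketch: on allocation failure with PTDS, synchronize the whole device so
// that frees queued on every thread's default stream have completed, then
// retry once. This is a blunt instrument and stalls all streams on the device.
cudaError_t alloc_with_device_sync_retry(void** ptr, std::size_t bytes, cudaStream_t stream) {
    cudaError_t result = cudaMallocAsync(ptr, bytes, stream);
    if (result != cudaSuccess) {
        cudaDeviceSynchronize();   // waits on all streams, including per-thread default streams
        result = cudaMallocAsync(ptr, bytes, stream);
    }
    return result;
}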

jrhemstad avatar Apr 18 '23 21:04 jrhemstad

Here's an updated reproducer that allows controlling whether to run using cudaMalloc, cudaMallocAsync, cudaMallocFromPoolAsync, or rmm::mr::cuda_async_view_memory_resource::allocate:

#include <memory>
#include <thread>
#include <queue>



#include <fstream>
#include <map>
#include <string>
#include <string_view>
#include <vector>
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <sstream>

#ifdef RMM_SUPPORT
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/cuda_async_view_memory_resource.hpp>
#include <rmm/cuda_stream_view.hpp>
#endif




struct allocation_free{
  std::size_t size;
  std::size_t thread;
  void * allocation_ptr;
  bool allocate;
};

std::queue<allocation_free> ParseAllocations(const std::string & file_path){
  std::fstream file;
  std::queue<allocation_free> allocations;
   file.open(file_path.c_str(),std::ios::in);
   if (file.is_open()){   
      std::string line_str;
      std::getline(file, line_str);
      while(std::getline(file, line_str)){ 
        allocation_free alloc;
        std::string element;
        std::istringstream line(line_str);
        std::getline(line, element, ',');

        alloc.thread = std::stoull(element.c_str());
        std::getline(line, element, ',');
        std::getline(line, element, ',');
        //std::cout<<element<<std::endl;
        if(element == "allocate"){
          alloc.allocate = true;
        }else if(element == "free"){
          alloc.allocate = false;
        }else{
            //ignoring failures
          continue;
        }
        //std::cout<<element<<std::endl;
        std::getline(line, element, ',');
        alloc.allocation_ptr = (void *) std::stoull(element,nullptr,16);
        std::getline(line, element, ',');
        alloc.size=std::stoull(element);
        
        allocations.push(alloc);
      }

      file.close(); 
   }
   return allocations;
}

  std::queue<allocation_free> allocations;



cudaMemPool_t memPool = nullptr;

#ifdef RMM_SUPPORT
rmm::mr::cuda_async_view_memory_resource memory_resource;
#endif

cudaError_t trycudaMallocFromPoolAsync(void** ptr, size_t size, cudaMemPool_t memPool, cudaStream_t stream) {
    cudaError_t result;
    result = cudaMallocFromPoolAsync(ptr, size, memPool, stream);
    if (cudaSuccess != result) {
        cudaStreamSynchronize(stream);
        result = cudaMallocFromPoolAsync(ptr, size, memPool, stream);
    }
    return result;
}

cudaError_t trycudaMallocAsync(void** ptr, size_t size, cudaStream_t stream) {
    cudaError_t result;
    result = cudaMallocAsync(ptr, size, stream);
    if (cudaSuccess != result) {
        cudaStreamSynchronize(stream);
        result = cudaMallocAsync(ptr, size, stream);
    }
    return result;
}

cudaError_t trycudaFreeAsync(void* devPtr, cudaStream_t stream) {
    cudaError_t result;
    result = cudaFreeAsync(devPtr, stream);
    if (cudaSuccess != result) {
        cudaStreamSynchronize(stream);
        result = cudaFreeAsync(devPtr, stream);
    }

    return result;
}

#ifdef RMM_SUPPORT
void* tryRMMallocate(std::size_t bytes, rmm::cuda_stream_view stream = rmm::cuda_stream_view{}) {
    try {
        return memory_resource.allocate(bytes, stream); 
    } catch(...) {
        cudaStreamSynchronize(stream);
        try {
            return memory_resource.allocate(bytes, stream);
        } catch(...) {
            throw std::runtime_error("RMM Failed to allocate");
        }
    }
}

void tryRMMdeallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream = rmm::cuda_stream_view{}) {
    try {
        memory_resource.deallocate(ptr, bytes, stream); 
    } catch(...) {
        cudaStreamSynchronize(stream);
        try {
            memory_resource.deallocate(ptr, bytes, stream);
        } catch(...) {
            throw std::runtime_error("RMM Failed to deallocate");
        }
    }
}
#endif


bool RunAllocations(std::queue<allocation_free> allocations, std::string_view run_type) {

  std::map<void *, void *> allocation_ptr_map;
  std::size_t count = 0;
  bool success = true;
  size_t bytes_allocated = 0;
  size_t bytes_deallocated = 0;

  while(!allocations.empty()) {
      
    auto allocation = std::move(allocations.front());
    allocations.pop();
        
    if(allocation.allocate && success) {

        allocation_ptr_map[allocation.allocation_ptr] = nullptr;
        cudaError_t result;
        if (run_type == "async") {
            // result = cudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);
            result = trycudaMallocAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size, 0);
        } else if (run_type == "async_from_pool") {
            // result = cudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size,memPool, 0);
            result = trycudaMallocFromPoolAsync(&allocation_ptr_map[allocation.allocation_ptr], allocation.size,memPool, 0);
        } else if (run_type == "rmm_async") {
            #ifdef RMM_SUPPORT
            result = cudaSuccess;
            try {
                allocation_ptr_map[allocation.allocation_ptr] = tryRMMallocate(allocation.size);
            } catch(...) {
                result = cudaErrorMemoryAllocation;
            }
            #endif
        } else if (run_type == "cuda_malloc") {
            result = cudaMalloc(&allocation_ptr_map[allocation.allocation_ptr], allocation.size);
        }

        if (cudaSuccess != result) {
            std::cout << "FAILURE" << std::endl;
            std::cout << "failed to allocate "<< allocation.size << " on row " << count << std::endl;
            success = false;

            if (run_type != "cuda_malloc") {
                cudaStreamSynchronize(0);

                size_t size_reserved;
                cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_reserved);
                std::cout << "Pool Reservation Size: " << size_reserved << std::endl;

                size_t size_used;
                cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used);
                std::cout << "Mem Used Size: " << size_used << std::endl;
            }
        }
        bytes_allocated += allocation.size;
        
    } else if (!allocation.allocate) {

        cudaError_t result;
        if (run_type == "async" || run_type == "async_from_pool") {
            // result = cudaFreeAsync(allocation_ptr_map[allocation.allocation_ptr], 0);
            result = trycudaFreeAsync(allocation_ptr_map[allocation.allocation_ptr], 0);
        } else if (run_type == "rmm_async") {
            #ifdef RMM_SUPPORT
            result = cudaSuccess;
            try {
                tryRMMdeallocate(allocation_ptr_map[allocation.allocation_ptr], allocation.size);
            } catch(...) {
                result = cudaErrorMemoryAllocation;
            }
            #endif
        } else if (run_type == "cuda_malloc") {
            result = cudaFree(allocation_ptr_map[allocation.allocation_ptr]);
        }

        if(cudaSuccess != result) {
            std::cout << "FAILURE" << std::endl;
            std::cout<< "failed to deallocate " << allocation.size << " on row " << count <<std::endl;
            success = false;

            if (run_type != "cuda_malloc") {
                size_t size_reserved;
                cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrReservedMemCurrent, &size_reserved);
                std::cout << "Pool Reservation Size: " << size_reserved << std::endl;

                size_t size_used;
                cudaMemPoolGetAttribute(memPool, cudaMemPoolAttr::cudaMemPoolAttrUsedMemCurrent, &size_used);
                std::cout << "Mem Used Size: " << size_used << std::endl;
            }
        }
        bytes_deallocated += allocation.size;
        allocation_ptr_map.erase(allocation.allocation_ptr);

    }

    count++;   
  }

    auto status = cudaStreamSynchronize(0); 
    if(status != cudaError::cudaSuccess) {
        throw std::runtime_error("CUDA Error");
    }
    std::cout << std::endl;

    std::cout << "RUN RESULT: ";
    if (success) {
        std::cout << "SUCCESS";
    } else {
        std::cout << "FAILURE";
    }
    std::cout << std::endl;

    std::cout << "Bytes Allocated " << bytes_allocated << " Bytes Deallocated " << bytes_deallocated << std::endl;
    return success;

}


int main(int argc, char* argv[]) {

    std::vector<std::string> command_line_arguments;
    if (argc > 1) {
        command_line_arguments.assign(argv + 1, argv + argc);
    }
    if(argc != 3) {
        std::cout << "./binary file_path run_type" << std::endl;
        return 1;
    }
    std::string file_path = command_line_arguments[0];
    std::string run_type = command_line_arguments[1];

    allocations = ParseAllocations(file_path);

    if (run_type != "cuda_malloc") {
        //setup pool
        cudaDeviceGetDefaultMemPool(&memPool, 0);
        uint64_t threshold = UINT64_MAX;
        cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &threshold);
        if (run_type == "rmm_async") {
            #ifdef RMM_SUPPORT
            memory_resource = rmm::mr::cuda_async_view_memory_resource(memPool);
            #endif
        }
    }

    RunAllocations(allocations, run_type);

    return 0;   
}

To build with RMM support you need to add the compile-time definition RMM_SUPPORT.

On an 80GB A100 with the previously provided rmm.log file this succeeds with cudaMalloc but fails with any of the other options.

kkraus14 avatar Apr 19 '23 01:04 kkraus14

Thanks @kkraus14. I passed your updated repro along to the driver team and asked that they test on the same driver version you're using to make sure it isn't an issue of using different versions.

jrhemstad avatar Apr 19 '23 13:04 jrhemstad

Thanks @kkraus14. I passed your updated repro along to the driver team and asked that they test on the same driver version you're using to make sure it isn't an issue of using different versions.

Thanks @jrhemstad. If there's a different driver version we should try on our side just let us know and we should be able to change it.

kkraus14 avatar Apr 19 '23 13:04 kkraus14

Driver team reported they were able to reproduce with the new example and haven't identified the root cause yet.

jrhemstad avatar Apr 19 '23 23:04 jrhemstad

The updates to the driver last summer definitely fixed the issue that we were seeing. We have now run into it again and are able to consistently trigger an allocation failure with run_type set to async_from_pool or async. We have about 20GB of memory free and are failing to allocate a little over 1GB.

./rmm_test rmm_log.txt async_from_pool
FAILURE
failed to allocate 1139337600 on row 2044445
Pool Reservation Size: 84523614208
Mem Used Size: 64541259920

RUN RESULT: FAILURE
Bytes Allocated 4593202635792 Bytes Deallocated 8380758376072

rmm_log.zip

I have attached the file we are using to test. Any thoughts @jrhemstad? This works just fine if I set the allocator to cuda_malloc.

nvidia-smi output

 nvidia-smi
Wed Mar 27 22:06:56 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:27:00.0 Off |                    0 |
| N/A   34C    P0              66W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:2A:00.0 Off |                    0 |
| N/A   30C    P0              64W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:51:00.0 Off |                    0 |
| N/A   30C    P0              62W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:57:00.0 Off |                    0 |
| N/A   32C    P0              61W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:9E:00.0 Off |                    0 |
| N/A   33C    P0              65W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:A4:00.0 Off |                    0 |
| N/A   30C    P0              65W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:C7:00.0 Off |                    0 |
| N/A   30C    P0              64W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   32C    P0              63W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |

felipeblazing avatar Mar 27 '24 21:03 felipeblazing

Hi @felipeblazing, did you modify the replay benchmark to accept the async_from_pool option? We don't have that in the current version of the benchmark. Also, it would be nice to have the version that outputs this:

./rmm_test rmm_log.txt async_from_pool
FAILURE
failed to allocate 1139337600 on row 2044445
Pool Reservation Size: 84523614208
Mem Used Size: 64541259920

RUN RESULT: FAILURE
Bytes Allocated 4593202635792 Bytes Deallocated 8380758376072

Contribution welcome!

harrism avatar Apr 24 '24 06:04 harrism

@harrism it's in Keith's message above: https://github.com/rapidsai/rmm/issues/1245#issuecomment-1514028072

felipeblazing avatar Apr 24 '24 12:04 felipeblazing

A few things we have noticed: when we call cudaMemPoolGetAttribute and check the granularity, we see it is set to 2MB. Some tests we ran seemed to indicate that allocations are getting padded to 2MB. If this is the case, then it isn't just a problem of fragmentation but also of properly tracking memory consumption. Is that possible?
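
As an illustration only, one way to bound what a 2MB-padding hypothesis could cost is to compare the bytes actually requested with the same requests rounded up to the granularity (sizes below are just examples pulled from this thread):

#include <cstddef>
#include <iostream>
#include <vector>

// Sketch: upper bound on padding overhead if every allocation were rounded up
// to the 2 MiB pool granularity. Purely illustrative; the driver may still
// pack several small allocations into one granule.
int main() {
    constexpr std::size_t granularity = 2ull << 20;  // 2 MiB
    std::vector<std::size_t> requests = {132870728, 1139337600, 4096};  // example sizes
    std::size_t requested = 0, padded = 0;
    for (auto s : requests) {
        requested += s;
        padded += ((s + granularity - 1) / granularity) * granularity;  // round up
    }
    std::cout << "requested=" << requested << " padded=" << padded
              << " overhead=" << (padded - requested) << std::endl;
    return 0;
}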

felipeblazing avatar Apr 24 '24 14:04 felipeblazing

This does not seem to be the case for small allocations, though: when we track memory consumption in the driver, it does appear to fit multiple allocations within a cudaMemPool_t into one 2MB buffer.

felipeblazing avatar Apr 24 '24 14:04 felipeblazing