
[FEA] Track held device memory per thread

Open abellina opened this issue 3 years ago • 2 comments

When invoking cuDF we may or may not hold GPU memory. The purpose of this task is to add a mechanism, which may require cuDF changes, to track what each thread has allocated, especially at OOM time.

@revans2 brought up that for each allocation we add an ID in cuDF, which lets us trace these allocations to chase down leaks. He mentioned we could also add the Thread ID to these allocations, allowing us to attribute an allocation to a particular thread later on. For example, in the OOM case we could track, in GpuSemaphore, the Thread IDs known to be active on the GPU. When OOM handling occurs and we can't spill anymore, we could then ask cuDF to provide the outstanding allocations for a particular Thread ID, and dump this information, or a summary of it, to the logs.
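To make the idea concrete, a rough sketch of what that last-resort dump could look like (everything named here, in particular getActiveThreadIds, getOutstandingAllocationsForThread, and TrackedAlloc, is hypothetical and would need new cuDF/RMM support):

// Hypothetical shape of what cuDF/RMM would need to report back per allocation.
case class TrackedAlloc(address: Long, size: Long, threadId: Long)

// Hypothetical: invoked from OOM handling once spilling can no longer help.
def dumpOutstandingAllocations(log: org.slf4j.Logger): Unit = {
  GpuSemaphore.getActiveThreadIds.foreach { tid =>            // hypothetical helper
    val allocs: Seq[TrackedAlloc] =
      Rmm.getOutstandingAllocationsForThread(tid)              // hypothetical JNI call
    val totalBytes = allocs.map(_.size).sum
    log.error(s"Thread $tid holds ${allocs.size} outstanding allocations ($totalBytes bytes)")
  }
}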

Note that I don't see this as something that needs to be particularly fast; it's meant to be a "last resort" action we take to help us debug things further. That said, care must be taken not to affect performance in the good/happy case.

abellina avatar Oct 10 '22 15:10 abellina

Thought about this issue a bit more. What I think we want is a version of the tracking_resource_adaptor that, rather than keeping a single map for all threads, keeps track of the maximum outstanding GPU footprint per thread. Also of note, the main motivation here would be to figure out whether our estimate of memory usage for some GPU code is higher than anticipated, to help us debug waste or inform heuristics that control what tasks we allow on the GPU.

This should allow us to do the following:

val maxOutstandingUsage = withMemoryTracking {
  // materialize the input data on the GPU (pseudocode)
  val gpuData = materializeDataOnGpu()
  val result = withResource(gpuData) { _.callCudfFunction }
  result.close()
  // at this point our maximum outstanding should be:
  // gpuData + max(allocated) inside of `callCudfFunction`
}

In this scenario, when we enter the withMemoryTracking block we would ask a per-thread tracking resource to start tracking this thread before we materialize data. The materialization of gpuData incurs calls to RMM to get memory, which adds to the outstanding amount. The call into the cuDF code can then involve allocations that are kept around (outstanding) for a while, allocations and frees that happen within the C++ code before the kernel, or results returned from this code. So we can keep track of how much is outstanding at any given time by adding the requested bytes to a thread-local counter on each allocation and subtracting them on each free, as sketched below.
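A minimal sketch of that per-thread bookkeeping (the names are illustrative placeholders, not an existing API; the real hooks would live in the RMM JNI callbacks):

// Illustrative only: one instance would be kept per thread (e.g. in a ThreadLocal,
// or keyed by thread id on the native side) and updated from the allocate/free
// callbacks of the tracking resource.
class ThreadMemoryTracker {
  private var outstanding: Long = 0L
  private var maxOutstanding: Long = 0L

  def onAlloc(size: Long): Unit = {
    outstanding += size
    if (outstanding > maxOutstanding) maxOutstanding = outstanding
  }

  def onFree(size: Long): Unit = outstanding -= size

  // the value withMemoryTracking would return for this thread
  def max: Long = maxOutstanding
}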

If one of our allocations fails and we handle it via a spill, it shouldn't matter: the spill code should be careful to disable the tracking for those spills (e.g. via a withoutMemoryTracking call). That way we wouldn't discount, against this thread, frees of some other thread's allocations that are irrelevant to the code being tracked.

I hope/believe this could be a pretty low overhead system. Note that I don't think this helps with tracking when an expensive kernel is loaded; as far as I understand, that can be a one-time penalty when we open the shared library. I know we have seen this with some of the regular expression kernels in the past. Pinging @jlowe for comments on this overall.

abellina avatar Oct 13 '22 16:10 abellina

I think one approach here is to have a stack of simple memory tracking info in RmmJni. When a withMemoryTracking block is entered, we push one of these objects onto the stack. The tracking_resource_adaptor could then check this stack for the current thread and, if it is non-empty, use the top tracker to track allocations for now.

When withMemoryTracking finishes, it calls a function in the RMM JNI bits to pop this element from the stack. If it was the last element, we have turned the feature off. If it was not the last element, we take the maximum outstanding we just popped and add it to the next element in the stack (the calling scope also saw that maximum outstanding), and we continue to track with the remaining tracker on the stack.

We also, unfortunately, need to keep a set of the addresses we allocated in this thread. Because of spill, the current thread may need to spill buffers it did not allocate in order to satisfy an allocation, so it seems we could ignore frees for addresses we didn't allocate while tracking. The hope is that these withMemoryTracking blocks sit as close as possible to a cuDF call. A sketch of the whole scheme follows.
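A rough sketch of that scheme under the assumptions above (all names are placeholders; nothing like this exists in RmmJni today, and the real implementation would likely live on the native side):

import scala.collection.mutable

// One frame per nested withMemoryTracking scope on the current thread.
class TrackingFrame {
  var outstanding: Long = 0L
  var maxOutstanding: Long = 0L
  // addresses allocated while tracking, so we can ignore frees of buffers this
  // thread did not allocate (e.g. buffers it spills to satisfy an allocation)
  val owned = mutable.Set[Long]()
}

// One stack per thread; an empty stack means the feature is off for that thread.
class TrackingFrameStack {
  private val frames = mutable.Stack[TrackingFrame]()

  def push(): Unit = frames.push(new TrackingFrame)

  // allocate callback of the tracking resource adaptor
  def onAlloc(address: Long, size: Long): Unit = frames.headOption.foreach { f =>
    f.owned += address
    f.outstanding += size
    if (f.outstanding > f.maxOutstanding) f.maxOutstanding = f.outstanding
  }

  // free callback: frees of addresses we did not allocate are ignored
  def onFree(address: Long, size: Long): Unit = frames.headOption.foreach { f =>
    if (f.owned.remove(address)) f.outstanding -= size
  }

  // called when withMemoryTracking finishes: returns this scope's maximum and
  // folds it into the calling scope, as described above
  def pop(): Long = {
    val finished = frames.pop()
    frames.headOption.foreach(parent => parent.maxOutstanding += finished.maxOutstanding)
    finished.maxOutstanding
  }
}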

abellina avatar Oct 17 '22 13:10 abellina

Nsys has recently added memory tracking capabilities, and we believe we can use the correlationId plus NVTX ranges to accomplish this as a post-processing step over a given NVTX range. We should investigate whether this solution does what we need.
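For reference, a minimal sketch of emitting such a range from our side using the NVTX support in the cuDF Java bindings (withResource is the spark-rapids helper used in the earlier example; the range name, gpuData, and callCudfFunction are placeholders):

import ai.rapids.cudf.{NvtxColor, NvtxRange}

// Allocations and frees recorded by nsys while this range is open can be
// attributed to it when post-processing the report.
withResource(new NvtxRange("my cudf call", NvtxColor.GREEN)) { _ =>
  withResource(gpuData) { _.callCudfFunction }
}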

abellina avatar Feb 10 '23 17:02 abellina

Hi @abellina, I am trying to profile the GPU memory usage during a query run. I used nsys to profile, but didn't find metrics like peak memory usage.

I was using NVIDIA Nsight Systems version 2022.2.1.31-5fe97ab installed on our internal cluster. I saw a post about it, https://forums.developer.nvidia.com/t/nsys-measure-memory/118394, from 2021, and it does show the memory usage part in the graph...

Update: the memory usage metrics are disabled by default; they can be turned on with an extra nsys argument, --cuda-memory-usage=true. With that we can see the memory utilization part in the graph.
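For example, something along these lines when launching the process to be profiled (only --cuda-memory-usage=true is needed for the memory graph; the trace selection shown is just one common choice):

nsys profile --trace=cuda,nvtx --cuda-memory-usage=true <command that launches the executor/app>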

wjxiz1992 avatar Oct 25 '23 09:10 wjxiz1992

I haven't used this feature; the main question I'd have is whether it works with a pool, especially the async pools. It most definitely does not work with ARENA, because that is all CPU managed, but I'd hope it shows up with cudaAsync.

abellina avatar Oct 27 '23 15:10 abellina

The profile result above is from a run with ASYNC pool.

wjxiz1992 avatar Oct 30 '23 05:10 wjxiz1992