[BUG] Maximum pool size exceeded when using ManagedMemory
Describe the Bug
We use RMM with PyTorch and a managed-memory pool to analyze a simulation trajectory, with PyTorch configured to use RMM as its memory allocator. While iterating over the frames of the trajectory, the pool size keeps increasing until it hits an out-of-memory error, specifically "out_of_memory: Maximum pool size exceeded". Our problem size is very large, involving several tensors, each of which is 13 GB or more during processing.
Error message
time offset is 2.65 , segment length is 4000
Total frames: 8001, total frames in segment: 4000, frame range: 4000 - 8000
13%|█▎ | 516/4000 [1:50:41<10:11:17, 10.53s/it]Traceback (most recent call last):
File "torch_allocator.pyx", line 15, in rmm._lib.torch_allocator.allocate
MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/blue/program/miniconda3/envs/rapids-23.10/include/rmm/mr/device/pool_memory_resource.hpp:196: Maximum pool size exceeded
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/molfind.py", line 106, in analyze_all_frames
df_formula, df_molecule = analyze_a_frame(
File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/fragment.py", line 237, in analyze_a_frame
cG, df_per_frag = find_fragments(species, positions, cell, pbc, use_cell_list=use_cell_list)
File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/fragment.py", line 188, in find_fragments
atom_index12, distances, _ = neighborlist(species, coordinates, cell=cell, pbc=pbc)
File "/blue/program/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/blue/program/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/torchani-2.3.dev211+gb682b46c-py3.10-linux-x86_64.egg/torchani/neighbors.py", line 510, in forward
atom_pairs, shift_indices = self._calculate_cell_list(coordinates_displaced.detach(), pbc)
File "/blue/program/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/torchani-2.3.dev211+gb682b46c-py3.10-linux-x86_64.egg/torchani/neighbors.py", line 595, in _calculate_cell_list
lower, between_pairs_translation_types = self._get_lower_between_image_pairs(neighbor_count,
File "/blue/program/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/torchani-2.3.dev211+gb682b46c-py3.10-linux-x86_64.egg/torchani/neighbors.py", line 912, in _get_lower_between_image_pairs
-1).repeat(1, 1, 1, padded_atom_neighbors.shape[-1])
SystemError: <method 'repeat' of 'torch._C._TensorBase' objects> returned a result with an exception set
13%|█▎ | 517/4000 [1:50:49<9:23:59, 9.72s/it] Traceback (most recent call last):
File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/molfind.py", line 106, in analyze_all_frames
df_formula, df_molecule = analyze_a_frame(
File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/fragment.py", line 223, in analyze_a_frame
torch.tensor(mdtraj_frame.xyz, device=device).float().view(1, -1, 3) * 10.0
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Steps/Code to reproduce bug
Relevant code snippet
import torch
import rmm
from rmm.allocators.torch import rmm_torch_allocator
# RMM resource logging: managed-memory pool capped at 300 GiB, with allocation logging to CSV
rmm.reinitialize(
    pool_allocator=True,
    managed_memory=True,
    maximum_pool_size=300 * 1024 * 1024 * 1024,
    logging=True,
    log_file_name="logging_resource.csv",
)
# Configure PyTorch to use RAPIDS Memory Manager (RMM) for GPU memory management.
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
A log was recorded using the logging_resource_adaptor; it is attached here: logging_resource.dev0.csv.zip (Google Drive).
CPU memory usage was recorded every 10 seconds; the beginning of the run is missing: ram_log.txt
Environment details
Environment was created using
mamba create -n rapids-23.10 -c rapidsai -c conda-forge -c nvidia cudf=23.10 cugraph=23.10 python=3.10 cuda-version=11.8
The analysis was run on an A100 GPU with 81920 MiB (80 GiB) of memory. The environment is also attached: env.txt
How much host RAM is available on this system? The CPU log peaks at around 360GiB host RAM usage. Could it be that you're running out of both host and device memory?
Hi, the host machine has 2 TB of memory; I requested 400 GB of RAM in Slurm when submitting this job.
That might have been your problem (depending on how Slurm manages these allocations). It could be that you got an allocation that didn't allow you to use more than 400 GiB of host RAM, and your job needed more than that.
When performing the analysis, are you holding on to all previously read tensors? (Is this necessary for the algorithm?) Or are you deallocating them when done?
I'm not holding on to the tensors. I iterate over the frames with a for loop; for each frame I call a function, and I assume its local tensors are deallocated automatically when it returns.
Here is the relevant code:
for mdtraj_frame in tqdm(
    md.iterload(traj_file, top=top_file, chunk=1, stride=stride, skip=local_start_frame),
    total=total_frames_in_segment,
):
    try:
        df_formula, df_molecule = analyze_a_frame(
            mdtraj_frame,
            time_offset,
            dump_interval,
            timestep,
            stride,
            frame_num,
            mol_database,
            use_cell_list=True,
        )
        # df_formula, df_molecule are pandas DataFrames, not cuDF
        # ... (exception handling and the rest of the loop body omitted from this snippet)
Hmm. Nothing obviously looks bad there, but if the GPU (and host) memory usage is always increasing, this is either because the RMM pool is so fragmented that you can never reuse a freed allocation, or because something is holding on to that memory so it is never deallocated.
I did a tiny bit of analysis on the data from your logging resource adaptor log:
from collections import Counter

data = Counter()
with open("logging_resource.dev0.csv", "r") as f:
    for line in f.readlines():
        try:
            _, _, what, ptr, amount, _ = line.split(",")
            if what == "Action":  # skip the header row
                continue
            key = (ptr, int(amount))
            if what == "allocate":
                data[key] += int(amount)
            elif what == "free":
                data[key] -= int(amount)
            else:
                raise RuntimeError
        except:
            pass

data = {k: v for k, v in data.items() if v != 0}
print(len(data))  # => 1808
print(sum(data.values()) / 1024**3)  # => ~193 GiB
So you have 1808 allocations that are not freed in that log that add up to 193GiB of data. There are many that are ~90MiB, 129MiB, and 180MiB. Do those numbers look suspicious?
Hi, I agree that it's possible the "RMM pool is so fragmented that you can never reuse a freed allocation".
'There are 1808 allocations that are not freed in that log, adding up to 193GiB of data.'
I'm not sure why this happens, as all local variables should be freed after the function call.
Here is my script, in case it helps. The entry point is the analyze_all_frames function in molfind.py.
A quick look at the attached CSV log, sorting by size, shows that all large allocations seem to have matching frees. Looking at them in order, the large allocations (~21 GiB) occur in pairs and are freed together. So that's good.
The distance between the large allocations and their frees is not large -- maybe 10-12 other allocations in between (some of them 0 B), but they are highly varied in size. I think fragmentation is definitely the suspect here.
Another interesting note: there are many allocations of ZERO bytes in this trace. Why is that? Is PyTorch doing that?
To learn more, we could try the replay benchmark, which reads these CSV files and plays back the allocations and frees (C++ only). This could allow reproduction in a debugger, but only on a machine with all this memory. So my question is: is it possible to reproduce this on a smaller machine with a smaller problem size?
One thing we could try is the binning MR. This allows assigning different MRs to different allocation-size ranges: you could have a pool (or a fixed-size MR) for very small allocations, a pool for very large allocations, and a pool for everything else. A rough sketch of that setup follows.
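For illustration, here is a minimal sketch of that kind of setup, constructing the resources explicitly through rmm.mr rather than rmm.reinitialize. The bin thresholds and pool sizes are placeholders only and would need tuning against the allocation sizes seen in the log:

import torch
import rmm
from rmm.allocators.torch import rmm_torch_allocator

# Upstream: managed (unified) memory, as in the original configuration.
upstream = rmm.mr.ManagedMemoryResource()

# Illustrative bins: one pool for small allocations (<= 1 MiB) and one for
# medium allocations (<= 256 MiB). Anything larger (e.g. the ~21 GiB tensors)
# falls through to the upstream managed-memory resource directly.
small_pool = rmm.mr.PoolMemoryResource(upstream, initial_pool_size=2**30)
medium_pool = rmm.mr.PoolMemoryResource(upstream, initial_pool_size=2**34)

binning = rmm.mr.BinningMemoryResource(upstream)
binning.add_bin(2**20, small_pool)   # allocations up to 1 MiB
binning.add_bin(2**28, medium_pool)  # allocations up to 256 MiB

rmm.mr.set_current_device_resource(binning)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)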
A quick look at the attached CSV log, sorting by size, shows that all large allocations seem to have matching frees. Looking at them in order, the large allocations (~21 GiB) occur in pairs and are freed together. So that's good.
Although the large allocations all have matching frees (good), not all allocated data is freed, AFAICT. And since this run goes through a lot of memory, that adds up. I did:
import pandas as pd
df = pd.read_csv("log.csv")
totals = df.groupby("Action")["Size"].sum()
# Action
# allocate 103316536769247
# allocate failure 18969600000
# free 103108556670956
# Name: Size, dtype: int64
(totals.loc["allocate"] - totals.loc["free"]) // 1024**3
# 193 GiB
# The same pattern holds if we look at just the first half of the allocations (or the second half)
I had a cursory read through the scripts and nothing obviously jumps out at me as likely to leak memory. If you can run for just a few steps, you might be able to spot some likely candidates using the debugging techniques described here: https://mg.pov.lt/objgraph/#memory-leak-example
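For example, a sketch using the objgraph package from that page (run after a handful of loop iterations; requires pip install objgraph):

import objgraph

# Which Python object types have grown since the last call?
objgraph.show_growth(limit=10)
# Overall most common live object types.
objgraph.show_most_common_types(limit=10)

# If e.g. Tensor counts keep climbing, pick one and see what keeps it alive
# (writing the graph to a PNG additionally requires graphviz):
# leaked = objgraph.by_type("Tensor")[-1]
# objgraph.show_backrefs([leaked], max_depth=4, filename="tensor_backrefs.png")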
Oh I missed that 193 GiB in your previous post!
Hi, this issue is not a priority for us now because we have a better CUDA implementation that we could use instead of the PyTorch implementation.
There seem to be two issues here:
- The RMM pool is so fragmented that you can never reuse a freed allocation.
- Not all allocated data is freed.
I could try to create a minimal reproducible example to make it easier for you to debug, but it will take time. If everything works correctly, should I expect the logging CSV to show that most of the allocated data is freed after the function returns in the for loop?
It would be helpful to figure out the source of the memory leaks -- but that is not an RMM bug; it would be in the caller (PyTorch, or your application). I would like to know whether the fragmentation and OOM are still a problem once the memory leaks are solved. Could it be that you need to explicitly del objects in Python to force the allocations to be freed?
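For instance, something along these lines (a sketch of the pattern only; frames and analyze_a_frame stand in for the names in the snippet above, and whether it helps depends on what actually holds the references):

import gc

def process(frames, analyze_a_frame):
    for mdtraj_frame in frames:
        df_formula, df_molecule = analyze_a_frame(mdtraj_frame)
        # ... consume the per-frame results on the host side ...
        del df_formula, df_molecule  # drop the last references explicitly
        gc.collect()  # collect any reference cycles so their buffers are freed promptly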
I suggested a strategy above to avoid the fragmentation if you would like to try it.
I'm closing this due to no response from the author. @yueyericardo if you would like to continue exploring this with us feel free to reopen this issue. Thanks!