
[BUG] Maximum pool size exceeded when using ManagedMemory

Open yueyericardo opened this issue 6 months ago • 11 comments

Describe the Bug

We use RMM with PyTorch and a managed-memory pool (managed_memory=True) to analyze a simulation trajectory. As we iterate over the frames of the trajectory, the pool size keeps growing until it hits an out-of-memory error, specifically out_of_memory: Maximum pool size exceeded. PyTorch is configured to use RMM as its memory allocator. Our problem size is very large: processing each frame involves several tensors of 13 GB or more.
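For context, here is a simplified sketch of the per-frame loop that drives the allocations. The function names are taken from the traceback below; the loop body and the mdtraj loading are paraphrased assumptions, not our actual code.

import mdtraj
import torch

device = torch.device("cuda")

def analyze_a_frame(positions):
    # Placeholder for the real analysis (neighbor list + fragment search),
    # which allocates several tensors of 13 GB or more per frame.
    ...

def analyze_all_frames(trajectory_path, topology_path):
    # Iterate frames one at a time; every frame allocates its large tensors
    # through the RMM pool via the PyTorch allocator plugin.
    for mdtraj_frame in mdtraj.iterload(trajectory_path, top=topology_path, chunk=1):
        # Convert nm -> Angstrom and move the coordinates onto the GPU
        # (this line appears verbatim in the traceback).
        positions = (
            torch.tensor(mdtraj_frame.xyz, device=device).float().view(1, -1, 3) * 10.0
        )
        analyze_a_frame(positions)
        # All per-frame tensors should be released here, yet the pool keeps
        # growing from frame to frame until the maximum is exceeded.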

Error message

time offset is 2.65 , segment length is 4000
Total frames: 8001, total frames in segment: 4000, frame range: 4000 - 8000
 13%|█▎        | 516/4000 [1:50:41<10:11:17, 10.53s/it]Traceback (most recent call last):
  File "torch_allocator.pyx", line 15, in rmm._lib.torch_allocator.allocate
MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/blue/program/miniconda3/envs/rapids-23.10/include/rmm/mr/device/pool_memory_resource.hpp:196: Maximum pool size exceeded

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/molfind.py", line 106, in analyze_all_frames
    df_formula, df_molecule = analyze_a_frame(
  File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/fragment.py", line 237, in analyze_a_frame
    cG, df_per_frag = find_fragments(species, positions, cell, pbc, use_cell_list=use_cell_list)
  File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/fragment.py", line 188, in find_fragments
    atom_index12, distances, _ = neighborlist(species, coordinates, cell=cell, pbc=pbc)
  File "/blue/program/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/blue/program/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/torchani-2.3.dev211+gb682b46c-py3.10-linux-x86_64.egg/torchani/neighbors.py", line 510, in forward
    atom_pairs, shift_indices = self._calculate_cell_list(coordinates_displaced.detach(), pbc)
  File "/blue/program/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/torchani-2.3.dev211+gb682b46c-py3.10-linux-x86_64.egg/torchani/neighbors.py", line 595, in _calculate_cell_list
    lower, between_pairs_translation_types = self._get_lower_between_image_pairs(neighbor_count,
  File "/blue/program/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/torchani-2.3.dev211+gb682b46c-py3.10-linux-x86_64.egg/torchani/neighbors.py", line 912, in _get_lower_between_image_pairs
    -1).repeat(1, 1, 1, padded_atom_neighbors.shape[-1])
SystemError: <method 'repeat' of 'torch._C._TensorBase' objects> returned a result with an exception set
 13%|█▎        | 517/4000 [1:50:49<9:23:59,  9.72s/it] Traceback (most recent call last):
  File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/molfind.py", line 106, in analyze_all_frames
    df_formula, df_molecule = analyze_a_frame(
  File "/blue/roitberg/apps/lammps-ani/cumolfind/cumolfind/fragment.py", line 223, in analyze_a_frame
    torch.tensor(mdtraj_frame.xyz, device=device).float().view(1, -1, 3) * 10.0
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Steps/Code to reproduce bug

Relevant code snippet

import torch
import rmm
from rmm.allocators.torch import rmm_torch_allocator

# Managed-memory pool with a 300 GiB maximum, with every allocation and free
# logged to CSV by the logging resource adaptor.
rmm.reinitialize(
    pool_allocator=True,
    managed_memory=True,
    maximum_pool_size=300 * 1024 * 1024 * 1024,
    logging=True,
    log_file_name="logging_resource.csv",
)

# Configure PyTorch to use RAPIDS Memory Manager (RMM) for GPU memory management.
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
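For reference, a rough equivalent of the rmm.reinitialize(...) call above, building the resource stack explicitly with rmm.mr (a sketch based on RMM's documented Python classes; argument spellings should be checked against the installed version):

import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# managed (unified) memory -> suballocating pool -> CSV logging adaptor
upstream = rmm.mr.ManagedMemoryResource()
pool = rmm.mr.PoolMemoryResource(
    upstream,
    maximum_pool_size=300 * 1024 * 1024 * 1024,  # same 300 GiB cap as above
)
logged = rmm.mr.LoggingResourceAdaptor(pool, log_file_name="logging_resource.csv")
rmm.mr.set_current_device_resource(logged)

# Route PyTorch's CUDA allocations through the same stack.
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)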

An allocation log was recorded using the logging_resource_adaptor and is attached: logging_resource.dev0.csv.zip (Google Drive).
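To check whether allocations are actually being returned to the pool, the CSV can be reduced to an outstanding-bytes curve. This is a hedged sketch: the column names (Thread, Time, Action, Pointer, Size, Stream) and the allocate/free action labels are what the logging adaptor typically writes, and should be verified against the attached file.

import pandas as pd

# Allocation log written by the logging resource adaptor.
log = pd.read_csv("logging_resource.dev0.csv")

# +size for every allocation, -size for every free; the running sum is the
# number of bytes outstanding in the pool over time.
sign = log["Action"].map({"allocate": 1, "free": -1}).fillna(0)
log["outstanding_bytes"] = (sign * log["Size"]).cumsum()

print(log["outstanding_bytes"].max() / 2**30, "GiB peak outstanding")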

CPU memory usage was also recorded every 10 seconds (the beginning of the run is missing): ram_log.txt
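The RAM log was produced by a separate watcher process; something along these lines (an illustrative sketch using psutil, not the actual logging command) samples host memory every 10 seconds:

import time
import psutil

# Append host memory usage to ram_log.txt every 10 seconds.
with open("ram_log.txt", "a") as f:
    while True:
        mem = psutil.virtual_memory()
        f.write(
            f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
            f"used={mem.used / 2**30:.1f}GiB "
            f"available={mem.available / 2**30:.1f}GiB\n"
        )
        f.flush()
        time.sleep(10)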

Environment details

The environment was created with:

mamba create -n rapids-23.10 -c rapidsai -c conda-forge -c nvidia cudf=23.10 cugraph=23.10 python=3.10 cuda-version=11.8 

The analysis was run on an A100 GPU with 81920 MiB of memory. The full environment is also attached: env.txt

yueyericardo · Jan 05 '24 22:01