PyTorch memory management issue in GPU-to-CPU transfers
🐛 Describe the bug
Introduction:
I am developing an application with PyTorch and have noticed unusual behavior in its memory management. Specifically, when I instantiate a batch of tensors or a model and transfer it from CPU to GPU, only a portion of the host RAM is released, which seems normal. However, when I transfer the tensors or model back from GPU to CPU, the full size of the batch or model is allocated on the host again, increasing RAM usage. Even after explicitly invoking the garbage collector and calling the PyTorch functions that free memory, the RAM is not released (only the GPU memory is freed).
Reproducing the Problem: Below is the code snippet that demonstrates this issue with a batch of tensors:
import gc
import torch
from memory_profiler import profile

INT_ITERATION = 3

@profile
def run_test():
    with torch.no_grad():
        batch_size = 300
        tensor_size = (1000, 1000)
        # Create a batch tensor in one line
        batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
        batch_tensors = batch_tensors.to('cuda')
        batch_tensors = batch_tensors.to('cpu').detach()
        # Print the size of the batch tensor
        del batch_tensors
        gc.collect()
        torch.cuda.empty_cache()

if __name__ == "__main__":
    print("PyTorch version:", torch.__version__)
    if torch.cuda.is_available():
        for i in range(INT_ITERATION):
            print(f'******* Iteration num: {i+1} *********** \n')
            run_test()
            input("Press Enter to continue...")
    else:
        print('CUDA is not available')
To run the code and reproduce the issue, you’ll need to have the torch and memory_profiler packages installed in your Python environment.
Output and Observations: On my Ubuntu 20.04 machine with Torch 2.2.2 and CUDA 12.1 (I encountered the same issue on a Windows PC with Torch 2.1.0 and CUDA 12.1), I observed the following behavior:
PyTorch version: 2.2.2+cu121
******* Iteration num: 1 ***********
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
10 333.8 MiB 333.8 MiB 1 @profile
11 def run_test():
12 452.1 MiB 0.0 MiB 2 with torch.no_grad():
13 333.8 MiB 0.0 MiB 1 batch_size = 300
14 333.8 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
15 # Create a batch tensor in one line
16 1481.0 MiB 1147.2 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
17
18 451.8 MiB -1029.2 MiB 1 batch_tensors = batch_tensors.to('cuda')
19 1596.2 MiB 1144.4 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
20 # Print the size of the batch tensor
21 452.1 MiB -1144.2 MiB 1 del batch_tensors
22
23 452.1 MiB 0.0 MiB 1 gc.collect()
24 452.1 MiB 0.0 MiB 1 torch.cuda.empty_cache()
******* Iteration num: 2 ***********
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
10 452.1 MiB 452.1 MiB 1 @profile
11 def run_test():
12 1596.8 MiB 0.0 MiB 2 with torch.no_grad():
13 452.1 MiB 0.0 MiB 1 batch_size = 300
14 452.1 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
15 # Create a batch tensor in one line
16 2741.0 MiB 2288.9 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
17
18 1596.8 MiB -1144.2 MiB 1 batch_tensors = batch_tensors.to('cuda')
19 2741.0 MiB 1144.2 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
20 # Print the size of the batch tensor
21 1596.8 MiB -1144.2 MiB 1 del batch_tensors
22
23 1596.8 MiB 0.0 MiB 1 gc.collect()
24 1596.8 MiB 0.0 MiB 1 torch.cuda.empty_cache()
******* Iteration num: 3 ***********
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
10 1596.8 MiB 1596.8 MiB 1 @profile
11 def run_test():
12 1635.3 MiB 0.0 MiB 2 with torch.no_grad():
13 1596.8 MiB 0.0 MiB 1 batch_size = 300
14 1596.8 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
15 # Create a batch tensor in one line
16 2779.6 MiB 1182.8 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
17
18 1635.3 MiB -1144.3 MiB 1 batch_tensors = batch_tensors.to('cuda')
19 2779.5 MiB 1144.2 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
20 # Print the size of the batch tensor
21 1635.3 MiB -1144.2 MiB 1 del batch_tensors
22
23 1635.3 MiB 0.0 MiB 1 gc.collect()
24 1635.3 MiB 0.0 MiB 1 torch.cuda.empty_cache()
Interestingly, after 3 to 4 iterations the memory usage stabilizes and there is no further increase. Still, this initial behavior is particularly annoying, because the first time I load a model or run an operation it needs less memory than every subsequent iteration does.
Questions:
- Is this behavior expected in PyTorch, or could it be a bug?
- If this behavior is expected, is there a way to release all the Torch memory on the CPU without closing the thread?
Any insights or suggestions would be greatly appreciated. Thank you!
Versions
Collecting environment information...
PyTorch version: 2.2.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce GTX 1080

Nvidia driver version: 535.161.08
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.1
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz
Stepping: 4
CPU MHz: 1200.027
CPU max MHz: 4500,0000
CPU min MHz: 1200,0000
BogoMIPS: 7200.00
Virtualization: VT-x
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 8 MiB
L3 cache: 11 MiB
NUMA node0 CPU(s): 0-15
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.2
[pip3] open-clip-torch==2.24.0
[pip3] torch==2.2.2
[pip3] torch-tb-profiler==0.4.3
[pip3] torchaudio==2.2.2
[pip3] torchvision==0.17.2
[pip3] triton==2.2.0
[conda] numpy 1.24.2 pypi_0 pypi
[conda] open-clip-torch 2.24.0 pypi_0 pypi
[conda] torch 2.2.2 pypi_0 pypi
[conda] torch-tb-profiler 0.4.3 pypi_0 pypi
[conda] torchaudio 2.2.2 pypi_0 pypi
[conda] torchvision 0.17.2 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi
I also ran into this problem. Does anyone have a solution, or an idea of how to solve it?
cc @albanD
Quick observation:
- The memory is always stable between the with no_grad and last line of the function.
- I'm not sure how to interpret one line showing 452.1 MiB and the next 333.8 MiB while the increment is 0?
Also, on CPU, the libc malloc is used for memory allocation. Depending on which memory measure you're looking at, there are quite a few cases where the allocator will keep memory around and not give it back to the OS. This would explain the first few iterations allocating more until malloc starts to properly re-use the memory it has cached.
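One thing worth trying on glibc systems is asking the allocator to return its cached free pages to the OS explicitly. This is not a PyTorch API; it is a sketch that calls glibc's malloc_trim through ctypes and silently does nothing on platforms where that symbol is unavailable:

```python
# Sketch: ask glibc to hand cached free memory back to the OS.
# Assumes a glibc-based Linux system; on other libcs this is a no-op.
import ctypes
import ctypes.util

def trim_malloc_cache() -> bool:
    """Call glibc's malloc_trim(0); return True if memory was released."""
    libc_path = ctypes.util.find_library("c") or "libc.so.6"
    try:
        libc = ctypes.CDLL(libc_path)
    except OSError:
        return False  # no loadable libc found
    if not hasattr(libc, "malloc_trim"):  # e.g. musl or macOS libc
        return False
    # malloc_trim(0) returns 1 if some memory was returned to the system.
    return bool(libc.malloc_trim(0))

released = trim_malloc_cache()
print("malloc_trim released memory:", released)
```

Calling this after `gc.collect()` may (or may not) shrink the resident set, depending on how fragmented the heap is, since glibc can only trim whole free regions back to the OS.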
Interestingly, after 3 to 4 iterations, the memory usage stabilizes
Yeah, this sounds like normal allocator behavior. You can look into https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html, but I think this is probably:
- Outside the scope of PyTorch
- Unrelated to the GPU-CPU transfers.
You can also try some other allocator like https://jemalloc.net/, but you'll probably see similar behavior.
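For quick experiments, both routes are usually wired up through environment variables at process launch; the values and the jemalloc path below are illustrative, not recommendations:

```shell
# Tune glibc malloc caching (illustrative values; see the glibc tunables manual)
export MALLOC_ARENA_MAX=2            # fewer arenas -> less memory retained per thread
export MALLOC_TRIM_THRESHOLD_=65536  # return cached memory to the OS sooner
python test.py

# Or swap in jemalloc without rebuilding (path depends on your installation)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python test.py
```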
Thanks for the reply. I'll try to explain a bit better.
- The memory is always stable between the with no_grad and last line of the function.
Indeed, the memory remains stable between the with torch.no_grad() context manager and the final line of the function. However, this stability does not necessarily correlate with the amount of memory available at the start of the function.
To illustrate the memory usage throughout the function, I will add an additional output with a print statement to highlight the differences:
PyTorch version: 2.2.2+cu121
******* Iteration num: 1 ***********
Start "with torch_no_grad()"
End "with torch_no_grad()"
End run_test()
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
8 332.1 MiB 332.1 MiB 1 @profile
9 def run_test():
10
11 332.1 MiB 0.0 MiB 1 print('Start "with torch_no_grad()"')
12
13 449.8 MiB 0.0 MiB 2 with torch.no_grad():
14 332.1 MiB 0.0 MiB 1 batch_size = 300
15 332.1 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
16 # Create a batch tensor in one line
17 1479.1 MiB 1147.0 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
18
19 449.7 MiB -1029.5 MiB 1 batch_tensors = batch_tensors.to('cuda')
20 1594.1 MiB 1144.4 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
21 # Print the size of the batch tensor
22 449.8 MiB -1144.3 MiB 1 del batch_tensors
23
24 449.8 MiB 0.0 MiB 1 print('End "with torch_no_grad()"')
25
26 449.8 MiB 0.0 MiB 1 gc.collect()
27 449.8 MiB 0.0 MiB 1 torch.cuda.empty_cache()
28
29 449.8 MiB 0.0 MiB 1 print('End run_test()')
******* Iteration num: 2 ***********
Start "with torch_no_grad()"
End "with torch_no_grad()"
End run_test()
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
8 449.8 MiB 449.8 MiB 1 @profile
9 def run_test():
10
11 449.8 MiB 0.0 MiB 1 print('Start "with torch_no_grad()"')
12
13 1594.5 MiB 0.0 MiB 2 with torch.no_grad():
14 449.8 MiB 0.0 MiB 1 batch_size = 300
15 449.8 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
16 # Create a batch tensor in one line
17 2738.7 MiB 2288.9 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
18
19 1594.5 MiB -1144.2 MiB 1 batch_tensors = batch_tensors.to('cuda')
20 2738.7 MiB 1144.2 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
21 # Print the size of the batch tensor
22 1594.5 MiB -1144.2 MiB 1 del batch_tensors
23
24 1594.5 MiB 0.0 MiB 1 print('End "with torch_no_grad()"')
25
26 1594.5 MiB 0.0 MiB 1 gc.collect()
27 1594.5 MiB 0.0 MiB 1 torch.cuda.empty_cache()
28
29 1594.5 MiB 0.0 MiB 1 print('End run_test()')
- I'm not sure how to interpret one line showing 452.1 MiB and the next 333.8 MiB while the increment is 0?
The line showing 452.1 MiB likely represents the memory allocated at the end of the with torch.no_grad() block.
Also on cpu, the libc malloc is used for memory allocation. Depending on which memory measure you're looking at, there are quite a few cases where the allocator will keep memory around and not give it back to the OS. This would explain the first few iterations allocating more until malloc starts to properly re-use memory it has cached.
Is there a way to prevent this behavior, especially considering that in the second iteration, memory usage increases excessively? Would you recommend exploring alternative memory allocator implementations as a potential solution?
Is there a way to prevent this behavior, especially considering that in the second iteration, memory usage increases excessively?
You can try using another malloc implementation like jemalloc but they will most likely have similar behavior. In particular, as long as there is no memory pressure, it is usually faster to keep around memory as you can serve it faster.
In other words, unless you actually see OOMs, it might just be keeping memory around to speed things up.
Thanks for the answer. I'll experiment with different malloc implementations to see if the behavior persists. My main concern is that this issue also occurs when loading and transferring models from CPU to GPU. I'm encountering out-of-memory errors. It seems strange that the model loads successfully the first few times, but then requires significantly more memory on subsequent attempts.
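To tell allocator caching apart from a genuine leak when chasing OOMs like this, it can help to watch the process RSS across repeated allocate/free cycles. A minimal Linux-only sketch (it reads VmRSS from /proc/self/status and uses plain bytearrays as a stand-in for model tensors; sizes are illustrative):

```python
# Sketch: track resident set size (RSS) around an allocate/free cycle.
# Assumes Linux (/proc/self/status); returns 0.0 elsewhere.
import gc

def rss_mib() -> float:
    """Current resident set size in MiB, read from /proc/self/status."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024  # kB -> MiB
    except OSError:
        pass
    return 0.0

before = rss_mib()
buf = [bytearray(10 * 1024 * 1024) for _ in range(20)]  # ~200 MiB, zero-filled
peak = rss_mib()
del buf
gc.collect()
after = rss_mib()
print(f"before={before:.1f} MiB, peak={peak:.1f} MiB, after={after:.1f} MiB")
# If "after" settles above "before" but stops growing across repeated runs,
# the gap is allocator cache; if it keeps growing run after run, suspect a leak.
```

Running the cycle in a loop distinguishes the two cases: cached memory plateaus (as in the profiler traces above), while a real leak grows without bound.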