PyTorch memory management issue in GPU-to-CPU transfers
🐛 Describe the bug
Introduction:
I am developing an application with PyTorch and have noticed unusual behavior in its memory management. Specifically, when I instantiate a batch of tensors or a model and transfer it from CPU to GPU, only a portion of the host RAM is released, which seems normal. However, when I transfer the tensors or model back from GPU to CPU, the full size of the batch or model is allocated on the host again, increasing RAM usage. Even after explicitly invoking the garbage collector and calling the PyTorch functions that free memory, the RAM is not released (only the GPU memory is freed).
Reproducing the Problem: Below is the code snippet that demonstrates this issue with a batch of tensors:
import gc
import torch
from memory_profiler import profile

INT_ITERATION = 3

@profile
def run_test():
    with torch.no_grad():
        batch_size = 300
        tensor_size = (1000, 1000)
        # Create a batch tensor in one line
        batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
        batch_tensors = batch_tensors.to('cuda')
        batch_tensors = batch_tensors.to('cpu').detach()
        # Print the size of the batch tensor
        del batch_tensors
        gc.collect()
        torch.cuda.empty_cache()

if __name__ == "__main__":
    print("PyTorch version:", torch.__version__)
    if torch.cuda.is_available():
        for i in range(INT_ITERATION):
            print(f'******* Iteration num: {i+1} *********** \n')
            run_test()
            input("Press Enter to continue...")
    else:
        print('CUDA is not available')
To run the code and reproduce the issue, you’ll need to have the torch and memory_profiler packages installed in your Python environment.
Output and Observations: On my Ubuntu 20.04 machine with Torch 2.2.2 and CUDA 12.1 (I encountered the same issue on a Windows PC with Torch 2.1.0 and CUDA 12.1), I observed the following behavior:
PyTorch version: 2.2.2+cu121
******* Iteration num: 1 ***********
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
10 333.8 MiB 333.8 MiB 1 @profile
11 def run_test():
12 452.1 MiB 0.0 MiB 2 with torch.no_grad():
13 333.8 MiB 0.0 MiB 1 batch_size = 300
14 333.8 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
15 # Create a batch tensor in one line
16 1481.0 MiB 1147.2 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
17
18 451.8 MiB -1029.2 MiB 1 batch_tensors = batch_tensors.to('cuda')
19 1596.2 MiB 1144.4 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
20 # Print the size of the batch tensor
21 452.1 MiB -1144.2 MiB 1 del batch_tensors
22
23 452.1 MiB 0.0 MiB 1 gc.collect()
24 452.1 MiB 0.0 MiB 1 torch.cuda.empty_cache()
******* Iteration num: 2 ***********
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
10 452.1 MiB 452.1 MiB 1 @profile
11 def run_test():
12 1596.8 MiB 0.0 MiB 2 with torch.no_grad():
13 452.1 MiB 0.0 MiB 1 batch_size = 300
14 452.1 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
15 # Create a batch tensor in one line
16 2741.0 MiB 2288.9 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
17
18 1596.8 MiB -1144.2 MiB 1 batch_tensors = batch_tensors.to('cuda')
19 2741.0 MiB 1144.2 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
20 # Print the size of the batch tensor
21 1596.8 MiB -1144.2 MiB 1 del batch_tensors
22
23 1596.8 MiB 0.0 MiB 1 gc.collect()
24 1596.8 MiB 0.0 MiB 1 torch.cuda.empty_cache()
******* Iteration num: 3 ***********
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
10 1596.8 MiB 1596.8 MiB 1 @profile
11 def run_test():
12 1635.3 MiB 0.0 MiB 2 with torch.no_grad():
13 1596.8 MiB 0.0 MiB 1 batch_size = 300
14 1596.8 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
15 # Create a batch tensor in one line
16 2779.6 MiB 1182.8 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
17
18 1635.3 MiB -1144.3 MiB 1 batch_tensors = batch_tensors.to('cuda')
19 2779.5 MiB 1144.2 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
20 # Print the size of the batch tensor
21 1635.3 MiB -1144.2 MiB 1 del batch_tensors
22
23 1635.3 MiB 0.0 MiB 1 gc.collect()
24 1635.3 MiB 0.0 MiB 1 torch.cuda.empty_cache()
Interestingly, after 3 to 4 iterations the memory usage stabilizes and there is no further increase. Still, this initial behavior is particularly annoying, because the first time I load a model or run an operation it needs less memory than every subsequent iteration does.
Questions:
- Is this behavior expected in PyTorch, or could it be a bug?
- If this behavior is expected, is there a way to release all the Torch memory on the CPU without closing the thread?
Any insights or suggestions would be greatly appreciated. Thank you!
Versions
Collecting environment information...
PyTorch version: 2.2.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce GTX 1080

Nvidia driver version: 535.161.08
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.1
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz
Stepping: 4
CPU MHz: 1200.027
CPU max MHz: 4500,0000
CPU min MHz: 1200,0000
BogoMIPS: 7200.00
Virtualization: VT-x
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 8 MiB
L3 cache: 11 MiB
NUMA node0 CPU(s): 0-15
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.2
[pip3] open-clip-torch==2.24.0
[pip3] torch==2.2.2
[pip3] torch-tb-profiler==0.4.3
[pip3] torchaudio==2.2.2
[pip3] torchvision==0.17.2
[pip3] triton==2.2.0
[conda] numpy 1.24.2 pypi_0 pypi
[conda] open-clip-torch 2.24.0 pypi_0 pypi
[conda] torch 2.2.2 pypi_0 pypi
[conda] torch-tb-profiler 0.4.3 pypi_0 pypi
[conda] torchaudio 2.2.2 pypi_0 pypi
[conda] torchvision 0.17.2 pypi_0 pypi
[conda] triton 2.2.0 pypi_0 pypi
I also ran into this problem. Does anyone have a solution, or an idea of how to solve it?
cc @albanD
Quick observation:
- The memory is always stable between the with no_grad and last line of the function.
- I'm not sure how to interpret one line showing 452.1 MiB and the next 333.8 MiB while the increment is 0?
Also, on CPU, the libc malloc is used for memory allocation. Depending on which memory measure you're looking at, there are quite a few cases where the allocator will keep memory around and not give it back to the OS. This would explain the first few iterations allocating more until malloc starts to properly re-use the memory it has cached.
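One thing worth trying on glibc systems is asking the allocator to return its cached free pages to the OS explicitly. This is not a PyTorch API; it is a sketch that calls glibc's malloc_trim through ctypes and silently does nothing on platforms where that symbol is unavailable:

```python
# Sketch: ask glibc to hand cached free memory back to the OS.
# Assumes a glibc-based Linux system; on other libcs this is a no-op.
import ctypes
import ctypes.util

def trim_malloc_cache() -> bool:
    """Call glibc's malloc_trim(0); return True if memory was released."""
    libc_path = ctypes.util.find_library("c") or "libc.so.6"
    try:
        libc = ctypes.CDLL(libc_path)
    except OSError:
        return False  # no loadable libc found
    if not hasattr(libc, "malloc_trim"):  # e.g. musl or macOS libc
        return False
    # malloc_trim(0) returns 1 if some memory was returned to the system.
    return bool(libc.malloc_trim(0))

released = trim_malloc_cache()
print("malloc_trim released memory:", released)
```

Calling this after `gc.collect()` may (or may not) shrink the resident set, depending on how fragmented the heap is, since glibc can only trim whole free regions back to the OS.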
Interestingly, after 3 to 4 iterations, the memory usage stabilizes
Yeah, this sounds like normal allocator behavior. You can look into https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html, but I think this is probably:
- Outside the scope of PyTorch
- Unrelated to the GPU-CPU transfers.
You can also try some other allocator like https://jemalloc.net/, but you'll probably see similar behavior.
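For quick experiments, both routes are usually wired up through environment variables at process launch; the values and the jemalloc path below are illustrative, not recommendations:

```shell
# Tune glibc malloc caching (illustrative values; see the glibc tunables manual)
export MALLOC_ARENA_MAX=2            # fewer arenas -> less memory retained per thread
export MALLOC_TRIM_THRESHOLD_=65536  # return cached memory to the OS sooner
python test.py

# Or swap in jemalloc without rebuilding (path depends on your installation)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python test.py
```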
Thanks for the reply. I'll try to explain a bit better.
- The memory is always stable between the with no_grad and last line of the function.
Indeed, the memory remains stable between the with torch.no_grad() context manager and the final line of the function. However, this stability does not necessarily correlate with the amount of memory available at the start of the function.
To illustrate the memory usage throughout the function, I will add an additional output with a print statement to highlight the differences:
PyTorch version: 2.2.2+cu121
******* Iteration num: 1 ***********
Start "with torch_no_grad()"
End "with torch_no_grad()"
End run_test()
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
8 332.1 MiB 332.1 MiB 1 @profile
9 def run_test():
10
11 332.1 MiB 0.0 MiB 1 print('Start "with torch_no_grad()"')
12
13 449.8 MiB 0.0 MiB 2 with torch.no_grad():
14 332.1 MiB 0.0 MiB 1 batch_size = 300
15 332.1 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
16 # Create a batch tensor in one line
17 1479.1 MiB 1147.0 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
18
19 449.7 MiB -1029.5 MiB 1 batch_tensors = batch_tensors.to('cuda')
20 1594.1 MiB 1144.4 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
21 # Print the size of the batch tensor
22 449.8 MiB -1144.3 MiB 1 del batch_tensors
23
24 449.8 MiB 0.0 MiB 1 print('End "with torch_no_grad()"')
25
26 449.8 MiB 0.0 MiB 1 gc.collect()
27 449.8 MiB 0.0 MiB 1 torch.cuda.empty_cache()
28
29 449.8 MiB 0.0 MiB 1 print('End run_test()')
******* Iteration num: 2 ***********
Start "with torch_no_grad()"
End "with torch_no_grad()"
End run_test()
Filename: test.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
8 449.8 MiB 449.8 MiB 1 @profile
9 def run_test():
10
11 449.8 MiB 0.0 MiB 1 print('Start "with torch_no_grad()"')
12
13 1594.5 MiB 0.0 MiB 2 with torch.no_grad():
14 449.8 MiB 0.0 MiB 1 batch_size = 300
15 449.8 MiB 0.0 MiB 1 tensor_size = (1000, 1000)
16 # Create a batch tensor in one line
17 2738.7 MiB 2288.9 MiB 303 batch_tensors = torch.stack([torch.randn(tensor_size) for _ in range(batch_size)]).to('cpu')
18
19 1594.5 MiB -1144.2 MiB 1 batch_tensors = batch_tensors.to('cuda')
20 2738.7 MiB 1144.2 MiB 1 batch_tensors = batch_tensors.to('cpu').detach()
21 # Print the size of the batch tensor
22 1594.5 MiB -1144.2 MiB 1 del batch_tensors
23
24 1594.5 MiB 0.0 MiB 1 print('End "with torch_no_grad()"')
25
26 1594.5 MiB 0.0 MiB 1 gc.collect()
27 1594.5 MiB 0.0 MiB 1 torch.cuda.empty_cache()
28
29 1594.5 MiB 0.0 MiB 1 print('End run_test()')
- I'm not sure how to interpret one line showing 452.1 MiB and the next 333.8 MiB while the increment is 0?
The line showing 452.1 MiB likely represents the memory allocated at the end of the with torch.no_grad() block.
Also on cpu, the libc malloc is used for memory allocation. Depending on which memory measure you're looking at, there are quite a few cases where the allocator will keep memory around and not give it back to the OS. This would explain the first few iterations allocating more until malloc starts to properly re-use memory it has cached.
Is there a way to prevent this behavior, especially considering that in the second iteration, memory usage increases excessively? Would you recommend exploring alternative memory allocator implementations as a potential solution?
Is there a way to prevent this behavior, especially considering that in the second iteration, memory usage increases excessively?
You can try using another malloc implementation like jemalloc but they will most likely have similar behavior. In particular, as long as there is no memory pressure, it is usually faster to keep around memory as you can serve it faster.
In other words, unless you actually see OOMs, it might just be keeping memory around to speed things up.
Thanks for the answer. I'll experiment with different malloc implementations to see if the behavior persists. My main concern is that this issue also occurs when loading and transferring models from CPU to GPU. I'm encountering out-of-memory errors. It seems strange that the model loads successfully the first few times, but then requires significantly more memory on subsequent attempts.
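To tell allocator caching apart from a genuine leak when chasing OOMs like this, it can help to watch the process RSS across repeated allocate/free cycles. A minimal Linux-only sketch (it reads VmRSS from /proc/self/status and uses plain bytearrays as a stand-in for model tensors; sizes are illustrative):

```python
# Sketch: track resident set size (RSS) around an allocate/free cycle.
# Assumes Linux (/proc/self/status); returns 0.0 elsewhere.
import gc

def rss_mib() -> float:
    """Current resident set size in MiB, read from /proc/self/status."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024  # kB -> MiB
    except OSError:
        pass
    return 0.0

before = rss_mib()
buf = [bytearray(10 * 1024 * 1024) for _ in range(20)]  # ~200 MiB, zero-filled
peak = rss_mib()
del buf
gc.collect()
after = rss_mib()
print(f"before={before:.1f} MiB, peak={peak:.1f} MiB, after={after:.1f} MiB")
# If "after" settles above "before" but stops growing across repeated runs,
# the gap is allocator cache; if it keeps growing run after run, suspect a leak.
```

Running the cycle in a loop distinguishes the two cases: cached memory plateaus (as in the profiler traces above), while a real leak grows without bound.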