pytorch-lightning
Memory and CPU leak when running Lightning Apps for a long time
🐛 Bug
Looking at memory usage over a long period of time (a few weeks), there appears to be a pattern of a memory leak and growing CPU usage.
Example of two different apps:
To Reproduce
Keep a Lightning App running on the cloud for a long period of time and watch memory usage.
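One way to watch this without waiting on the cloud dashboards is to sample the app process's memory and CPU periodically. A minimal sketch, assuming psutil is installed (the sampling interval and output format are arbitrary choices, not part of the original report):

import time
import psutil

proc = psutil.Process()   # the process hosting the app
proc.cpu_percent(None)    # prime the counter; the first call always returns 0.0

while True:
    rss_mb = proc.memory_info().rss / 1e6
    cpu = proc.cpu_percent(None)  # average CPU % since the previous call
    print(f"rss={rss_mb:.1f} MB cpu={cpu:.1f}%")
    time.sleep(60)                # sample once per minute

On a leaking app, the rss value shows a steady upward trend over hours or days instead of plateauing.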
Expected behavior
Memory and CPU usage should remain stable and not grow; otherwise the app will run out of memory and crash at some point.
Environment
- PyTorch Lightning Version (e.g., 1.5.0):
- PyTorch Version (e.g., 1.10):
- Python version (e.g., 3.9):
- OS (e.g., Linux):
- CUDA/cuDNN version:
- GPU models and configuration:
- How you installed PyTorch (conda, pip, source):
- If compiling from source, the output of torch.__config__.show():
- Any other relevant information:
Additional context
cc @tchaton @rohitgr7
I also faced this bug. To reproduce:
import gc
import pytorch_lightning as pl
gc.enable()
gc.set_debug(gc.DEBUG_LEAK)  # keep unreachable objects in gc.garbage instead of freeing them
gc.collect()
assert not gc.garbage, f"{len(gc.garbage)} objects found."
This raises an AssertionError: 22 objects are left in gc.garbage for version 1.5.4 and 496 objects for version 1.7.6. Expected 0 (zero).
FYI, the same thing happens if I delete pl with a del pl statement on line 3 (right after the import).
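To see which classes dominate the leak, the snippet above can be extended to group the objects left in gc.garbage by type. A minimal sketch using gc.DEBUG_SAVEALL (the part of DEBUG_LEAK that routes unreachable objects into gc.garbage); the top-10 cutoff is an arbitrary choice, not something from the original report:

import gc
from collections import Counter

import pytorch_lightning as pl

gc.set_debug(gc.DEBUG_SAVEALL)  # keep every unreachable object in gc.garbage
gc.collect()

# Count the leaked objects by type to see which classes dominate.
by_type = Counter(type(obj).__name__ for obj in gc.garbage)
for name, count in by_type.most_common(10):
    print(f"{count:5d}  {name}")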