Memory and CPU leak in running lightning Apps for long time

Open manskx opened this issue 2 years ago • 5 comments

🐛 Bug

Looking at memory usage over a long period (a few weeks), there appears to be a memory leak, along with steadily growing CPU usage.

Example of two different apps:

[two images, one per app]

To Reproduce

Keep a Lightning app running in the cloud for a long period of time and watch its memory usage.
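
For anyone who wants to watch for growth from inside a Python process, here is a minimal sketch using only the standard library. It is not how the graphs in this issue were produced. The helper name `sample_traced_memory` and its `work` callback are made up for illustration, and `tracemalloc` only sees allocations made through Python's allocators, so the numbers will not match the process RSS reported by a cloud dashboard.

```python
import time
import tracemalloc


def sample_traced_memory(n_samples, interval_s, work=None):
    """Return a list of (elapsed_seconds, current_traced_bytes) samples.

    `work` is an optional zero-argument callable run once per sample.
    It is a hypothetical stand-in for the app's activity, not a Lightning API.
    """
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    samples = []
    t0 = time.monotonic()
    try:
        for _ in range(n_samples):
            if work is not None:
                work()
            # get_traced_memory() returns (current, peak) in bytes
            current, _peak = tracemalloc.get_traced_memory()
            samples.append((time.monotonic() - t0, current))
            time.sleep(interval_s)
    finally:
        # Only stop tracing if this function started it
        if started_here:
            tracemalloc.stop()
    return samples


if __name__ == "__main__":
    leaked = []  # grows on purpose, to mimic a leak
    grow = lambda: leaked.append(bytearray(100_000))
    for elapsed, current in sample_traced_memory(5, 0.1, grow):
        print(f"{elapsed:6.2f}s  {current / 1e6:8.2f} MB")
```

If the traced size keeps climbing while the workload stays the same, that points at Python-level objects being retained. A flat trace alongside growing RSS would suggest the growth is elsewhere (native code, fragmentation), which this sketch cannot see.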

Expected behavior

Memory and CPU usage should stay stable and not grow; otherwise the app will eventually run out of memory and crash.

Environment

  • PyTorch Lightning Version (e.g., 1.5.0):
  • PyTorch Version (e.g., 1.10):
  • Python version (e.g., 3.9):
  • OS (e.g., Linux):
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

cc @tchaton @rohitgr7

manskx, Jul 21 '22 09:07

I also ran into this bug. To reproduce:

import gc
import pytorch_lightning as pl

gc.enable()
gc.set_debug(gc.DEBUG_LEAK)  # report collected objects and keep them in gc.garbage
gc.collect()  # force a full collection
assert not gc.garbage, f"{len(gc.garbage)} objects found."

This raises an AssertionError: gc.garbage holds 22 objects with version 1.5.4 and 496 objects with version 1.7.6. Expected: 0 (zero).
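
To see what kind of objects end up in gc.garbage, a breakdown by type can help. Below is a minimal sketch, not part of Lightning: the helper names `garbage_after` and `make_cycle` are made up for illustration. One caveat when reading the counts: per the Python `gc` docs, `gc.DEBUG_LEAK` includes `gc.DEBUG_SAVEALL`, which makes the collector append every unreachable object it finds to gc.garbage instead of freeing it. So a non-empty list under this flag shows that reference cycles were created, which is not by itself proof that memory stays allocated in a normal run.

```python
import gc
from collections import Counter


def garbage_after(action):
    """Run `action`, then count the objects the collector found, by type name.

    Hypothetical helper: it temporarily sets gc.DEBUG_SAVEALL so that
    unreachable objects are saved in gc.garbage rather than freed.
    """
    old_flags = gc.get_debug()
    gc.collect()        # flush garbage that predates `action`
    gc.garbage.clear()
    gc.set_debug(gc.DEBUG_SAVEALL)
    try:
        action()
        gc.collect()
        return Counter(type(obj).__name__ for obj in gc.garbage)
    finally:
        gc.set_debug(old_flags)  # restore the caller's debug flags
        gc.garbage.clear()       # drop the saved references again


def make_cycle():
    """Create a list that refers to itself, then let it go out of scope."""
    a = []
    a.append(a)


if __name__ == "__main__":
    print(garbage_after(make_cycle))
```

The same helper could be pointed at the import itself, for example `garbage_after(lambda: importlib.import_module("pytorch_lightning"))` in a fresh interpreter (modules are cached, so only the first import is measured). The resulting type names might give a hint about where the cycles come from.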

Enolerobotti, Sep 16 '22 09:09

FYI, the same thing happens if I delete pl by adding a del pl statement on line 3.

Enolerobotti, Sep 16 '22 09:09