
Multi GPU Memory keeps increasing while training TFT

Open fabsen opened this issue 3 years ago • 5 comments

  • PyTorch-Forecasting version: 0.8.4
  • PyTorch version: 1.8.1+cu102
  • Python version: Python 3.6.11
  • Operating System: Linux

Expected behavior

I follow the TFT tutorial but want to train on multiple GPUs.

Actual behavior

RAM usage increases drastically over time until we get a memory error (Cannot allocate memory ...).

Switching to log_interval=-1 gets rid of the problem. Likewise, training on only one GPU does not increase RAM usage.
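Concretely, the workaround is just the log_interval argument when building the model. A minimal sketch, assuming the `training` TimeSeriesDataSet and the hyperparameters from the tutorial (the other values are only shown for context):

```python
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# Sketch of the workaround; `training` is the TimeSeriesDataSet from the TFT tutorial.
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
    output_size=7,            # number of quantiles predicted by QuantileLoss
    loss=QuantileLoss(),
    log_interval=-1,          # -1 instead of e.g. 10: no example-prediction figures are logged
    reduce_on_plateau_patience=4,
)
```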

Code to reproduce the problem

Steps that differ from the tutorial:

  1. Omit the "learning rate finder" part
  2. Add/replace these two arguments in pl.Trainer: gpus=[0, 1], accelerator='ddp' (a sketch follows this list)
  3. Increase max_epochs and relax early stopping so that training does not stop early
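For steps 2 and 3, the change relative to the tutorial's Trainer looks roughly like this. This is a sketch using the PyTorch Lightning 1.x API matching the versions above; the callbacks and the patience/max_epochs values are illustrative, not the exact ones I used:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor

# Callbacks as in the tutorial (values illustrative)
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=50, mode="min")
lr_logger = LearningRateMonitor()

trainer = pl.Trainer(
    max_epochs=300,            # step 3: large enough that early stopping never triggers
    gpus=[0, 1],               # step 2: train on two GPUs ...
    accelerator="ddp",         # ... using DistributedDataParallel
    gradient_clip_val=0.1,
    callbacks=[lr_logger, early_stop_callback],
)
trainer.fit(tft, train_dataloader, val_dataloader)  # dataloaders as defined in the tutorial
```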

/edit: For clarification: RAM usage keeps increasing, not VRAM (which is okay).

fabsen avatar May 05 '21 13:05 fabsen

Oh. This is interesting. Probably the figures are not properly closed. Thanks for pointing this out. I wonder if this is an issue related to the PyTorch Lightning TensorBoardLogger.
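If that is the cause, the general pattern would look something like this. A standalone illustration of the leak, not pytorch-forecasting's actual logging code:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, e.g. inside Docker or a DDP worker
import matplotlib.pyplot as plt

# matplotlib keeps a reference to every figure until it is explicitly closed, so
# figures created at each logging step accumulate in process RAM (not VRAM)
# unless they are closed after being handed to the logger.
for step in range(100):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    # ... figure would be passed to the TensorBoard logger here ...
    plt.close(fig)  # without this line, plt.get_fignums() grows by one per step

print(len(plt.get_fignums()))  # 0 when figures are closed properly
```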

jdb78 avatar May 09 '21 11:05 jdb78

This still seems to be an issue. I had been training in a Docker container and thus not seeing the plots.

After training completed outside a container, my system would almost crash from the sheer number of open figures. I will take a look at fixing the plot generation issue.

alexcolpitts96 avatar Apr 26 '22 15:04 alexcolpitts96

@fabsen Thank you for the log_interval=-1 workaround. I faced the same issue while training in DDP mode on 4x NVIDIA V100; it was a major hurdle for scalability. Libraries I'm using:

pytorch-forecasting==0.9.0
pytorch-lightning==1.6.5
torch==1.11.0
torchmetrics==0.5.0

sayanb-7c6 avatar Sep 12 '22 06:09 sayanb-7c6

I experienced the same issue today after upgrading my environment after a while. I was already using log_interval=-1.

I am not using DDP mode either. I do have a multi-GPU setup, but I only use one GPU at a time via os.environ["CUDA_VISIBLE_DEVICES"].
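For context, that single-GPU restriction looks roughly like this (the GPU index is arbitrary; the variable must be set before CUDA is first initialised in the process):

```python
import os

# Expose only one physical GPU to this process; set this before the first CUDA call.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # -> 1
```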

library versions are:

pytorch-forecasting==1.0.0
pytorch-lightning==2.1.1
torch==2.0.1
torchmetrics==1.2.0

galigutta avatar Nov 13 '23 06:11 galigutta

@galigutta I am facing the same problem. I tried log_interval=-1 but it did not make any difference. Were you able to solve it?

furkanbr avatar Jan 11 '24 06:01 furkanbr