pytorch-forecasting
Multi GPU Memory keeps increasing while training TFT
- PyTorch-Forecasting version: 0.8.4
- PyTorch version: 1.8.1+cu102
- Python version: Python 3.6.11
- Operating System: Linux
Expected behavior
I follow the TFT tutorial but want to train on multiple GPUs; training should complete with roughly constant RAM usage.
Actual behavior
RAM usage increases drastically over time until we get a memory error (Cannot allocate memory ...).
Changing to log_interval=-1 gets rid of the problem.
Training on a single GPU also does not increase RAM usage.
Code to reproduce the problem
Steps that differ from the tutorial:
- Omit the "learning rate finder" part
- Add/replace these two arguments in the pl.Trainer (see the sketch after this list):
gpus=[0, 1], accelerator='ddp'
- Increase max_epochs and relax early stopping so that training does not stop early
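For reference, a minimal sketch of the modified setup, assuming the dataset and dataloaders from the tutorial; the hyperparameter values are illustrative and accelerator='ddp' is the pytorch-lightning 1.x API:

```python
import pytorch_lightning as pl
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# `training`, `train_dataloader` and `val_dataloader` are assumed to be built as in the tutorial.
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    loss=QuantileLoss(),
    log_interval=10,  # setting this to -1 disables plot logging and works around the leak
)

trainer = pl.Trainer(
    max_epochs=100,      # increased so training does not stop early
    gpus=[0, 1],         # the two arguments that differ from the tutorial
    accelerator="ddp",
    gradient_clip_val=0.1,
)
trainer.fit(tft, train_dataloader, val_dataloader)
```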
/edit: For clarification: RAM usage keeps increasing, not VRAM (which is okay).
Oh, this is interesting. The figures are probably not closed properly. Thanks for pointing this out. I wonder if this is an issue related to the PyTorch Lightning TensorBoardLogger.
This still seems to be an issue. I had been training in a Docker container and thus never saw the plots.
When training outside a container, my system would almost crash after training completed because of the sheer number of open figures. I will take a look at fixing the plot generation issue.
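A minimal sketch of the pattern that avoids this kind of leak, assuming predictions are plotted with matplotlib and logged through the SummaryWriter exposed by the Lightning TensorBoardLogger; log_prediction_figure is a hypothetical helper, not the library's actual method:

```python
import matplotlib.pyplot as plt

def log_prediction_figure(logger, fig, global_step):
    # logger.experiment is the underlying TensorBoard SummaryWriter
    logger.experiment.add_figure("prediction", fig, global_step=global_step)
    # Without an explicit close, matplotlib keeps a reference to every logged
    # figure, so host RAM grows with each logging step until allocation fails.
    plt.close(fig)
```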
@fabsen Thank you for the log_interval=-1 workaround. I faced the same issue while training in DDP mode on 4x NVIDIA V100; this was a major hurdle for scalability. Libraries I'm using:
pytorch-forecasting==0.9.0 pytorch-lightning==1.6.5 torch==1.11.0 torchmetrics==0.5.0
I experienced the same issue today after upgrading my environment after a while. I was already using log_interval = -1.
I am not using DDP mode either; I have a multi-GPU setup but only use one GPU at a time via os.environ["CUDA_VISIBLE_DEVICES"].
Library versions:
pytorch-forecasting==1.0.0 pytorch-lightning==2.1.1 torch==2.0.1 torchmetrics==1.2.0
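For reference, a minimal sketch of how the process is pinned to a single GPU; the device index "0" is just an example:

```python
import os

# Must be set before CUDA is initialised (i.e. before any torch.cuda call),
# so that PyTorch only sees the selected device.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should report 1
```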
I am facing the same problem. I tried log_interval=-1 but it did not make any difference. Were you able to solve it?