RESOURCE_EXHAUSTED error after multiple hours of training
Problem
I'm running the main.py training script inside the Docker container built from the Dockerfile. After a few hours of training (~12 h, ~800k timesteps) on an RTX 3090, I get the following error:
ValueError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 14296180248 bytes.
Setup
The system metrics look stable and don't show any signs of running out of VRAM / RAM.
The command I'm using to start the training is the following (tetris is a custom environment I implemented; it uses a config comparable to Atari):
/app/dreamerv3/main.py --logdir /root/logdir/tetris-dreamer/{timestamp} --configs tetris size200m
tetris:
  task: tetris_rgb
  wandb_project: 'dreamer-tetris'
  (enc|dec).simple.minres: 6
  env.tetris.size: [96, 96]
  run:
    steps: 200e6
    train_ratio: 64
  enc.spaces: 'image'
  dec.spaces: 'image'
Questions
This error looks like the GPU running out of memory. The DreamerV3 paper mentions that training was conducted on an A100, but it's unclear how much VRAM (the 40 GB or the 80 GB version?) and RAM are required to train the model. I suspect the 24 GB of the RTX 3090 is not enough for the 200M model, but it could also be a mistake in my setup. Are there heuristics to get a rough memory estimate for the different model and batch sizes before starting training, to avoid a failure after multiple hours?
Thank you for maintaining the project!
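For the replay-buffer side of such an estimate, a back-of-the-envelope calculation from the observation size is one option. This is only a sketch: the uint8 image storage and the per-step overhead below are assumptions, not figures taken from the DreamerV3 code.

```python
def replay_ram_gb(replay_size, obs_shape=(96, 96, 3), extra_bytes=64):
    """Rough RAM estimate for an image replay buffer.

    replay_size : number of stored timesteps (replay.size)
    obs_shape   : image observation shape; uint8 storage assumed
    extra_bytes : guessed per-step overhead (actions, rewards, flags)
    """
    bytes_per_step = 1
    for dim in obs_shape:
        bytes_per_step *= dim  # one byte per uint8 pixel value
    bytes_per_step += extra_bytes
    return replay_size * bytes_per_step / 1024**3

# Example: 5M steps of 96x96x3 uint8 frames.
print(round(replay_ram_gb(5_000_000), 1))  # → 129.0 (GB)
```

Even under these optimistic assumptions, a multi-million-step image buffer can dwarf typical system RAM, which matches the resolution later in this thread.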
Most likely the replay buffer does not fit into the RAM of your machine. You can try training with replay.size: 1e6.
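Assuming the repo's flag parser accepts dotted config overrides on the command line (an assumption; otherwise set replay.size in the YAML config above), the original command would become:

```shell
/app/dreamerv3/main.py --logdir /root/logdir/tetris-dreamer/{timestamp} --configs tetris size200m --replay.size 1e6
```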
Thank you, but the system / process memory in use are really low, it seems to be more than enough. Am I missing something?
I was able to resolve the issue. For anyone else running into a similar problem: it is indeed the system running out of memory (RAM) because the replay buffer grows too large. The confusion comes from the W&B logging, because the "System" tab shows inaccurate numbers. You can find out how large the replay buffer actually is by checking the replay/ram_gb and replay/items metrics.
From those you can calculate GB per item and work out what replay buffer size (replay.size) will fit your machine. The Process Memory In Use (non-swap) (%) metric in the W&B "System" tab, however, appears inaccurate: it reports more memory than is actually available.
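That calculation can be sketched as below. The metric values are hypothetical example readings; substitute the replay/ram_gb and replay/items values from your own run and your own RAM budget.

```python
# Example readings from W&B (hypothetical values, taken at the same step):
ram_gb = 12.4       # replay/ram_gb
items = 450_000     # replay/items
budget_gb = 48.0    # RAM you can afford to spend on the buffer

# Memory cost per stored item, then the largest replay.size that fits.
gb_per_item = ram_gb / items
max_size = int(budget_gb / gb_per_item)
print(f"{gb_per_item * 1e6:.1f} GB per 1e6 items -> replay.size <= {max_size}")
```

Leaving some headroom below the computed maximum is sensible, since the process also needs RAM for everything besides the replay buffer.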