
RESOURCE_EXHAUSTED error after multiple hours of training

Open Max-We opened this issue 1 year ago • 2 comments

Problem

I'm running the main.py training script inside the Docker container built from the Dockerfile, and after a few hours of training (~12 h, ~800k timesteps) on an RTX 3090 I get the following error:

ValueError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 14296180248 bytes.

Setup

The system metrics look stable and don't show any signs of running out of VRAM / RAM:

[Screenshots: W&B system metrics from 2024-09-03 21:41]

The command I'm using to start training, together with the config it loads, is the following (tetris is a custom environment I implemented; its config is comparable to the Atari one):

/app/dreamerv3/main.py --logdir /root/logdir/tetris-dreamer/{timestamp} --configs tetris size200m
tetris:
  task: tetris_rgb
  wandb_project: 'dreamer-tetris'
  (enc|dec).simple.minres: 6
  env.tetris.size: [96, 96]
  run:
    steps: 200e6
    train_ratio: 64
  enc.spaces: 'image'
  dec.spaces: 'image'

Questions

It looks like this error is due to the GPU running out of memory. The DreamerV3 paper mentions that training was conducted on an A100, but it's unclear how much VRAM (the 40 GB or 80 GB version?) and system RAM are required to train the model. I suspect the 24 GB RTX 3090 is not enough for the 200M model, but it could also be an error in the implementation. Are there heuristics to get a rough estimate for the different model sizes / batch sizes before starting training, to avoid failure after multiple hours?
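For reference, my rough back-of-envelope arithmetic so far (assuming the replay buffer keeps uncompressed uint8 frames, which I have not verified against the implementation) looks like this:

# Rough, unverified arithmetic: failed allocation and per-step replay cost.
failed_alloc_bytes = 14_296_180_248
print(f"Failed allocation: {failed_alloc_bytes / 2**30:.1f} GiB")  # ~13.3 GiB

# Assuming the replay buffer stores raw uint8 image observations:
h, w, c = 96, 96, 3          # env.tetris.size with RGB channels
bytes_per_step = h * w * c   # ~27 KB per stored frame (ignores actions/rewards)
replay_size = 1e6            # hypothetical replay.size value
print(f"Replay RAM: ~{replay_size * bytes_per_step / 2**30:.1f} GiB")  # ~25.7 GiB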

Thank you for maintaining the project!

Max-We avatar Sep 03 '24 19:09 Max-We

Most likely the replay buffer does not fit into the RAM of your machine. You can try training with replay.size: 1e6.
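Assuming you launch with the same command as above, the value should be overridable directly from the command line (same dotted keys as in the YAML config), for example:

/app/dreamerv3/main.py --logdir /root/logdir/tetris-dreamer/{timestamp} --configs tetris size200m --replay.size 1e6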

danijar avatar Sep 04 '24 22:09 danijar

Thank you, but the system / process memory in use is really low; there seems to be more than enough. Am I missing something?

Max-We avatar Sep 05 '24 09:09 Max-We

I was able to resolve the issue. For anyone else running into a similar problem: it is indeed the system running out of memory (RAM) because the replay buffer grows too large. The confusion comes from the W&B logging, since the "System" tab shows inaccurate numbers. You can find out how large the replay buffer is by checking the replay/ram_gb and replay/items metrics.

[Screenshot: W&B charts showing replay/ram_gb and replay/items]

That way, you can calculate GB per item and work out which replay buffer size (replay.size) will fit on your machine. Note that the Process Memory In Use (non-swap) (%) metric in the W&B "System" tab appears inaccurate, as it reported more memory than was actually available.
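Concretely, the calculation looks something like this (the numbers below are placeholders; read the actual values off the replay/ram_gb and replay/items charts for your run):

# Placeholder values: read these from the replay/ram_gb and replay/items charts.
observed_ram_gb = 10.0     # replay/ram_gb at some point during training
observed_items = 400_000   # replay/items at the same point
ram_budget_gb = 40.0       # RAM you are willing to give the replay buffer

gb_per_item = observed_ram_gb / observed_items
max_replay_size = int(ram_budget_gb / gb_per_item)
print(f"~{gb_per_item * 1e6:.0f} GB per 1M items; replay.size should stay below ~{max_replay_size}")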

Max-We avatar Oct 08 '24 15:10 Max-We