stylegan2-pytorch

VRAM usage unexpectedly high during training (RTX 3090, Windows, PyTorch 1.7.1, Python 3.8.3)


I read that training on 1024x1024 images requires around 16 GB of GPU memory, but it seems like training is using much more VRAM than it should.

First of all, when my system is idle, VRAM usage is at 0.6 GB.

Then I tried stylegan2_pytorch --data train_data --network-capacity 16 --name model256 --image-size 256 to train at 256x256. VRAM usage first rises a bit and then climbs to 18.2 GB (see the image), which means training is using around (18.2 - 0.6) = 17.6 GB of VRAM, just for 256x256? Image: Task Manager at 256x256

I also tried to train at 1024x1024 with stylegan2_pytorch --data train_data --network-capacity 16 --name model1024 --image-size 1024, and got: RuntimeError: CUDA out of memory. Tried to allocate 1.25 GiB (GPU 0; 24.00 GiB total capacity; 20.94 GiB already allocated; 145.31 MiB free; 21.50 GiB reserved in total by PyTorch)
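As an aside, the figures in that error message can be double-checked from inside Python: PyTorch's caching allocator reserves more memory than it has actually allocated, and Task Manager additionally counts the CUDA context and any other processes on the GPU. A minimal sketch using the standard torch.cuda reporting calls (not code from this repo) that could be dropped into a training loop:

```python
import torch

def report_vram(tag=""):
    # Memory actually held by live tensors vs. memory reserved by PyTorch's
    # caching allocator. Task Manager shows at least the reserved amount,
    # plus the CUDA context and any other processes using the GPU.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{tag} allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")

# Example: call every N training steps to see how usage grows.
report_vram("after step")
```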

The training setting I am using now is stylegan2_pytorch --data train_data --aug-prob 0.3 --aug-types [translation,cutout,color] --network-capacity 16 --name mo256 --image-size 256 --save_every 100 --evaluate_every 100 --attn-layers 1 --no_pl_reg --top-k-training, and with it VRAM usage is 23.0 GB out of 24.0 GB.

Is this behavior expected, or is something wrong? RTX 3090 Founders Edition, Windows 10, PyTorch 1.7.1 (CUDA 11.0), Python 3.8.3.

chiwing4 commented Feb 09 '21 16:02

You should likely just reduce the batch size. If you go very low (i.e. down to small single-digit batch sizes), you may want to accumulate gradients every 2 / 4 / 8 batches. I can run it on my 10 GB 3080 at 1024 using batch size 1 and a network capacity of 8, with no attention.

RhynoTime commented Feb 21 '21 04:02
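For reference, the suggestion above would look something like stylegan2_pytorch --data train_data --name model1024 --image-size 1024 --network-capacity 8 --batch-size 1 --gradient-accumulate-every 8, i.e. a batch size of 1 with gradients accumulated over 8 batches to keep the effective batch size up, reduced network capacity, and no --attn-layers. The flag names are as I recall them from the repo's README; check stylegan2_pytorch --help for the exact spelling on your version.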

I'm also seeing the same issue with a 3090 on Ubuntu 20.04. I tried using apex; it lowered the initial memory load, but I ran into NaN issues which crashed the training.

jbartolozzi commented Mar 17 '21 18:03
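If the point of apex was mixed precision, the native torch.cuda.amp API that ships with PyTorch 1.7 may be worth trying instead: its GradScaler skips optimizer steps whose gradients contain inf/NaN rather than letting them corrupt the weights. The sketch below is the generic autocast/GradScaler pattern, not this repo's actual training loop; model, optimizer, loss_fn and images are assumed to exist.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

def train_step(model, optimizer, loss_fn, images):
    optimizer.zero_grad()
    with autocast():                   # forward pass runs in mixed precision
        loss = loss_fn(model(images))
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)             # unscales grads; step is skipped on inf/NaN
    scaler.update()                    # adjust the scale factor dynamically
    return loss.item()
```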