
8 GB Insufficient to Train Image Size of 512?

Open athenawisdoms opened this issue 3 years ago • 5 comments

Hello again @lucidrains & StyleGANers!

I tried training with --network-capacity 10 --attn-layers 1 --batch-size 1 --gradient-accumulate-every 32 --image-size 512 on an Nvidia 2070 Super with 8 GB of VRAM. Training runs for about 5000 iterations in just under 30 hours, then suddenly crashes with the error

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.80 GiB total capacity; 6.53 GiB already allocated; 71.00 MiB free; 6.92 GiB reserved in total by PyTorch)

Tried resuming the training but got a similar error within 24 iterations:

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 7.80 GiB total capacity; 6.54 GiB already allocated; 59.00 MiB free; 6.94 GiB reserved in total by PyTorch)

Using --fp16 does not seem to reduce GPU memory usage, appears to be slightly slower than running without it, and tends to produce NaN errors in my limited tries with it.
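For reference, the usual way to get fp16 memory/speed savings without NaNs is PyTorch's automatic mixed precision with dynamic loss scaling. I don't know whether this repo's --fp16 flag works this way internally, so this is only a minimal sketch of the general pattern (tiny stand-in model, data, and loss, just to make it concrete):

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Tiny stand-in model and synthetic data -- not the repo's networks.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.Flatten(),
                      nn.Linear(16 * 64 * 64, 1)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scaler = GradScaler()

for step in range(10):
    real_images = torch.randn(4, 3, 64, 64, device='cuda')
    optimizer.zero_grad()
    with autocast():                      # forward pass runs in fp16 where it is safe
        loss = model(real_images).mean()  # placeholder loss
    scaler.scale(loss).backward()         # scale the loss so fp16 grads don't underflow
    scaler.step(optimizer)                # unscales grads; skips the step on inf/NaN
    scaler.update()                       # adapts the loss scale over time
```

The GradScaler is what keeps gradients from silently turning into NaNs, which plain fp16 casting tends to do.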

  1. Is it possible to let PyTorch reserve more than 6.92 GiB of memory? The card has roughly another 0.9 GiB available, and nvidia-smi reports only 1 MB of memory in use on this card when PyTorch is not running (see the allocator-inspection sketch after this list).

  2. If not, which parameters would you suggest changing so that we can keep training the model with --image-size 512? Batch size is already 1, and network-capacity is already quite low compared to the default.
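For question 1 above: PyTorch exposes its allocator statistics directly, so you can see how much of the "reserved" pool is actually held by live tensors versus just cached. A minimal sketch:

```python
import torch

def report_cuda_memory(device=0):
    # Bytes currently held by live tensors.
    allocated = torch.cuda.memory_allocated(device)
    # Bytes reserved (cached) by PyTorch's allocator, including free cached blocks.
    reserved = torch.cuda.memory_reserved(device)
    # Total memory on the device, as reported by the driver.
    total = torch.cuda.get_device_properties(device).total_memory
    gib = 2 ** 30
    print(f"allocated {allocated / gib:.2f} GiB | "
          f"reserved {reserved / gib:.2f} GiB | "
          f"total {total / gib:.2f} GiB")

report_cuda_memory()
# torch.cuda.empty_cache() returns cached-but-unused blocks to the driver,
# but it cannot free memory that live tensors still occupy.
```

The gap between reserved and total is typically the CUDA context plus whatever the driver/desktop holds, so it usually can't be reclaimed from inside PyTorch.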

Thank you!

athenawisdoms avatar Sep 25 '20 14:09 athenawisdoms

@athenawisdoms Hi Athena! My recommendation would be to buy Google Colab Pro for $10 a month https://colab.research.google.com/signup, roll a V100 (16 GB), and train on there

lucidrains avatar Sep 25 '20 17:09 lucidrains

This same problem happened to me. Different data configuration and GPU (RTX 2070), but I also hit this error at 5024 iterations. I sank a ton of time into getting it set up on my Windows machine because I'm inexperienced. Using Google Colab Pro was well worth the cost; I wish I had started there. The V100 is not really any faster than my GPU, but it only took a few minutes to set up and deploy. Plus I can use my computer again!

terriblewitlogic avatar Oct 22 '20 01:10 terriblewitlogic

FYI, if you use a consumer GPU, 20% of the VRAM is reserved for Windows by Nvidia's driver. This is for display purposes. Titans/Quadros can disable this, but cannot display anything with it disabled. V100 and similar GPUs don't have a display output, so this isn't a consideration for them.

bob80333 avatar Oct 26 '20 00:10 bob80333

A little late, but I wanted to drop in and say I'm running fine at 512x512 on a GTX 1080 with 8 GB of VRAM using version 1.5.1 of this repository. Currently at 39k iterations. VRAM usage is holding steady at 7.5 GB for me, so it's certainly cutting it close.

I'm using the following params: --name 21k-512-aug-fp16-1.5.1 --data ../stylegan2/dataset --image-size 512 --fp16 --aug-prob 0.3 --aug-types [translation,cutout,color] --top-k-training --calculate-fid-every 5000 --batch-size 3 --gradient-accumulate-every 8

Also, fmap_max seems to use up a good chunk of memory; I set it lower in cases where I was seeing OOM errors.
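For context on fmap_max: as I understand it, it caps the per-layer channel counts, which otherwise grow with network_capacity toward the low-resolution layers, so lowering it mostly trims the widest layers. A rough sketch of that effect (illustrative numbers, not necessarily the repo's exact computation):

```python
# Illustrative only: how a channel cap (fmap_max) trims the widest layers.
from math import log2

image_size = 512
network_capacity = 10
num_layers = int(log2(image_size)) - 1   # 8 resolution levels for 512x512
fmap_max = 512                           # cap on channels per layer

uncapped = [network_capacity * 2 ** (i + 1) for i in range(num_layers)][::-1]
capped = [min(f, fmap_max) for f in uncapped]

print(uncapped)  # [2560, 1280, 640, 320, 160, 80, 40, 20] -- low-res layers dominate
print(capped)    # [512, 512, 512, 320, 160, 80, 40, 20]
```

Most of the memory savings come from capping those few low-resolution, high-channel layers.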

Hope it helps

trufty avatar Nov 26 '20 04:11 trufty

I have found the same thing: training crashes at 5024 iterations. This is caused by the automatic enabling of an additional loss at that point, namely path length regularization (PL). So you should choose the batch size with this in mind, since more memory will be consumed later in training. I saw about 7 GB total consumption before 5024 iterations, and about 4 GB more after :) Image size 128, batch size 12, gradient accumulation 4.
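If I remember the training loop correctly, the path length penalty is skipped during a warm-up and then applied only every so many steps, which is why the jump shows up exactly at 5024 (the first step past 5000 that is divisible by 32). A hedged sketch of the schedule and of why the penalty costs extra memory; this is not the repo's exact code, and the 5000/32 constants are my recollection of the defaults:

```python
import torch

def apply_path_penalty(step, warmup=5000, every=32):
    # The penalty is skipped during warm-up, then applied periodically.
    return step > warmup and step % every == 0

# First step at which the penalty fires with these (assumed) defaults:
print(next(s for s in range(5001, 5100) if apply_path_penalty(s)))  # -> 5024

def path_length_penalty(images, w_styles):
    # Gradient of a noise-weighted image sum w.r.t. the style codes.
    # create_graph=True keeps a second-order graph so the penalty can itself
    # be backpropagated -- that extra graph is the additional memory cost.
    noise = torch.randn_like(images) / (images.shape[2] * images.shape[3]) ** 0.5
    grad, = torch.autograd.grad(
        outputs=(images * noise).sum(), inputs=w_styles, create_graph=True)
    return grad.flatten(1).norm(dim=1)  # per-sample path lengths
```

So the practical advice stands: size the batch for the memory you'll need after the PL term kicks in, not for the first 5000 steps.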

GLivshits avatar Jun 22 '21 06:06 GLivshits