
RuntimeError: Cuda out of memory at 5024th iteration


I am training on my dataset with the following command:

stylegan2_pytorch --data trees_128 ^
    --name stylegan4trees_128_12_17 ^
    --image_size 128 ^
    --network_capacity 16 ^
    --batch_size 8 ^
    --gradient_accumulate_every 16 ^
    --num_train_steps 100000 ^
    --save_every 500 ^
    --evaluate_every 1000
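
For reference, since gradient accumulation is meant to simulate a larger batch, I believe the effective batch size here is batch_size × gradient_accumulate_every = 8 × 16 = 128.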

but I hit a CUDA out-of-memory error at iteration 5024.

stylegan4trees_128_12_17<trees_128>:   5%|███████              | 5024/100000 [03:02<218:45:32,  8.2
stylegan4trees_128_12_17<trees_128>:   5%|███████              | 5024/100000 [03:10<209:28:11,  7.94s/it]
Traceback (most recent call last):
  File "cli.py", line 229, in <module>
    main()
  File "cli.py", line 226, in main
    log=config['log']
  File "cli.py", line 168, in train_from_folder
    run_training(0, 1, model_args, data, load_from, new, num_train_steps, name, seed)
  File "cli.py", line 57, in run_training
    retry_call(model.train, tries=3, exceptions=NanException)
  File "C:\Tool\Anaconda3\envs\py36\lib\site-packages\retry\api.py", line 101, in retry_call
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter, logger)
  File "C:\Tool\Anaconda3\envs\py36\lib\site-packages\retry\api.py", line 33, in __retry_internal
    return f()
  File "C:\Users\fan\Desktop\fun\GAN\stylegan2\stylegan2_pytorch\stylegan2_pytorch.py", line 1000, in train
    pl_lengths = calc_pl_lengths(w_styles, generated_images)
  File "C:\Users\fan\Desktop\fun\GAN\stylegan2\stylegan2_pytorch\stylegan2_pytorch.py", line 202, in calc_pl_lengths
    create_graph=True, retain_graph=True, only_inputs=True)[0]
  File "C:\Tool\Anaconda3\envs\py36\lib\site-packages\torch\autograd\__init__.py", line 204, in grad
    inputs, allow_unused)
RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 8.00 GiB total capacity; 4.93 GiB already allocated; 0 bytes free; 5.13 GiB reserved in total by PyTorch)

This happens both on Windows 10 (PyTorch 1.7, CUDA 11.0, NVIDIA RTX 3070) and on Ubuntu 18.04 (PyTorch 1.7, CUDA 10.2, 3× RTX 2080 Ti). On both machines the error occurs at exactly iteration 5024.

aniief avatar Dec 20 '20 06:12 aniief

@zjfan7 Hi ZJ! The reason is probably that path length regularization turns on a while after training starts, and it can be quite memory-intensive.

If you are stretched for resources, you can try turning it off with the --no-pl-reg flag.
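
For reference, the step that blows up is roughly this (a simplified sketch of calc_pl_lengths from your traceback, not the exact code); the create_graph=True call is what makes it memory-hungry:

    import torch

    def calc_pl_lengths(styles, images):
        # Path length regularization sketch: perturb the generated images with
        # scaled noise, then differentiate back to the style codes.
        num_pixels = images.shape[2] * images.shape[3]
        pl_noise = torch.randn_like(images) / (num_pixels ** 0.5)
        outputs = (images * pl_noise).sum()
        # create_graph=True records the backward pass itself so the penalty can
        # be backpropagated through again; this holds extra activations in memory.
        pl_grads = torch.autograd.grad(
            outputs=outputs, inputs=styles,
            create_graph=True, retain_graph=True, only_inputs=True)[0]
        return (pl_grads ** 2).sum(dim=2).mean(dim=1).sqrt()

--no-pl-reg simply skips this step entirely, so memory usage stays roughly where it was before the penalty kicked in.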

lucidrains avatar Dec 20 '20 18:12 lucidrains

It worked! Thank you very much! With the --no-pl-reg flag the program has not crashed so far and has now reached iteration 13k. I will read your code carefully to learn from it. Thank you!

aniief avatar Dec 22 '20 04:12 aniief