UGATIT-pytorch

CUDA out of memory (with light flag)

artempimushkin opened this issue · 9 comments

Hi guys! I'm using an RTX 2080 Ti (11 GB). At first I tried to train on a dataset of 100K images (1000px) with the --light flag, and at around step 1000 I got a "CUDA out of memory" error. Then I tried a smaller dataset of 10K images (256px) and got the same error at the same step. Finally I tried 3,400 images (256px), and nothing changed.

Here is the output:

```
[  997/1000000] time: 582.6236 d_loss: 0.00474171, g_loss: 1344.68078613
[  998/1000000] time: 583.2094 d_loss: 0.00624988, g_loss: 1328.24572754
[  999/1000000] time: 583.7950 d_loss: 0.00641153, g_loss: 1374.71826172
[ 1000/1000000] time: 584.3810 d_loss: 0.00178387, g_loss: 1280.08032227
/home/p0wx/prj/UGATIT-pytorch/utils.py:46: RuntimeWarning: invalid value encountered in true_divide
  cam_img = x / np.max(x)
Traceback (most recent call last):
  File "main.py", line 83, in <module>
    main()
  File "main.py", line 75, in main
    gan.train()
  File "/home/p0wx/prj/UGATIT-pytorch/UGATIT.py", line 209, in train
    fake_B2B, fake_B2B_cam_logit, _ = self.genA2B(real_B)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/prj/UGATIT-pytorch/networks.py", line 108, in forward
    out = self.UpBlock2(x)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/prj/UGATIT-pytorch/networks.py", line 191, in forward
    out = self.rho.expand(input.shape[0], -1, -1, -1) * out_in + (1-self.rho.expand(input.shape[0], -1, -1, -1)) * out_ln
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.32 GiB already allocated; 5.56 MiB free; 621.27 MiB cached)
```

artempimushkin avatar Aug 15 '19 16:08 artempimushkin

The same error appears in my run, at step 3000 (on a 1080 Ti). I don't know why.

Frizy-up avatar Aug 20 '19 14:08 Frizy-up

> The same error appears in my run, at step 3000 (on a 1080 Ti). I don't know why.

I know why the error appeared at step 3000: at that step one epoch has finished, but the memory is not released, and about 100+ MB more is needed to start a new data loader. When I changed the input image size to something smaller, it worked.
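If it helps, the repo exposes an --img_size flag (default 256) for exactly this. A hypothetical launch line, with the dataset name as a placeholder:

```
python main.py --dataset YOUR_DATASET --light True --img_size 128
```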

Frizy-up avatar Aug 21 '19 01:08 Frizy-up

> The same error appears in my run, at step 3000 (on a 1080 Ti). I don't know why.

> I know why the error appeared at step 3000: at that step one epoch has finished, but the memory is not released, and about 100+ MB more is needed to start a new data loader. When I changed the input image size to something smaller, it worked.

So every 1000 steps it requires 100+ MB more and doesn't release it? Asking since I'm facing the same problem, but at step 2000, and my images are 256x256.

DaddyWesker avatar Aug 23 '19 09:08 DaddyWesker

I hit the same error in my test. When I set print_freq = 10000, it works.
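For reference, print_freq is one of the repo's command-line flags, and its default of 1000 matches the step at which the crashes above occur, so the workaround can also be passed at launch (dataset name is a placeholder):

```
python main.py --dataset YOUR_DATASET --light True --print_freq 10000
```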

lxy2017 avatar Oct 25 '19 07:10 lxy2017

Apparently there is a bug in PyTorch: when you open a new DataLoader, the old one does not seem to be released. I have run into this many times.
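One pattern that avoids constructing DataLoader objects repeatedly is to build the loader once and pull batches from a wrapping generator. A minimal sketch with a stand-in dataset (whether this cures the GPU-side OOM reported here is unverified):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def infinite_batches(loader):
    """Yield batches forever, restarting the same loader when an epoch ends."""
    while True:
        for batch in loader:
            yield batch

dataset = TensorDataset(torch.randn(8, 3, 64, 64))  # stand-in for the real dataset
loader = DataLoader(dataset, batch_size=1)           # constructed once, reused forever
batches = infinite_batches(loader)

for step in range(20):
    (real_A,) = next(batches)  # crosses epoch boundaries without a new DataLoader
```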

heartInsert avatar Nov 20 '19 09:11 heartInsert

You can open UGATIT.py and add `with torch.no_grad():` around the sampling code under `step % print_freq`!
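Roughly what that change looks like inside the train loop. This is a paraphrase, not a verbatim patch: genA2B, genB2A, real_A, and real_B appear in the traceback above, but the exact surrounding lines in UGATIT.py may differ:

```python
# Sketch only: paraphrased from the repo's periodic sampling block.
if step % self.print_freq == 0:
    self.genA2B.eval(), self.genB2A.eval()
    with torch.no_grad():                    # sampling only, so skip autograd bookkeeping
        fake_A2B, _, _ = self.genA2B(real_A)
        fake_B2A, _, _ = self.genB2A(real_B)
    self.genA2B.train(), self.genB2A.train()
```

Without no_grad, each periodic sampling pass records full autograd graphs for every generated image, which is a plausible source of the memory growth at print_freq boundaries.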

07hyx06 avatar Nov 23 '19 08:11 07hyx06

> You can open UGATIT.py and add `with torch.no_grad():` around the sampling code under `step % print_freq`!

The code already calls self.genA2B.eval(), though.
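Worth noting: in PyTorch, eval() and no_grad() are independent. eval() only switches layer behaviour (dropout, batch norm), while no_grad() is what stops autograd from recording graphs, so the suggestion above is not redundant. A self-contained check:

```python
import torch
import torch.nn as nn

net = nn.Linear(10, 10)
net.eval()  # switches layer behaviour only; autograd is unaffected

x = torch.randn(1, 10)
out = net(x)
print(out.requires_grad)  # True: the weights require grad, so a graph is still recorded

with torch.no_grad():
    out = net(x)
print(out.requires_grad)  # False: no graph is recorded, so nothing accumulates
```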

scutlrr avatar Apr 22 '20 13:04 scutlrr

I was getting the "CUDA out of memory" error at the beginning of training. I solved it by setting a lower base channel count than the default (via --ch): I used 32, while the default is 64.
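For anyone searching, a launch line along those lines (dataset name is a placeholder):

```
python main.py --dataset YOUR_DATASET --light True --ch 32
```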

shafeeq07 avatar Jun 15 '20 18:06 shafeeq07

@shafeeqbsse

> I was getting the "CUDA out of memory" error at the beginning of training. I solved it by setting a lower base channel count than the default (via --ch): I used 32, while the default is 64.

Can you reproduce the results from the paper? My results are bad with this network (ch=32) :(

nzhang258 avatar Jul 23 '20 02:07 nzhang258