UGATIT-pytorch
CUDA out of memory (with light flag)
Hi guys! I'm using an RTX 2080 Ti (11 GB). At first I tried to train on a dataset of 100K images (1000 px) with the --light flag, and after the 1000th step I got a "CUDA out of memory" error. Then I tried a smaller dataset of 10K images (256 px) and got the same error after the 1000th step. Finally I tried 3,400 images (256 px), and nothing changed.
Here is an output:
```
[ 997/1000000] time: 582.6236 d_loss: 0.00474171, g_loss: 1344.68078613
[ 998/1000000] time: 583.2094 d_loss: 0.00624988, g_loss: 1328.24572754
[ 999/1000000] time: 583.7950 d_loss: 0.00641153, g_loss: 1374.71826172
[ 1000/1000000] time: 584.3810 d_loss: 0.00178387, g_loss: 1280.08032227
/home/p0wx/prj/UGATIT-pytorch/utils.py:46: RuntimeWarning: invalid value encountered in true_divide
  cam_img = x / np.max(x)
Traceback (most recent call last):
  File "main.py", line 83, in <module>
    main()
  File "main.py", line 75, in main
    gan.train()
  File "/home/p0wx/prj/UGATIT-pytorch/UGATIT.py", line 209, in train
    fake_B2B, fake_B2B_cam_logit, _ = self.genA2B(real_B)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/prj/UGATIT-pytorch/networks.py", line 108, in forward
    out = self.UpBlock2(x)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/p0wx/.local/share/virtualenvs/p0wx-AcgHNkMk/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/p0wx/prj/UGATIT-pytorch/networks.py", line 191, in forward
    out = self.rho.expand(input.shape[0], -1, -1, -1) * out_in + (1-self.rho.expand(input.shape[0], -1, -1, -1)) * out_ln
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.32 GiB already allocated; 5.56 MiB free; 621.27 MiB cached)
```
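As an aside, the `RuntimeWarning: invalid value encountered in true_divide` from `utils.py` is separate from the OOM: it means `np.max(x)` was 0 (or `x` contained NaNs) when normalizing the CAM. A minimal guarded version of that line could look like this (`normalize_cam` and `eps` are hypothetical names, not part of the repo):

```python
import numpy as np

# Hypothetical safer version of utils.py's `cam_img = x / np.max(x)`:
# guard against a zero (or non-finite) maximum before dividing.
def normalize_cam(x, eps=1e-8):
    m = np.max(x)
    if not np.isfinite(m) or m <= eps:
        # Degenerate CAM: return an all-zero map instead of NaNs.
        return np.zeros_like(x, dtype=np.float64)
    return x / m
```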
The same error appeared for me at step 3000 of training (1080 Ti). I don't know why.
I found out why the error appeared at step 3000: at that step one epoch has finished, but the memory is not released, and about 100+ MB more is needed to start a new data loader. When I changed the input image size to something smaller, it worked.
So every 1000th step it requires 100+ MB more and doesn't release it? Asking because I'm facing the same problem, but at step 2000, and my images are 256x256.
I hit the same error. When I set print_freq = 10000, it works.
Apparently there is a bug in PyTorch: when you open a new DataLoader, the old one does not seem to be released. I have run into this many times.
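The usual workaround is to create the loader once and restart iteration on the same object instead of constructing a new DataLoader (with its worker processes and buffers) every epoch. A minimal sketch of that pattern, using a plain Python stand-in for the loader since the exact training loop varies:

```python
# Stand-in for a torch DataLoader; the point is to build it ONCE and
# restart iteration on StopIteration, rather than constructing a new
# loader object (and leaking the old one) at every epoch boundary.
def make_loader(dataset, batch_size=4):
    def batches():
        for i in range(0, len(dataset), batch_size):
            yield dataset[i:i + batch_size]
    return batches

dataset = list(range(10))
loader = make_loader(dataset)        # created once, outside the step loop

seen = []
data_iter = iter(loader())
for step in range(6):                # endless step loop, as in UGATIT's train()
    try:
        batch = next(data_iter)
    except StopIteration:            # epoch finished: rewind the SAME loader
        data_iter = iter(loader())
        batch = next(data_iter)
    seen.append(batch)
```

With a real `torch.utils.data.DataLoader`, the same shape applies: keep one loader object and call `iter(loader)` again when the iterator is exhausted.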
You can open UGATIT.py and add "with torch.no_grad():" around the code inside the "step % print_freq" block!
> You can open UGATIT.py and add "with torch.no_grad():" around the code inside the "step % print_freq" block!

The code already calls self.genA2B.eval() there.
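Note that eval() and no_grad() do different things, which is why the suggestion can still help: eval() only switches layer behavior (BatchNorm, Dropout), while no_grad() stops autograd from recording the graph. A small sketch of the idea (the Linear layer is just a stand-in for self.genA2B, not the real network):

```python
import torch

# Sketch of the suggested fix (assumption: wrap only the periodic
# sampling branch, not the training updates). eval() switches layer
# modes; torch.no_grad() is what actually skips building the autograd
# graph, so the preview images don't hold extra activations in memory.
net = torch.nn.Linear(4, 4)          # stand-in for self.genA2B
x = torch.randn(1, 4)

net.eval()
with torch.no_grad():
    y = net(x)

assert not y.requires_grad           # no graph was recorded for the preview
```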
I was getting the "CUDA out of memory" error at the beginning of training. I solved it by setting a lower base channel number than the default via --ch: I used 32, while the default is 64.
Can you reproduce the results from the paper? My results are bad with this network (ch=32) :(
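For a rough sense of why lowering --ch helps so much: a convolution's weight count scales with the product of its input and output channels, so halving the base width roughly quarters conv memory. A back-of-envelope sketch (this is generic conv arithmetic, not UGATIT's exact layer list):

```python
# Rough estimate: a k x k conv from c_in to c_out channels holds
# k*k*c_in*c_out weights plus c_out biases. Since internal widths are
# multiples of the base channel number, halving --ch roughly quarters
# the parameter (and activation) memory of the conv layers.
def conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out + c_out

full = conv_params(64, 64)    # a 64->64 conv at the default --ch 64
light = conv_params(32, 32)   # the same layer at --ch 32
ratio = full / light          # close to 4x fewer parameters
```

The flip side, as noted above, is reduced capacity, which likely explains the worse results at ch=32.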