
Runtime error when training joint model

ShnitzelKiller opened this issue 5 years ago · 7 comments

When I attempt to train the joint model (mode 4), I get the following runtime error:

start training...



Training epoch: 1
Traceback (most recent call last):
  File "train.py", line 2, in <module>
    main(mode=1)
  File "/projects/grail/jamesn8/projects/inpainting/edge-connect/main.py", line 56, in main
    model.train()
  File "/projects/grail/jamesn8/projects/inpainting/edge-connect/src/edge_connect.py", line 179, in train
    self.edge_model.backward(e_gen_loss, e_dis_loss)
  File "/projects/grail/jamesn8/projects/inpainting/edge-connect/src/models.py", line 148, in backward
    gen_loss.backward()
  File "/local1/jamesn8/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/local1/jamesn8/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

I am able to train the edge model, inpaint model, and edge-inpaint model. And in the paper, your results were obtained using the joint model, with both networks being updated, correct?
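For context, the same RuntimeError can be reproduced in a few lines of plain PyTorch, independent of this repository, whenever two losses that share part of a graph are backpropagated separately (a toy sketch, not the project code):

    import torch

    # Two losses that share part of the autograd graph.
    x = torch.randn(4, 8, requires_grad=True)
    w = torch.randn(8, 8, requires_grad=True)
    h = torch.tanh(x @ w)            # shared sub-graph; tanh saves buffers for backward

    loss_a = h.sum()
    loss_b = (h * h).sum()

    loss_a.backward()                # frees the saved buffers of the shared graph
    try:
        loss_b.backward()            # second pass through the same freed graph
    except RuntimeError as err:
        print(err)                   # "Trying to backward through the graph a second time..."
    # Passing retain_graph=True to the first backward() avoids the error:
    #   loss_a.backward(retain_graph=True)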

ShnitzelKiller avatar May 20 '19 05:05 ShnitzelKiller

Modify dis_loss.backward(retain_graph=True) in models.py, line 256.
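In code, that amounts to something like this (a rough sketch of a backward(gen_loss, dis_loss) method as it appears in the traceback, not a verbatim copy of src/models.py):

    def backward(self, gen_loss=None, dis_loss=None):
        if dis_loss is not None:
            # Keep the graph's buffers alive so the later backward pass through
            # the same shared generator graph does not hit the RuntimeError above.
            dis_loss.backward(retain_graph=True)
        if gen_loss is not None:
            gen_loss.backward()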

superior1993 avatar May 20 '19 07:05 superior1993

I only get past this error if I modify gen_loss.backward(retain_graph=True), but this causes a CUDA out of memory error on a Titan Xp GPU. Do I need significantly more than 11GB to train the joint model?
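For anyone comparing numbers, peak allocation per iteration can be checked with a generic PyTorch snippet like the one below (nothing edge-connect specific; a CUDA device is assumed):

    import torch

    torch.cuda.reset_max_memory_allocated()
    # ... run one training iteration here ...
    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"peak GPU memory this iteration: {peak_gib:.2f} GiB")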

ShnitzelKiller avatar May 20 '19 19:05 ShnitzelKiller

Hi, I met the same problem, have you solved it? Thanks!

xiaoj45 avatar Aug 05 '20 08:08 xiaoj45

@ShnitzelKiller and @xiaoj45 I've tried to train the model on images at 512x512 and 1024x1024 resolution (CelebA-HQ dataset), but I always get CUDA out of memory, even with a batch size of 1.

I am using a Quadro P5000 with 16GB. Have you been able to train the model at resolutions higher than 256x256?

cpatrickalves avatar Aug 31 '20 21:08 cpatrickalves

Hi, I met the same problem, have you solved it? Thanks. WeChat: L811163727, QQ: 811163727

liuqi-liuqi avatar Jan 05 '21 11:01 liuqi-liuqi

Have you solved it? I got the same problem too. Looking forward to your reply~

1997Jessie avatar Sep 04 '22 03:09 1997Jessie

Have you solved this problem? I'm running into the same one.

EmanAbuelyazeed avatar Jul 30 '23 17:07 EmanAbuelyazeed