edge-connect
Runtime error when training joint model
When I attempt to train the joint model (mode 4), I get the following runtime error:
start training...
Training epoch: 1
Traceback (most recent call last):
  File "train.py", line 2, in <module>
    main(mode=1)
  File "/projects/grail/jamesn8/projects/inpainting/edge-connect/main.py", line 56, in main
    model.train()
  File "/projects/grail/jamesn8/projects/inpainting/edge-connect/src/edge_connect.py", line 179, in train
    self.edge_model.backward(e_gen_loss, e_dis_loss)
  File "/projects/grail/jamesn8/projects/inpainting/edge-connect/src/models.py", line 148, in backward
    gen_loss.backward()
  File "/local1/jamesn8/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/local1/jamesn8/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
I am able to train the edge model, inpaint model, and edge-inpaint model. And in the paper, your results were obtained using the joint model, with both networks being updated, correct?
Change the call to dis_loss.backward(retain_graph=True) in models.py, line 256.
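For anyone unsure what retain_graph changes, here is a minimal sketch that reproduces the same RuntimeError outside edge-connect. The tensors `w` and `x` and the helper `two_losses` are made up for illustration and are not taken from models.py; the only assumption is that two losses share part of one forward graph.

```python
import torch

# Toy stand-ins (hypothetical, not edge-connect code): one shared forward
# pass feeds two losses.
w = torch.ones(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

def two_losses():
    hidden = (w * x).exp()  # exp saves its output for the backward pass
    return hidden.sum(), (hidden ** 2).sum()

# Case 1: the second backward() fails because the first one freed the
# buffers of the shared part of the graph.
dis_loss, gen_loss = two_losses()
dis_loss.backward()
try:
    gen_loss.backward()
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True

# Case 2: retain_graph=True on the first call keeps the buffers alive,
# so the second call can traverse the shared nodes again.
w.grad = None
dis_loss, gen_loss = two_losses()
dis_loss.backward(retain_graph=True)
gen_loss.backward()
grad_with_retain = w.grad.clone()
```

The flag has to go on whichever backward() call runs first over the shared graph. The retained buffers stay in GPU memory until the graph is released, so memory use goes up.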
I only get past this error if I modify gen_loss.backward(retain_graph=True), but this causes a CUDA out of memory error on a Titan Xp GPU. Do I need significantly more than 11GB to train the joint model?
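One common GAN pattern that avoids retaining the graph is to compute the discriminator loss on a detached copy of the generator output. The sketch below shows the idea on toy layers. `gen`, `dis`, and `z` are hypothetical stand-ins, and nothing in this thread confirms that the joint mode in edge-connect can be restructured this way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gen = nn.Linear(4, 4)  # hypothetical stand-in for a generator
dis = nn.Linear(4, 1)  # hypothetical stand-in for a discriminator

z = torch.randn(2, 4)
fake = gen(z)

# The discriminator loss sees a detached copy, so its backward pass stops
# at `fake` and never touches (or frees) the generator's graph.
dis_loss = dis(fake.detach()).mean()
dis_loss.backward()
gen_grad_after_dis = gen.weight.grad  # still None: nothing flowed into gen

# The generator loss uses the attached tensor. Its graph is still intact,
# so a plain backward() works without retain_graph=True.
gen_loss = -dis(fake).mean()
gen_loss.backward()
```

The cost is a second forward pass through the discriminator, which is usually much cheaper than keeping every saved buffer alive between the two backward calls.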
Hi, I met the same problem, have you solved it? Thanks!
@ShnitzelKiller and @xiaoj45 I've tried to train the model on 512x512 and 1024x1024 images (CelebA-HQ dataset), but I always get a CUDA out-of-memory error, even with a batch size of 1.
I am using a Quadro P5000 with 16GB. Have you been able to train the model at resolutions higher than 256x256?
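As a rough guide to why higher resolutions run out of memory so quickly: in a convolutional network whose feature maps scale with the input, activation memory grows roughly in proportion to the number of pixels. The sketch below only computes that pixel ratio. It assumes the edge-connect networks behave this way and says nothing about their absolute memory use; `relative_activation_cost` is a made-up helper.

```python
def relative_activation_cost(side, base_side=256):
    """Pixel-count ratio of a side x side image to a base_side x base_side one."""
    return (side * side) / (base_side * base_side)

for side in (256, 512, 1024):
    ratio = relative_activation_cost(side)
    print(f"{side}x{side}: {ratio:.0f}x the pixels of 256x256")
```

Under that assumption, 512x512 needs about 4 times and 1024x1024 about 16 times the activation memory of 256x256 at the same batch size.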
Hi, I ran into the same problem. Have you solved it? Thanks. WeChat: L811163727. QQ: 811163727
Have you solved it? I got the same problem too. Looking forward to your reply.
Have you solved this problem? I ran into the same one.