
Problem with loading saved models

venkateshmalgireddy opened this issue on Jan 21 '17 · 4 comments

Hi, when I use the continue_train=1 option, the models are loaded and transferred to the GPU, but I then get the following error inside the createRealFake function. It points to the Dropout layer. Output:

```
Dataset Size: 2500
loading previously trained netG...
loading previously trained netD...
nn.gModule
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> output]
  (1): cudnn.SpatialConvolution(6 -> 64, 4x4, 2,2, 1,1)
  (2): nn.LeakyReLU(0.2)
  (3): cudnn.SpatialConvolution(64 -> 128, 4x4, 2,2, 1,1)
  (4): cudnn.SpatialBatchNormalization
  (5): nn.LeakyReLU(0.2)
  (6): cudnn.SpatialConvolution(128 -> 256, 4x4, 2,2, 1,1)
  (7): cudnn.SpatialBatchNormalization
  (8): nn.LeakyReLU(0.2)
  (9): cudnn.SpatialConvolution(256 -> 512, 4x4, 1,1, 1,1)
  (10): cudnn.SpatialBatchNormalization
  (11): nn.LeakyReLU(0.2)
  (12): cudnn.SpatialConvolution(512 -> 1, 4x4, 1,1, 1,1)
  (13): cudnn.Sigmoid
}
transferring to gpu...
done
/home/ipcv/torch/install/bin/luajit: /home/ipcv/torch/install/share/lua/5.1/nn/Dropout.lua:26: Creating MTGP constants failed. at /tmp/luarocks_cutorch-scm-1-1283/cutorch/lib/THC/THCTensorRandom.cu:33
stack traceback:
        [C]: in function 'bernoulli'
        /home/ipcv/torch/install/share/lua/5.1/nn/Dropout.lua:26: in function 'func'
        /home/ipcv/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
        /home/ipcv/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
        train.lua:221: in function 'createRealFake'
        train.lua:325: in main chunk
        [C]: in function 'dofile'
        ...ipcv/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00406670
```
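For context, the resume invocation that triggers this path looks roughly like the sketch below; this assumes the environment-variable option style of the Torch pix2pix train.lua script, and the DATA_ROOT/name values are placeholders.

```bash
# Hedged sketch of a resume run (DATA_ROOT and name are placeholder values);
# continue_train=1 is what loads the previously trained netG and netD.
DATA_ROOT=./datasets/facades name=facades_pix2pix \
  which_direction=BtoA continue_train=1 th train.lua
```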

venkateshmalgireddy · Jan 21 '17

I can confirm that I occasionally get the same "Creating MTGP constants failed" error.

Quasimondo · Jan 23 '17

I got the same error. I have access to two GPUs, and training resumed (loading from the latest checkpoint) once I switched to the other GPU with the gpu=<gpu-no> option. I am not sure about the cause of the error, since the original GPU was working fine on other tasks and had enough memory available.
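A minimal sketch of that workaround, reusing the placeholder values from the sketch above; the gpu option is 1-based in Torch, so gpu=2 selects the second card.

```bash
# Resume the same run, but on the second GPU instead of the first.
DATA_ROOT=./datasets/facades name=facades_pix2pix \
  continue_train=1 gpu=2 th train.lua
```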

VasanthBalakrishnan · Feb 17 '17

Ah, I also have two GPUs on my machine, so there might be a pattern. Since I resorted to prefixing the th train.lua line with CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1 to force pix2pix to train on a single GPU, I have not seen that error anymore.
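A sketch of that variant, again with the placeholder values used above; CUDA_VISIBLE_DEVICES hides every other physical GPU from the process, so Torch only ever sees a single device.

```bash
# Pin the process to physical GPU 0 (use CUDA_VISIBLE_DEVICES=1 for the
# second card); the visible card appears as device 1 inside Torch.
CUDA_VISIBLE_DEVICES=0 DATA_ROOT=./datasets/facades \
  name=facades_pix2pix continue_train=1 th train.lua
```

Unlike the gpu= option, which selects a device from inside Torch, this restricts visibility at the driver level, which may be why it sidesteps the multi-GPU initialization failure.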

Quasimondo · Feb 18 '17

I can confirm that both of the above workarounds removed the error and let training continue in my case. I am not sure whether I should close this discussion, since the original problem is not solved.

venkateshmalgireddy · Feb 27 '17