NTIRE2017
Segmentation fault when training baseline model
I got the message below when training the baseline model:
...
[Iter: 299.1k / lr: 5.00e-5] Time: 66.29 (Data: 61.42) Err: 3.234126
[Iter: 299.2k / lr: 5.00e-5] Time: 65.32 (Data: 60.11) Err: 3.496183
[Iter: 299.3k / lr: 5.00e-5] Time: 66.40 (Data: 61.23) Err: 3.399313
[Iter: 299.4k / lr: 5.00e-5] Time: 64.99 (Data: 60.01) Err: 3.379927
[Iter: 299.5k / lr: 5.00e-5] Time: 65.95 (Data: 60.72) Err: 3.503887
[Iter: 299.6k / lr: 5.00e-5] Time: 66.23 (Data: 61.05) Err: 3.338660
[Iter: 299.7k / lr: 5.00e-5] Time: 65.30 (Data: 59.97) Err: 3.448611
[Iter: 299.8k / lr: 5.00e-5] Time: 65.69 (Data: 60.95) Err: 3.330575
[Iter: 299.9k / lr: 5.00e-5] Time: 66.04 (Data: 61.20) Err: 3.350167
[Iter: 300.0k / lr: 5.00e-5] Time: 65.34 (Data: 59.59) Err: 3.413485
[Epoch 300 (iter/epoch: 1000)] Test time: 25.48
(scale 2) Average PSNR: 35.5833 (Highest ever: 35.5902 at epoch = 288)
Segmentation fault (core dumped)
I'm not sure whether the training process completed successfully or not. If it did, where is the trained model?
I'm not sure why it prints the segmentation fault message, but the experiment finished successfully. Trained models are saved under experiment/.
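If you want to sanity-check the saved model, a minimal sketch like the one below should work in the th interpreter. The file name under experiment/ depends on how the run was saved, so the path here is only a hypothetical example; adjust it to whatever .t7 file you find there.

require 'nn'
require 'cunn'
require 'cudnn'

-- Hypothetical path: replace with the actual .t7 file under experiment/
local modelPath = 'experiment/model/model_latest.t7'

local model = torch.load(modelPath)   -- load the saved network
model:evaluate()                      -- switch to inference mode
print(model)                          -- print the structure to confirm it loaded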
I'm trying the next step of training (item 1 in training.sh):
th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3
but I get an out-of-memory error as shown below. I've tried other chopSize values, such as:
th main.lua -scale 2 -nFeat 256 -nResBlock 36 -patchSize 96 -scaleRes 0.1 -skipBatch 3 -chopSize 16e0
but the situation remains the same. How small can chopSize be set? Or are there any other options I can try?
loading model and criterion...
Creating model from file: models/baseline.lua
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/onegin/torch/install/bin/luajit: /home/onegin/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 22 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 1 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-9315/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'resizeAs'
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: in function 'updateGradInput'
/home/onegin/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/onegin/torch/install/share/lua/5.1/nn/Module.lua:29>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:66: in function </home/onegin/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
[C]: in function 'xpcall'
...
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function </home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:78>
[C]: in function 'xpcall'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...egin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/onegin/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/onegin/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
./train.lua:89: in function 'train'
main.lua:33: in main chunk
[C]: in function 'dofile'
...egin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
Try nResBlock=32 instead of 36 if you're using a Titan X. We used 32 residual blocks when writing the paper, since 12GB of GPU memory is sometimes not enough for 36 residual blocks.
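For example, keeping the other options from your command and only reducing the depth would look like this (assuming everything else in training.sh stays the same):
th main.lua -scale 2 -nFeat 256 -nResBlock 32 -patchSize 96 -scaleRes 0.1 -skipBatch 3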