NTIRE2017
Problem training on my own dataset
Hello, I ran into a problem when training the model on my own dataset. Below is a description of my issue; I hope you can give me some suggestions. Thank you very much!
```
loading model and criterion...
Loading pre-trained model from: ../demo/model/EDSR_x4.t7
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
/home/luomeilu/torch/install/bin/luajit: ...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/luomeilu/torch/install/share/lua/5.1/image/init.lua:367: /var/tmp/dataset/DIV2K/DIV2K_train_LR_bicubic/X4/0045x4.png: No such file or directory
stack traceback:
	[C]: in function 'error'
	/home/luomeilu/torch/install/share/lua/5.1/image/init.lua:367: in function 'load'
	./data/div2k.lua:122: in function 'get'
	./dataloader.lua:89: in function <./dataloader.lua:76>
	[C]: in function 'xpcall'
	...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
	...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:65: in function <...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:41>
	[C]: in function 'pcall'
	...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
	[string " local Queue = require 'threads.queue'..."]:15: in main chunk
stack traceback:
	[C]: in function 'error'
	...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
	./dataloader.lua:158: in function '(for generator)'
	./train.lua:69: in function 'train'
	main.lua:33: in main chunk
	[C]: in function 'dofile'
	...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00406670
```
> `/var/tmp/dataset/DIV2K/DIV2K_train_LR_bicubic/X4/0045x4.png: No such file or directory`
This line indicates that the training image was not found: either your dataset is not at the expected location, or you did not tell the code where it is. You should first define your own dataset parser, e.g., code/data/yourdataset.lua (analogous to the code/data/div2k.lua seen in the traceback), and update related code such as code/opts.lua. A minimal sketch follows below.
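For illustration, here is a minimal sketch of what such a parser could look like, loosely patterned on the DIV2K directory layout visible in the error above. The file name `mydataset.lua`, the `MYSET_` prefix, the `opt.datadir`/`opt.scale` fields, and the `get` interface are all assumptions, not the repository's actual API; check div2k.lua and dataloader.lua for the real conventions.

```lua
-- Hypothetical code/data/mydataset.lua: a minimal sketch, NOT the exact
-- interface required by dataloader.lua -- see div2k.lua for the real one.
local image = require 'image'
local paths = require 'paths'

local M = {}
M.__index = M

function M.new(opt, split)
    local self = setmetatable({}, M)
    self.opt = opt
    self.split = split            -- 'train' or 'val'
    self.scale = opt.scale        -- e.g., 4
    -- Assumed directory layout, mirroring DIV2K:
    --   <datadir>/MYSET_train_HR/0001.png
    --   <datadir>/MYSET_train_LR_bicubic/X4/0001x4.png
    self.dirHR = paths.concat(opt.datadir, 'MYSET_' .. split .. '_HR')
    self.dirLR = paths.concat(opt.datadir,
        'MYSET_' .. split .. '_LR_bicubic', 'X' .. self.scale)
    return self
end

-- Return the i-th (LR, HR) pair as float tensors in [0, 1]
function M:get(i)
    local name = string.format('%04d', i)
    local lr = image.load(
        paths.concat(self.dirLR, name .. 'x' .. self.scale .. '.png'),
        3, 'float')
    local hr = image.load(
        paths.concat(self.dirHR, name .. '.png'),
        3, 'float')
    return {input = lr, target = hr}
end

return M
```

The zero-padded four-digit naming (`0045x4.png`) is taken from the error message above; if your files are named differently, adjust `string.format` accordingly.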
Thank you very much! I have now put the dataset in the correct location. When I train the model with the patch size set to 256, the following error occurs:
```
loading model and criterion...
Loading pre-trained model from: ../demo/model/EDSR_x2.t7
Load pre-trained SRResnet and change upsampler
Changing upsample layers
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
THCudaCheck FAIL file=/home/luomeilu/torch/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/luomeilu/torch/install/bin/luajit: /home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 29 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 3 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: cuda runtime error (2) : out of memory at /home/luomeilu/torch/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
	[C]: in function 'resizeAs'
	...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: in function 'updateGradInput'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/luomeilu/torch/install/share/lua/5.1/nn/Module.lua:29>
	[C]: in function 'xpcall'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function <...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
	[C]: in function 'xpcall'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	.../luomeilu/torch/install/share/lua/5.1/nn/ConcatTable.lua:66: in function <.../luomeilu/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
	[C]: in function 'xpcall'
	...
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function <...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
	[C]: in function 'xpcall'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
	./train.lua:89: in function 'train'
	main.lua:33: in main chunk
	[C]: in function 'dofile'
	...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
	[C]: in function 'error'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
	...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
	./train.lua:89: in function 'train'
	main.lua:33: in main chunk
	[C]: in function 'dofile'
	...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00406670
```
So I wonder whether I can set a smaller patch size. If so, will it affect the training results?
By default, the number of feature channels is set to 256 and the patch size to 96: https://github.com/limbee/NTIRE2017/blob/db34606c2844e89317aac8728a2de562ef1f8aba/code/training.sh#L1-L2
This setting is suited for GPUs with 12GB of memory, so GPUs with less than 12GB will probably run out of memory (OOM). You can change the batch size or patch size via these options (see the sketch below): https://github.com/limbee/NTIRE2017/blob/db34606c2844e89317aac8728a2de562ef1f8aba/code/opts.lua#L49 https://github.com/limbee/NTIRE2017/blob/db34606c2844e89317aac8728a2de562ef1f8aba/code/opts.lua#L51
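The two linked lines presumably follow torch's standard `torch.CmdLine` pattern. The snippet below is a hedged sketch of that pattern only; the flag names (`-batchSize`, `-patchSize`) and defaults are assumptions inferred from this thread, not verified copies of opts.lua, so check the linked lines for the real spellings.

```lua
-- Hypothetical excerpt in the style of code/opts.lua; flag names and
-- defaults are assumptions -- check the linked lines for the real ones.
require 'torch'

local cmd = torch.CmdLine()
cmd:option('-batchSize', 16, 'mini-batch size')
cmd:option('-patchSize', 96, 'size of the training patches')
local opt = cmd:parse(arg or {})

-- Overriding them on the command line reduces memory use, e.g.:
--   th main.lua -batchSize 8 -patchSize 64
```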
Reducing the patch size may degrade the final performance, since each training sample then carries less spatial context.
Thank you for your suggestion! My training scale is set to 4; I will try changing the batch size or patch size via the options in opts.lua.