NTIRE2017
Problem training on my own dataset
Hello, I ran into a problem when training the model on my own dataset. Below is a description of my issue; I hope you can give me some suggestions. Thank you very much!
```
loading model and criterion...
Loading pre-trained model from: ../demo/model/EDSR_x4.t7
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
/home/luomeilu/torch/install/bin/luajit: ...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/luomeilu/torch/install/share/lua/5.1/image/init.lua:367: /var/tmp/dataset/DIV2K/DIV2K_train_LR_bicubic/X4/0045x4.png: No such file or directory
stack traceback:
	[C]: in function 'error'
	/home/luomeilu/torch/install/share/lua/5.1/image/init.lua:367: in function 'load'
	./data/div2k.lua:122: in function 'get'
	./dataloader.lua:89: in function <./dataloader.lua:76>
	[C]: in function 'xpcall'
	...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
	...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:65: in function <...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:41>
	[C]: in function 'pcall'
	...e/luomeilu/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
	[string " local Queue = require 'threads.queue'..."]:15: in main chunk
stack traceback:
	[C]: in function 'error'
	...luomeilu/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
	./dataloader.lua:158: in function '(for generator)'
	./train.lua:69: in function 'train'
	main.lua:33: in main chunk
	[C]: in function 'dofile'
	...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00406670
```
> `/var/tmp/dataset/DIV2K/DIV2K_train_LR_bicubic/X4/0045x4.png: No such file or directory`
This line indicates that the training image was not found: either your dataset is not at the expected location, or you did not tell the code where it is. You should first define your own dataset parser, e.g., code/data/yourdataset.lua (analogous to the code/data/div2k.lua seen in the traceback), and update related code such as code/opts.lua. A minimal sketch follows below.
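For illustration, here is a minimal sketch of what such a parser could look like, loosely patterned on the DIV2K directory layout visible in the error above. The file name `mydataset.lua`, the `MYSET_` prefix, the `opt.datadir`/`opt.scale` fields, and the `get` interface are all assumptions, not the repository's actual API; check div2k.lua and dataloader.lua for the real conventions.

```lua
-- Hypothetical code/data/mydataset.lua: a minimal sketch, NOT the exact
-- interface required by dataloader.lua -- see div2k.lua for the real one.
local image = require 'image'
local paths = require 'paths'

local M = {}
M.__index = M

function M.new(opt, split)
    local self = setmetatable({}, M)
    self.opt = opt
    self.split = split            -- 'train' or 'val'
    self.scale = opt.scale        -- e.g., 4
    -- Assumed directory layout, mirroring DIV2K:
    --   <datadir>/MYSET_train_HR/0001.png
    --   <datadir>/MYSET_train_LR_bicubic/X4/0001x4.png
    self.dirHR = paths.concat(opt.datadir, 'MYSET_' .. split .. '_HR')
    self.dirLR = paths.concat(opt.datadir,
        'MYSET_' .. split .. '_LR_bicubic', 'X' .. self.scale)
    return self
end

-- Return the i-th (LR, HR) pair as float tensors in [0, 1]
function M:get(i)
    local name = string.format('%04d', i)
    local lr = image.load(
        paths.concat(self.dirLR, name .. 'x' .. self.scale .. '.png'),
        3, 'float')
    local hr = image.load(
        paths.concat(self.dirHR, name .. '.png'),
        3, 'float')
    return {input = lr, target = hr}
end

return M
```

The zero-padded four-digit naming (`0045x4.png`) is taken from the error message above; if your files are named differently, adjust `string.format` accordingly.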
Thank you very much! I have now put the dataset in the correct location. When I train the model with the patch size set to 256, the following error occurs:
```
loading model and criterion...
Loading pre-trained model from: ../demo/model/EDSR_x2.t7
Load pre-trained SRResnet and change upsampler
Changing upsample layers
Creating data loader...
loading data...
Initializing data loader for train set...
Initializing data loader for val set...
Train start
THCudaCheck FAIL file=/home/luomeilu/torch/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/luomeilu/torch/install/bin/luajit: /home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 29 module of nn.Sequential:
In 1 module of nn.Sequential:
In 1 module of nn.ConcatTable:
In 3 module of nn.Sequential:
...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: cuda runtime error (2) : out of memory at /home/luomeilu/torch/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
	[C]: in function 'resizeAs'
	...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:216: in function 'updateGradInput'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/luomeilu/torch/install/share/lua/5.1/nn/Module.lua:29>
	[C]: in function 'xpcall'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function <...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
	[C]: in function 'xpcall'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	.../luomeilu/torch/install/share/lua/5.1/nn/ConcatTable.lua:66: in function <.../luomeilu/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
	[C]: in function 'xpcall'
	...
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function <...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:78>
	[C]: in function 'xpcall'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
	./train.lua:89: in function 'train'
	main.lua:33: in main chunk
	[C]: in function 'dofile'
	...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
	[C]: in function 'error'
	/home/luomeilu/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
	...e/luomeilu/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
	./train.lua:89: in function 'train'
	main.lua:33: in main chunk
	[C]: in function 'dofile'
	...eilu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00406670
```
So I wonder whether I can set a smaller patch size. If so, will it affect the training results?
By default, the number of feature channels is set to 256 and the patch size to 96: https://github.com/limbee/NTIRE2017/blob/db34606c2844e89317aac8728a2de562ef1f8aba/code/training.sh#L1-L2
This setting is suited for GPUs with 12GB of memory, so GPUs with less than 12GB will probably run out of memory (OOM). You can change the batch size or patch size via these options (see the sketch below): https://github.com/limbee/NTIRE2017/blob/db34606c2844e89317aac8728a2de562ef1f8aba/code/opts.lua#L49 https://github.com/limbee/NTIRE2017/blob/db34606c2844e89317aac8728a2de562ef1f8aba/code/opts.lua#L51
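The two linked lines presumably follow torch's standard `torch.CmdLine` pattern. The snippet below is a hedged sketch of that pattern only; the flag names (`-batchSize`, `-patchSize`) and defaults are assumptions inferred from this thread, not verified copies of opts.lua, so check the linked lines for the real spellings.

```lua
-- Hypothetical excerpt in the style of code/opts.lua; flag names and
-- defaults are assumptions -- check the linked lines for the real ones.
require 'torch'

local cmd = torch.CmdLine()
cmd:option('-batchSize', 16, 'mini-batch size')
cmd:option('-patchSize', 96, 'size of the training patches')
local opt = cmd:parse(arg or {})

-- Overriding them on the command line reduces memory use, e.g.:
--   th main.lua -batchSize 8 -patchSize 64
```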
Reducing the patch size may degrade the final performance, since each training sample then carries less spatial context.
Thank you for your suggestion! My training scale is set to 4; I will try changing the batch size or patch size via the options in opts.lua.