
Hi imelekhov, I have run into a training problem: input_ and gradOutput_ shapes do not match

Open · TheBloodthirster opened this issue 5 years ago • 0 comments

When I try to train the network with th main.lua -weights <path/to/downloaded_weights/model_snapshot_7scenes.t7> -dataset_src_path </path/to/7Scenes> (without -do_evaluation), I run into a problem.

Here is the error:

    {
      val_batch_size : 40
      beta1 : 0.9
      do_evaluation : false
      use_dropout : false
      dataset_src_path : "/data/code/camera-relocalisation/7Scenes"
      gamma : 0.001
      image_size : 224
      epoch_number : 1
      weights : "/data/code/camera-relocalisation/downloaded_weights/model_snapshot_7scenes.t7"
      train_batch_size : 64
      validation_dataset_size : 10402
      max_epoch : 250
      dataset_name : "7-Scenes"
      nGPU : 1
      momentum : 0.9
      logs : "./logs/7scenes.log"
      beta : 1
      manualSeed : 333
      learning_rate : 0.1
      beta2 : 0.999
      model_zoo_path : "./pretrained_models"
      precomputed_data_path : "./data"
      results_filename : "./results/7scenes_res.bin"
      snapshot_dir : "./snapshots"
      GPU : 1
      weight_decay : 1e-05
      power : 0.5
      training_dataset_size : 39999
    }
    this is a test for load_training_data
    ==> Training GT labels have been loaded successfully
    ==> Validation GT labels have been loaded successfully
    ==> loading model from pretained weights from file: /data/code/camera-relocalisation/downloaded_weights/model_snapshot_7scenes.t7
    ==> configuring optimizer
    ==> number of batches: 624
    ==> learning rate: 0.1
    ==> Number of parameters in the model: 22350215
    ==> online epoch # 1 [batchSize = 64]
    ==> time taken to randomize input training data: 2.7921199798584 ms
    /torch/install/bin/luajit: /torch/install/share/lua/5.1/nn/Container.lua:67: ...........] ETA: 0ms | Step: 0ms
    In 1 module of nn.Sequential:
    In 1 module of nn.ParallelTable:
    In 2 module of nn.Sequential:
    /torch/install/share/lua/5.1/nn/THNN.lua:110: input_ and gradOutput_ shapes do not match: input_ [2 x 64 x 112 x 112], gradOutput_ [64 x 64 x 112 x 112] at /torch/extra/cunn/lib/THCUNN/generic/BatchNormalization.cu:74
    stack traceback:
        [C]: in function 'v'
        /torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'BatchNormalization_backward'
        /torch/install/share/lua/5.1/nn/BatchNormalization.lua:154: in function </torch/install/share/lua/5.1/nn/BatchNormalization.lua:140>
        [C]: in function 'xpcall'
        /torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /torch/install/share/lua/5.1/nn/Sequential.lua:70: in function </torch/install/share/lua/5.1/nn/Sequential.lua:63>
        [C]: in function 'xpcall'
        /torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /torch/install/share/lua/5.1/nn/ParallelTable.lua:27: in function 'accGradParameters'
        /torch/install/share/lua/5.1/nn/Module.lua:32: in function </torch/install/share/lua/5.1/nn/Module.lua:29>
        [C]: in function 'xpcall'
        /torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
        /data/code/camera-relocalisation/cnn_part/train.lua:68: in function 'opfunc'
        /torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
        /data/code/camera-relocalisation/cnn_part/train.lua:72: in function 'train'
        main.lua:97: in main chunk
        [C]: in function 'dofile'
        /torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

    WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
    stack traceback:
        [C]: in function 'error'
        /torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        /torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
        /data/code/camera-relocalisation/cnn_part/train.lua:68: in function 'opfunc'
        /torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
        /data/code/camera-relocalisation/cnn_part/train.lua:72: in function 'train'
        main.lua:97: in main chunk
        [C]: in function 'dofile'
        /torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

and I think the problem is located here:

    for t,v in ipairs(indices) do
        xlua.progress(t, #indices)

        local mini_batch_info = make_training_minibatch(v)
        local mini_batch_data = mini_batch_info.data:cuda()
        local orientation_gt = mini_batch_info.quaternion_labels:cuda()
        local translation_gt = mini_batch_info.translation_labels:cuda()

        cutorch.synchronize()
        collectgarbage()

        feval = function(x)
            if x ~= parameters then parameters:copy(x) end
            model:zeroGradParameters()

            local outputs = model:forward({mini_batch_data[{{}, 1, {}, {}, {}}], mini_batch_data[{{}, 2, {}, {}, {}}]})
            local err = criterion:forward(outputs, {translation_gt, orientation_gt})
            meter_train_t:add(criterion.weights[1] * criterion.criterions[1].output)
            meter_train_q:add(criterion.weights[2] * criterion.criterions[2].output)

            local df_do = criterion:backward(outputs, {translation_gt, orientation_gt})
            model:backward(mini_batch_data, df_do)

            return err, gradParameters
        end
        optim.adam(feval, parameters, optimState)

In particular, when I comment out optim.adam(feval, parameters, optimState), the training runs without this error.
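
For reference, my guess (not tested) is that model:backward might need to receive the same input table that was passed to model:forward. Below is a minimal sketch of what I mean, reusing the variables from the snippet above; the input_pair name is just something I made up for illustration:

    -- guess only: build the pair once and pass the same table to both
    -- forward() and backward(), since nn modules expect backward() to be
    -- called with the same input as the preceding forward()
    local input_pair = {mini_batch_data[{{}, 1, {}, {}, {}}],
                        mini_batch_data[{{}, 2, {}, {}, {}}]}
    local outputs = model:forward(input_pair)
    local err = criterion:forward(outputs, {translation_gt, orientation_gt})
    local df_do = criterion:backward(outputs, {translation_gt, orientation_gt})
    model:backward(input_pair, df_do)  -- instead of model:backward(mini_batch_data, df_do)

But I am not sure whether this is actually related to the error.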

I don't know what's going on. Could you please help me? Thanks in advance!

TheBloodthirster · Aug 04 '19 10:08