
problem with resuming training from checkpoint

Open · mailtohrishi opened this issue 2 years ago · 0 comments

Hi... I am getting the following error while resuming training from a checkpoint on a single-GPU system. Training runs fine when started from iteration 0, but it exits immediately after loading a checkpoint. The relevant excerpt that I modified in the run script (`run_r_101_d_8_ocrnet_train.sh`) for that purpose is shown below. Is this a bug, or is there a mistake somewhere on my end?

Command used:

```sh
sh scripts/cityscapes/ocrnet/run_r_101_d_8_ocrnet_train.sh resume x3
```

Modified `resume` branch in the run script:

```sh
elif [ "$1"x == "resume"x ]; then
  ${PYTHON} -u main.py --configs ${CONFIGS} \
                       --drop_last y \
                       --phase train \
                       --gathered n \
                       --loss_balance y \
                       --log_to_file n \
                       --backbone ${BACKBONE} \
                       --model_name ${MODEL_NAME} \
                       --max_iters ${MAX_ITERS} \
                       --data_dir ${DATA_DIR} \
                       --loss_type ${LOSS_TYPE} \
                       --resume_continue y \
                       --resume ${CHECKPOINTS_ROOT}/checkpoints/bottle/${CHECKPOINTS_NAME}_latest.pth \
                       --checkpoints_name ${CHECKPOINTS_NAME} \
                       --distributed False \
                       2>&1 | tee -a ${LOG_FILE}
                       # --gpu 0 1 2 3
```
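For reference, the checkpoint file itself can be inspected outside the training script. This is just a hypothetical sanity check with plain `torch.load`; the path and key names are guesses, not necessarily what `module_runner.py` actually saves:

```python
import torch

# Hypothetical sanity check of the resume checkpoint, outside the training
# script. Path and key names are guesses; the real schema is whatever
# module_runner.py writes into "<CHECKPOINTS_NAME>_latest.pth".
ckpt_path = "checkpoints/bottle/spatial_ocrnet_deepbase_resnet101_dilated8_x3_latest.pth"
ckpt = torch.load(ckpt_path, map_location="cpu")

print(type(ckpt))
if isinstance(ckpt, dict):
    # Typically something like 'state_dict', 'config_dict', 'performance', ...
    print(list(ckpt.keys()))
```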

```
2022-11-16 11:30:47,097 INFO [module_runner.py, 87] Loading checkpoint from /workspace/data/defGen/graphics/Pre_CL_x3//..//checkpoints/bottle/spatial_ocrnet_deepbase_resnet101_dilated8_x3_latest.pth...
2022-11-16 11:30:47,283 INFO [trainer.py, 90] Params Group Method: None
2022-11-16 11:30:47,285 INFO [optim_scheduler.py, 96] Use lambda_poly policy with default power 0.9
2022-11-16 11:30:47,285 INFO [data_loader.py, 132] use the DefaultLoader for train...
2022-11-16 11:30:47,773 INFO [default_loader.py, 38] train 501
2022-11-16 11:30:47,774 INFO [data_loader.py, 164] use DefaultLoader for val ...
2022-11-16 11:30:47,873 INFO [default_loader.py, 38] val 126
2022-11-16 11:30:47,873 INFO [loss_manager.py, 66] use loss: fs_auxce_loss.
2022-11-16 11:30:47,874 INFO [loss_manager.py, 55] use DataParallelCriterion loss
2022-11-16 11:30:48,996 INFO [data_helper.py, 126] Input keys: ['img']
2022-11-16 11:30:48,996 INFO [data_helper.py, 127] Target keys: ['labelmap']
Traceback (most recent call last):
  File "main.py", line 227, in <module>
    model.train()
  File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 390, in train
    self.__train()
  File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 196, in __train
    backward_loss = display_loss = self.pixel_loss(outputs, targets,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/defGen/External/ContrastiveSeg-main/lib/extensions/parallel/data_parallel.py", line 125, in forward
    return self.module(inputs[0], *targets[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 309, in forward
    seg_loss = self.ce_loss(seg_out, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 203, in forward
    target = self._scale_target(targets[0], (inputs.size(2), inputs.size(3)))
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```
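If I read the error right, `inputs` (the seg output passed to `_scale_target`) only has 3 dimensions at that point, so `inputs.size(3)` is out of range. Purely for illustration, here is a minimal plain-PyTorch sketch (the shape is made up, not what the repo actually produces) that raises the same IndexError:

```python
import torch

# Made-up 3-D tensor; a 3-D tensor only has dims 0..2 (or -3..-1),
# so asking for size(3) fails with exactly the message in the traceback.
inputs = torch.randn(2, 19, 128)  # 3-D instead of the expected 4-D (N, C, H, W)

try:
    h, w = inputs.size(2), inputs.size(3)
except IndexError as e:
    print(e)  # Dimension out of range (expected to be in range of [-3, 2], but got 3)
```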

mailtohrishi · Nov 16 '22 12:11