ContrastiveSeg
problem with resuming training from checkpoint
Hi, I am getting the following error while resuming training from a checkpoint on a single-GPU system. Training ran fine when started from iteration 0, but exited immediately after loading a checkpoint. The relevant excerpt that I modified in main.py for that purpose is also shown below. Is this a bug, or have I made a mistake somewhere?
(command used) sh scripts/cityscapes/ocrnet/run_r_101_d_8_ocrnet_train.sh resume x3
(modifications in main.py)
elif [ "$1"x == "resume"x ]; then
  ${PYTHON} -u main.py --configs ${CONFIGS} \
    --drop_last y \
    --phase train \
    --gathered n \
    --loss_balance y \
    --log_to_file n \
    --backbone ${BACKBONE} \
    --model_name ${MODEL_NAME} \
    --max_iters ${MAX_ITERS} \
    --data_dir ${DATA_DIR} \
    --loss_type ${LOSS_TYPE} \
    --resume_continue y \
    --resume ${CHECKPOINTS_ROOT}/checkpoints/bottle/${CHECKPOINTS_NAME}_latest.pth \
    --checkpoints_name ${CHECKPOINTS_NAME} \
    --distributed False \
    2>&1 | tee -a ${LOG_FILE}
    #--gpu 0 1 2 3
2022-11-16 11:30:47,097 INFO [module_runner.py, 87] Loading checkpoint from /workspace/data/defGen/graphics/Pre_CL_x3//..//checkpoints/bottle/spatial_ocrnet_deepbase_resnet101_dilated8_x3_latest.pth...
2022-11-16 11:30:47,283 INFO [trainer.py, 90] Params Group Method: None
2022-11-16 11:30:47,285 INFO [optim_scheduler.py, 96] Use lambda_poly policy with default power 0.9
2022-11-16 11:30:47,285 INFO [data_loader.py, 132] use the DefaultLoader for train...
2022-11-16 11:30:47,773 INFO [default_loader.py, 38] train 501
2022-11-16 11:30:47,774 INFO [data_loader.py, 164] use DefaultLoader for val ...
2022-11-16 11:30:47,873 INFO [default_loader.py, 38] val 126
2022-11-16 11:30:47,873 INFO [loss_manager.py, 66] use loss: fs_auxce_loss.
2022-11-16 11:30:47,874 INFO [loss_manager.py, 55] use DataParallelCriterion loss
2022-11-16 11:30:48,996 INFO [data_helper.py, 126] Input keys: ['img']
2022-11-16 11:30:48,996 INFO [data_helper.py, 127] Target keys: ['labelmap']
Traceback (most recent call last):
File "main.py", line 227, in