Detectron.pytorch
Combine train and val?
Hi, just asking: would you consider combining training and validation? The standard pattern is to run validation before each checkpoint and save the checkpoint only if the validation accuracy is higher (or the loss is lower). I know this needs some work, but it would be very convenient and would make it easy to get started.
I am working on it now. I think I can run validation after each checkpoint is generated during training (based on your test_net.py: load the checkpoint and validate it), i.e. model.train->model.save->model.eval. It would be more efficient if you could provide an example that does model.train->model.eval->model.save.
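To make the requested train->eval->save order concrete, here is a minimal sketch in plain Python. It is not code from this repository: `train_with_validation`, `train_step`, `validate` and `save_ckpt` are hypothetical names, and the validation score is assumed to be "higher is better" (for example box AP).

```python
def train_with_validation(train_step, validate, save_ckpt, num_steps, val_every):
    """Train for num_steps, validate every val_every steps, save only on improvement.

    All three callables are placeholders (assumptions, not repository APIs):
      train_step(step) runs one training iteration,
      validate(step)   returns a score where higher is better,
      save_ckpt(step)  writes a checkpoint for that step.
    """
    best_score = float("-inf")
    saved_steps = []
    for step in range(num_steps):
        train_step(step)
        if (step + 1) % val_every == 0:
            score = validate(step)      # model.eval happens before model.save
            if score > best_score:      # keep only checkpoints that improve
                best_score = score
                save_ckpt(step)
                saved_steps.append(step)
    return best_score, saved_steps


if __name__ == "__main__":
    fake_scores = iter([0.30, 0.28, 0.35])  # made-up validation scores
    best, saved = train_with_validation(
        train_step=lambda step: None,
        validate=lambda step: next(fake_scores),
        save_ckpt=lambda step: print("saving checkpoint at step", step),
        num_steps=6,
        val_every=2,
    )
    print(best, saved)
```

With this order a checkpoint is written only when validation improves, at the cost of running validation inside the training process.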
Small reminder: test_net.py is missing the following:
if args.vis:
    cfg.VIS = True
Without it, '--vis' currently has no effect.
About the train/validation update: I added the run_inference call from the test script directly into the training loop after save_ckpt, and added some of the test script's parser arguments (--output_dir, --range, --multi-gpu-testing).
I also added cfg.TEST.DATASETS = ('coco_2017_val',) so that run_inference has all the inputs it needs.
......training....
training_stats.LogIterStats(step, lr)
if (step+1) % 21 == 0:
    save_ckpt(output_dir, args, step, train_size, maskRCNN, optimizer)
    args.test_net_file, _ = os.path.splitext(__file__)
    ckpt_dir = os.path.join(output_dir, 'ckpt')
    save_name = os.path.join(ckpt_dir, 'model_step{}.pth'.format(step))
    args.load_ckpt = save_name
    run_inference(
        args,
        ind_range=args.range,
        multi_gpu_testing=args.multi_gpu_testing,
        check_expected_results=True)
But when I run it, I get the following error in data_parallel.py:
Traceback (most recent call last):
File "train_net_step.py", line 447, in <module>
main()
File "train_net_step.py", line 425, in main
check_expected_results=True)
File "/home/ubuntu/Detectron_master/lib/core/test_engine.py", line 128, in run_inference
all_results = result_getter()
File "/home/ubuntu/Detectron_master/lib/core/test_engine.py", line 108, in result_getter
multi_gpu=multi_gpu_testing
File "/home/ubuntu/Detectron_master/lib/core/test_engine.py", line 158, in test_net_on_dataset
args, dataset_name, proposal_file, output_dir, gpu_id=gpu_id
File "/home/ubuntu/Detectron_master/lib/core/test_engine.py", line 253, in test_net
cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
File "/home/ubuntu/Detectron_master/lib/core/test.py", line 70, in im_detect_all
model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, box_proposals)
File "/home/ubuntu/Detectron_master/lib/core/test.py", line 135, in im_detect_bbox
return_dict = model(**inputs)
File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/Detectron_master/lib/nn/parallel/data_parallel.py", line 85, in forward
mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
File "/home/ubuntu/Detectron_master/lib/nn/parallel/data_parallel.py", line 85, in <listcomp>
mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
IndexError: list index out of range
I use 2 GPUs and run it with
export CUDA_VISIBLE_DEVICES=0,1
python3 train_net_step.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --use_tfboard --bs 4 --nw 8
Training and testing each run successfully on their own with 2 GPUs.
@roytseng-tw Any ideas/suggestions, or anything I should look into?
run_inference only supports 1 image (batch) per GPU. Does that conform to what you have done?
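For readers following the traceback, the failing line indexes every keyword argument by device position (`v[i]`). The sketch below is a simplified model of that one line, not the library code itself: `split_kwargs_per_device` is a hypothetical name, the loop over device ids is an assumption about the surrounding code, and the input names are illustrative. It shows how a single image's inputs combined with two devices would produce the same IndexError.

```python
def split_kwargs_per_device(kwargs, device_ids):
    """Give device i the i-th entry of every keyword argument.

    Simplified model of the list comprehension shown in the traceback;
    iterating over device_ids is an assumption about the surrounding code.
    """
    return [
        dict((k, v[i]) for k, v in kwargs.items())
        for i, _ in enumerate(device_ids)
    ]


# Inference prepares inputs for ONE image, so each list holds a single entry.
one_image = {"data": ["image_0"], "im_info": ["info_0"]}

print(split_kwargs_per_device(one_image, device_ids=[0]))    # one device: fine

try:
    split_kwargs_per_device(one_image, device_ids=[0, 1])    # two devices, one entry
except IndexError as err:
    print("IndexError:", err)
```

If this model matches the real wrapper, a model wrapped for two GPUs during training cannot be fed single-image inference inputs without going through a different code path.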
I am not sure what you mean by run_inference supporting 1 image per GPU,
because I ran test_net.py with 77 test images, each GPU was assigned 39/38 images (according to the logs/prints), and the results are fine.
The first call to run_inference uses parent_func=test_net_on_dataset, which eventually calls subprocess_utils.process_in_parallel. In subprocess.py, the inference is run with cmd = ('python {binary} --range {start} {end} --cfg {cfg_file} --set {opts} ', '--output_dir {output_dir}'),
which eventually leads to test_net in test_engine.py, and that supports a range of images as input:
def test_net(
        args,
        dataset_name,
        proposal_file,
        output_dir,
        ind_range=None,
        gpu_id=0):
    roidb, dataset, start_ind, end_ind, total_num_images = get_roidb_and_dataset(
        dataset_name, proposal_file, ind_range
    )
    model = initialize_model_from_cfg(args, gpu_id=gpu_id)
    num_images = len(roidb)
    num_classes = cfg.MODEL.NUM_CLASSES
    all_boxes, all_segms, all_keyps = empty_results(num_classes, num_images)
    timers = defaultdict(Timer)
    for i, entry in enumerate(roidb):
        ......
        im = cv2.imread(entry['image'])
        cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
        ......
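The per-subprocess ranges described above can be illustrated with a small helper. This is only a sketch of contiguous range splitting: `split_ranges` is a hypothetical name, the ranges are 0-based and half-open, and the repository may compute and print its ranges differently.

```python
def split_ranges(total_images, num_parts):
    """Split range(total_images) into num_parts contiguous (start, end) ranges.

    Ranges are half-open, and any remainder goes to the first parts, so the
    sizes differ by at most one image.
    """
    base, extra = divmod(total_images, num_parts)
    ranges = []
    start = 0
    for part in range(num_parts):
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # 77 images on 2 GPUs gives one range of 39 images and one of 38.
    for gpu_id, (start, end) in enumerate(split_ranges(77, 2)):
        print("gpu", gpu_id, "range", (start, end), "images", end - start)
```

Each range would then be handed to one subprocess through the `--range {start} {end}` arguments shown in the command template.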
My idea is to call run_inference (at the test_net.py level, not the subprocess level) after every checkpoint is saved, with run_inference sharing the same GPU settings as training. Do you think that is possible?
I ran it with
export CUDA_VISIBLE_DEVICES=0,1
python3 test_net.py --multi-gpu-testing --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --load_ckpt /home/ubuntu/Detection/Detectron_master/Outputs/e2e_mask_rcnn_R-50-C4_1x/May04-11-28-11_ubuntu16_step/ckpt/model_step19999.pth
and got
......
INFO subprocess.py: 130: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 132: stdout of subprocess 0 with range [1, 39]
INFO subprocess.py: 134: # ---------------------------------------------------------------------------- #
loading annotations into memory...
Done (t=0.07s)
creating index...
index created!
INFO test_engine.py: 286: im_detect: range [1, 39] of 77: 1/39 1.828s + 0.123s (eta: 0:01:14)
INFO test_engine.py: 286: im_detect: range [1, 39] of 77: 11/39 0.587s + 0.125s (eta: 0:00:19)
INFO test_engine.py: 286: im_detect: range [1, 39] of 77: 21/39 0.528s + 0.129s (eta: 0:00:11)
INFO test_engine.py: 286: im_detect: range [1, 39] of 77: 31/39 0.507s + 0.129s (eta: 0:00:05)
.......
ts/e2e_mask_rcnn_R-50-C4_1x/May04-11-28-11_ubuntu16_step/ckpt/model_step19999.pth
INFO test_engine.py: 286: im_detect: range [40, 77] of 77: 40/77 1.921s + 0.150s (eta: 0:01:16)
INFO test_engine.py: 286: im_detect: range [40, 77] of 77: 50/77 0.597s + 0.130s (eta: 0:00:19)
INFO test_engine.py: 286: im_detect: range [40, 77] of 77: 60/77 0.533s + 0.132s (eta: 0:00:11)
INFO test_engine.py: 286: im_detect: range [40, 77] of 77: 70/77 0.517s + 0.138s (eta: 0:00:04)
INFO test_engine.py: 321: Wrote detections to:
INFO subprocess.py: 130: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 132: stdout of subprocess 1 with range [40, 77]
INFO subprocess.py: 134: # ---------------------------------------------------------------------------- #
.......
INFO json_dataset_evaluator.py: 232: ~~~~ Summary metrics ~~~~
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.385
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.643
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.446
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.379
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.394
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.091
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.259
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.467
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.461
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.467
@roytseng-tw Sorry to bother you, but is there any interest in combining train and val? Or could you leave me some suggestions on this? I think this project handles multiple GPUs really well, so it is a pity that train/val integration is missing (without it the project feels incomplete).
@QiaoranC I implemented this in my fork here. Basically, the problem was that you need to switch the model to eval mode before calling run_inference.
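One way to make that switch hard to forget is a small context manager that puts the model in eval mode and restores the previous mode afterwards. This is a sketch, not the code from the fork: `eval_mode` is a hypothetical helper that relies only on the standard `torch.nn.Module` interface (`training`, `eval()`, `train(mode)`), and the commented usage assumes a `run_inference` that accepts the live model.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model):
    """Temporarily put a module in eval mode, then restore its previous mode.

    Works with any object exposing the torch.nn.Module interface:
    a `training` attribute plus `eval()` and `train(mode)` methods.
    """
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        # Restore training mode even if inference raises an exception.
        model.train(was_training)


# Hypothetical use inside the training loop, right after saving a checkpoint:
#
#     with eval_mode(maskRCNN):
#         run_inference(args, ..., model=maskRCNN)  # assumes a model argument
```

Restoring the mode in `finally` means a failed validation run does not leave the model stuck in eval mode for the rest of training.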
@nadavbh12 I tried out your modifications from your 'added logging of validation set' commit, but I'm getting this error:
Traceback (most recent call last):
File "tools/train_net_step2.py", line 520, in <module>
main()
File "tools/train_net_step2.py", line 484, in main
model=maskRCNN)
File "/home/an1/PANet/lib/core/test_engine.py", line 130, in run_inference
all_results = result_getter()
File "/home/an1/PANet/lib/core/test_engine.py", line 109, in result_getter
model=model
File "/home/an1/PANet/lib/core/test_engine.py", line 161, in test_net_on_dataset
args, dataset_name, proposal_file, output_dir, gpu_id=gpu_id, model=model
File "/home/an1/PANet/lib/core/test_engine.py", line 260, in test_net
cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
File "/home/an1/PANet/lib/core/test.py", line 68, in im_detect_all
model, im, box_proposals)
File "/home/an1/PANet/lib/core/test.py", line 231, in im_detect_bbox_aug
model, im, scale, max_size, box_proposals
File "/home/an1/PANet/lib/core/test.py", line 324, in im_detect_bbox_scale
model, im, target_scale, target_max_size, boxes=box_proposals
File "/home/an1/PANet/lib/core/test.py", line 152, in im_detect_bbox
return_dict = model(**inputs)
File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/home/an1/PANet/lib/nn/parallel/data_parallel.py", line 82, in forward
mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
File "/home/an1/PANet/lib/nn/parallel/data_parallel.py", line 82, in <listcomp>
mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
IndexError: list index out of range
Was there anything else to be added?