Detectron.pytorch

Combine train and val?

Open QiaoranC opened this issue 7 years ago • 8 comments

Hi, just asking: would you consider combining training and validation? Like the standard setup: every epoch, run validation before checkpointing, and save the checkpoint only if the validation accuracy improves (or the validation loss drops). I know this needs some work, but it would be very convenient and make it easier to get started.

I am working on it now, but I guess I can only run validation after each ckpt is generated during training (based on your test_net.py: load the ckpt and validate it), i.e. model.train -> model.save -> model.eval. It would be more efficient if you could provide an example that does model.train -> model.eval -> model.save.
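The train -> eval -> save order asked for above could be sketched roughly as follows. This is only an illustration, not code from this repo: `train_one_epoch`, `evaluate` and `save_ckpt` are hypothetical hooks, and the repo's own training script counts steps rather than epochs.

```python
from typing import Callable, Optional


def train_with_validation(
    train_one_epoch: Callable[[int], None],
    evaluate: Callable[[], float],
    save_ckpt: Callable[[int, float], None],
    num_epochs: int,
    higher_is_better: bool = True,
) -> Optional[float]:
    """Run train -> eval -> save, keeping only checkpoints that improve.

    All three callables are hypothetical hooks, not functions from this repo.
    """
    best: Optional[float] = None
    for epoch in range(num_epochs):
        train_one_epoch(epoch)  # training pass (model in train mode)
        metric = evaluate()     # validation pass (model in eval mode)
        improved = (
            best is None
            or (metric > best if higher_is_better else metric < best)
        )
        if improved:
            best = metric
            save_ckpt(epoch, metric)  # save only when validation improved
    return best
```

For a loss-style metric, pass `higher_is_better=False` so that a lower value counts as an improvement.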

QiaoranC avatar May 05 '18 09:05 QiaoranC

Small reminder: test_net.py is missing the following:

if args.vis:
    cfg.VIS = True

Without it, '--vis' has no effect.

QiaoranC avatar May 05 '18 11:05 QiaoranC

About the train/validate update: I call run_inference from the test script directly in the training script right after save_ckpt, add some parser arguments from the test script (--output_dir, --range, --multi-gpu-testing), and set cfg.TEST.DATASETS = ('coco_2017_val',), so that run_inference gets all the inputs it needs.

           ......training....
            training_stats.LogIterStats(step, lr)

            if (step+1) % 21 == 0:
                save_ckpt(output_dir, args, step, train_size, maskRCNN, optimizer)

                args.test_net_file, _ = os.path.splitext(__file__)
                ckpt_dir = os.path.join(output_dir, 'ckpt')
                save_name = os.path.join(ckpt_dir, 'model_step{}.pth'.format(step))
                args.load_ckpt = save_name
                run_inference(
                    args,
                    ind_range=args.range,
                    multi_gpu_testing=args.multi_gpu_testing,
                    check_expected_results=True)

But when I run it, I get the following error in data_parallel.py:

Traceback (most recent call last):
  File "train_net_step.py", line 447, in <module>
    main()
  File "train_net_step.py", line 425, in main
    check_expected_results=True)
  File "/home/ubuntu/Detectron_master/lib/core/test_engine.py", line 128, in run_inference
    all_results = result_getter()
  File "/home/ubuntu/Detectron_master/lib/core/test_engine.py", line 108, in result_getter
    multi_gpu=multi_gpu_testing
  File "/home/ubuntu/Detectron_master/lib/core/test_engine.py", line 158, in test_net_on_dataset
    args, dataset_name, proposal_file, output_dir, gpu_id=gpu_id
  File "/home/ubuntu/Detectron_master/lib/core/test_engine.py", line 253, in test_net
    cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
  File "/home/ubuntu/Detectron_master/lib/core/test.py", line 70, in im_detect_all
    model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, box_proposals)
  File "/home/ubuntu/Detectron_master/lib/core/test.py", line 135, in im_detect_bbox
    return_dict = model(**inputs)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/Detectron_master/lib/nn/parallel/data_parallel.py", line 85, in forward
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
  File "/home/ubuntu/Detectron_master/lib/nn/parallel/data_parallel.py", line 85, in <listcomp>
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
IndexError: list index out of range

I used 2 GPUs and ran it with

export CUDA_VISIBLE_DEVICES=0,1
python3 train_net_step.py --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --use_tfboard --bs 4 --nw 8

Training and testing each run successfully on their own with 2 GPUs.

QiaoranC avatar May 05 '18 13:05 QiaoranC

@roytseng-tw Any ideas/suggestions, or anything I should look into?

QiaoranC avatar May 07 '18 02:05 QiaoranC

run_inference only supports 1 image (batch) per GPU. Does that match what you have done?
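One way to read this remark against the traceback is sketched below. It assumes the custom DataParallel forward indexes each keyword argument once per visible device; only the failing line itself comes from the traceback, while the surrounding loop, the function name and the example keys are illustrative guesses.

```python
def scatter_kwargs_per_device(kwargs, num_devices):
    """Split list-valued kwargs into one dict per device (illustrative only)."""
    per_device = []
    for i in range(num_devices):
        # Same expression as the failing line in the traceback.
        mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
        per_device.append(mini_kwargs)
    return per_device


# Inputs for a single image, wrapped in lists of length 1 (hypothetical keys).
inputs = {"data": ["image_0"], "im_info": ["info_0"]}

# With one device, index 0 exists for every value, so this works.
print(scatter_kwargs_per_device(inputs, num_devices=1))

# With two devices, index 1 does not exist and the lookup fails.
try:
    scatter_kwargs_per_device(inputs, num_devices=2)
except IndexError as err:
    print("IndexError:", err)
```

If this reading is right, a model wrapped for 2 GPUs during training would hit the error as soon as inference feeds it a single image.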

roytseng-tw avatar May 07 '18 03:05 roytseng-tw

I am not sure what you mean by run_inference supporting only 1 image per GPU,

because I ran test_net.py with 77 test images, and each GPU was assigned 39/38 images (based on the logs/prints). The results are fine.

When run_inference is first called, it goes to parent_func=test_net_on_dataset, which eventually calls subprocess_utils.process_in_parallel. In subprocess.py, you run the inference with cmd = ('python {binary} --range {start} {end} --cfg {cfg_file} --set {opts} ', '--output_dir {output_dir}'), which eventually leads to test_engine.py, and that supports a range of images as input.

def test_net(
        args,
        dataset_name,
        proposal_file,
        output_dir,
        ind_range=None,
        gpu_id=0):
    roidb, dataset, start_ind, end_ind, total_num_images = get_roidb_and_dataset(
        dataset_name, proposal_file, ind_range
    )
    model = initialize_model_from_cfg(args, gpu_id=gpu_id)
    num_images = len(roidb)
    num_classes = cfg.MODEL.NUM_CLASSES
    all_boxes, all_segms, all_keyps = empty_results(num_classes, num_images)
    timers = defaultdict(Timer)
    for i, entry in enumerate(roidb):
        ......
        im = cv2.imread(entry['image'])
        cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
        ......

My idea is to call run_inference (at the test_net.py level, not the subprocess level) after every ckpt is saved, with run_inference sharing the same GPU settings as training. Do you think that is possible?

I ran it with

export CUDA_VISIBLE_DEVICES=0,1
python3 test_net.py --multi-gpu-testing --dataset coco2017 --cfg configs/e2e_mask_rcnn_R-50-C4_1x.yaml --load_ckpt /home/ubuntu/Detection/Detectron_master/Outputs/e2e_mask_rcnn_R-50-C4_1x/May04-11-28-11_ubuntu16_step/ckpt/model_step19999.pth

and got

......
INFO subprocess.py: 130: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 132: stdout of subprocess 0 with range [1, 39]
INFO subprocess.py: 134: # ---------------------------------------------------------------------------- #
loading annotations into memory...
Done (t=0.07s)
creating index...
index created!
INFO test_engine.py: 286: im_detect: range [1, 39] of 77: 1/39 1.828s + 0.123s (eta: 0:01:14)
INFO test_engine.py: 286: im_detect: range [1, 39] of 77: 11/39 0.587s + 0.125s (eta: 0:00:19)
INFO test_engine.py: 286: im_detect: range [1, 39] of 77: 21/39 0.528s + 0.129s (eta: 0:00:11)
INFO test_engine.py: 286: im_detect: range [1, 39] of 77: 31/39 0.507s + 0.129s (eta: 0:00:05)
.......
ts/e2e_mask_rcnn_R-50-C4_1x/May04-11-28-11_ubuntu16_step/ckpt/model_step19999.pth
INFO test_engine.py: 286: im_detect: range [40, 77] of 77: 40/77 1.921s + 0.150s (eta: 0:01:16)
INFO test_engine.py: 286: im_detect: range [40, 77] of 77: 50/77 0.597s + 0.130s (eta: 0:00:19)
INFO test_engine.py: 286: im_detect: range [40, 77] of 77: 60/77 0.533s + 0.132s (eta: 0:00:11)
INFO test_engine.py: 286: im_detect: range [40, 77] of 77: 70/77 0.517s + 0.138s (eta: 0:00:04)
INFO test_engine.py: 321: Wrote detections to: 

INFO subprocess.py: 130: # ---------------------------------------------------------------------------- #
INFO subprocess.py: 132: stdout of subprocess 1 with range [40, 77]
INFO subprocess.py: 134: # ---------------------------------------------------------------------------- #
.......
INFO json_dataset_evaluator.py: 232: ~~~~ Summary metrics ~~~~
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.385
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.643
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.446
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.379
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.394
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.091
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.259
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.467
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.461
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.467
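The per-subprocess ranges in the log above (39 and 38 images out of 77) are consistent with a contiguous split of the image indices. A minimal sketch of such a split follows; it assumes the same convention as numpy.array_split, where earlier chunks receive the extra items, and `split_ranges` is a hypothetical helper rather than a function from this repo.

```python
def split_ranges(total, num_parts):
    """Split range(total) into contiguous half-open (start, end) chunks.

    Earlier chunks get the extra items when total is not divisible by
    num_parts, which is the same convention as numpy.array_split.
    """
    base, extra = divmod(total, num_parts)
    ranges = []
    start = 0
    for part in range(num_parts):
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


for gpu_id, (start, end) in enumerate(split_ranges(77, 2)):
    # Printed 1-based and inclusive, in the style of "range [1, 39] of 77".
    print("gpu {}: range [{}, {}] ({} images)".format(
        gpu_id, start + 1, end, end - start))
```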

QiaoranC avatar May 07 '18 04:05 QiaoranC

@roytseng-tw Sorry to bother you, but is there any interest in combining train and val? Or could you give me some suggestions on this? I think this project is great/smart at handling multiple GPUs, so it is a pity that train/val integration is missing (without it, it does not feel like a complete, workable model project).

QiaoranC avatar May 11 '18 07:05 QiaoranC

@QiaoranC I implemented this in my fork here. Basically the problem was that you need to change to eval mode before running run_inference.
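Switching to eval mode around the validation call, and back to train mode afterwards, could look roughly like the sketch below. It is not taken from the fork: `eval_mode` is a hypothetical helper, and `FakeModel` only stands in for torch.nn.Module (which exposes the same `training`, `train(mode)` and `eval()` interface) so that the example runs without PyTorch installed.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model):
    """Temporarily put a model in eval mode, then restore its previous mode.

    Works with any object exposing `training`, `eval()` and `train(mode)`,
    which is the interface torch.nn.Module provides.
    """
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        # Restore the earlier mode even if validation raised an exception.
        model.train(was_training)


class FakeModel:
    """Stand-in for torch.nn.Module so the sketch runs without PyTorch."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


model = FakeModel()
with eval_mode(model):
    # A call such as run_inference(...) would go here.
    print("during validation, training =", model.training)
print("after validation, training =", model.training)
```

Restoring the mode in a `finally` block keeps a failed validation run from leaving the model stuck in eval mode for the remaining training steps.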

nadavbh12 avatar Aug 14 '18 11:08 nadavbh12

@nadavbh12 I tried out your modifications as per your 'added logging of validation set' commit, but I'm getting this error:

Traceback (most recent call last):
  File "tools/train_net_step2.py", line 520, in <module>
    main()
  File "tools/train_net_step2.py", line 484, in main
    model=maskRCNN)
  File "/home/an1/PANet/lib/core/test_engine.py", line 130, in run_inference
    all_results = result_getter()
  File "/home/an1/PANet/lib/core/test_engine.py", line 109, in result_getter
    model=model
  File "/home/an1/PANet/lib/core/test_engine.py", line 161, in test_net_on_dataset
    args, dataset_name, proposal_file, output_dir, gpu_id=gpu_id, model=model
  File "/home/an1/PANet/lib/core/test_engine.py", line 260, in test_net
    cls_boxes_i, cls_segms_i, cls_keyps_i = im_detect_all(model, im, box_proposals, timers)
  File "/home/an1/PANet/lib/core/test.py", line 68, in im_detect_all
    model, im, box_proposals)
  File "/home/an1/PANet/lib/core/test.py", line 231, in im_detect_bbox_aug
    model, im, scale, max_size, box_proposals
  File "/home/an1/PANet/lib/core/test.py", line 324, in im_detect_bbox_scale
    model, im, target_scale, target_max_size, boxes=box_proposals
  File "/home/an1/PANet/lib/core/test.py", line 152, in im_detect_bbox
    return_dict = model(**inputs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/an1/PANet/lib/nn/parallel/data_parallel.py", line 82, in forward
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
  File "/home/an1/PANet/lib/nn/parallel/data_parallel.py", line 82, in <listcomp>
    mini_kwargs = dict([(k, v[i]) for k, v in kwargs.items()])
IndexError: list index out of range

Was there anything else to be added?

ashnair1 avatar Jun 09 '19 13:06 ashnair1