CenterMask

It's really difficult to get a good mAP when training on my own dataset

Open JerryIndus opened this issue 5 years ago • 11 comments

Problem Summary

First, I trained on the standard COCO dataset used in the paper and got a good mAP. Second, I prepared my own dataset in COCO format and named the files "train2014", "val2014", "instances_train2014.json", and "instances_val2014.json". Third, since my dataset contains only one category, 'building', I changed '_C.MODEL.ROI_BOX_HEAD.NUM_CLASSES', '_C.MODEL.FCOS.NUM_CLASSES', and '_C.MODEL.RETINANET.NUM_CLASSES' in defaults.py from 81 to 2. I then trained on my own dataset but got a bad mAP. It is worth mentioning that I have already visualized my dataset, and it performs well with maskrcnn. So I would like to ask whether CenterMask can be used with other datasets, or whether I need to modify anything else when training on my own dataset. [I noticed a similar issue in FCOS, https://github.com/tianzhi0549/FCOS/issues/132, but it has not been resolved either.]
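One thing that may be worth ruling out for a single-category dataset is a mismatch between the category ids used by the annotations and the ids declared in the annotation file. The sketch below is not part of CenterMask: `load_coco` and `check_coco_categories` are hypothetical helpers, and the only assumption is the standard COCO layout with top-level `categories` and `annotations` lists.

```python
import json
from collections import Counter


def load_coco(path):
    """Load a COCO-style annotation file such as instances_train2014.json."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def check_coco_categories(coco):
    """Summarise category usage in a COCO-style dict (hypothetical helper).

    Returns (declared_ids, counts, unknown_ids), where unknown_ids are
    category ids used by annotations but missing from "categories".
    """
    declared_ids = sorted(c["id"] for c in coco.get("categories", []))
    counts = Counter(a["category_id"] for a in coco.get("annotations", []))
    unknown_ids = sorted(set(counts) - set(declared_ids))
    return declared_ids, dict(counts), unknown_ids


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real annotation file; with a real
    # dataset, use load_coco("path/to/instances_train2014.json") instead.
    example = {
        "categories": [{"id": 1, "name": "building"}],
        "annotations": [
            {"id": 10, "image_id": 1, "category_id": 1},
            {"id": 11, "image_id": 1, "category_id": 0},  # undeclared id
        ],
    }
    declared, counts, unknown = check_coco_categories(example)
    print("declared ids:", declared)
    print("annotations per id:", counts)
    print("undeclared ids used by annotations:", unknown)
```

If the list of undeclared ids is not empty, the annotation file is inconsistent on its own terms, independently of any model configuration.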

Environment

GPU: 4 × Titan Xp (12 GB)
Versions of relevant libraries:
[pip] numpy==1.16.0
[pip] torch==1.0.0.dev20190328
[pip] torchvision==0.2.2
[conda] pytorch-nightly 1.0.0.dev20190328 py3.7_cuda9.0.176_cudnn7.4.2_0
[conda] torchvision 0.2.2 pypi_0 pypi
Pillow (6.2.1)

configs

[image: screen capture]

loss

log_result

AP and AR

Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.004
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.008
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.003
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.013
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.018
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.013
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.019
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.091
Maximum f-measures for classes: [0.04064810445178543]
Score thresholds for classes (used in demos for visualization purposes): [0.016668733209371567]
Loading and preparing results...
DONE (t=0.19s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type segm
DONE (t=13.13s).
Accumulating evaluation results...
DONE (t=0.23s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.003
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.005
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.003
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.013
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.003
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.011
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.013
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.011
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.011
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.086
Maximum f-measures for classes: [0.02170795306388527]
Score thresholds for classes (used in demos for visualization purposes): [0.33201679587364197]
2020-01-17 04:11:24,843 maskrcnn_benchmark.inference INFO: OrderedDict([('bbox', OrderedDict([('AP', 0.0010722595928040015), ('AP50', 0.003614397825546365), ('AP75', 0.0003211566688604318), ('APs', 0.0016450320259484557), ('APm', 0.0011716715215730757), ('APl', 0.007575648306486348)])), ('segm', OrderedDict([('AP', 0.0027830979629441433), ('AP50', 0.005436005785051814), ('AP75', 0.002610742186121677), ('APs', 0.00037027883278494873), ('APm', 0.002340194956846525), ('APl', 0.01333755135821637)]))])

JerryIndus • Feb 12 '20 14:02

@JerryIndus Did you visualize the result using demo.py?

I wonder whether the qualitative results are good or not.

If the visualized results are good, the problem comes from the evaluation step.

youngwanLEE • Feb 13 '20 00:02

@youngwanLEE I have already visualized the results using demo.py, and you can see some of them in the figure below.

[image: result]

Some of the test results contain false detections and missed detections, but I don't think the AP and AR values should be this bad, which is strange. I also noticed that when the program reaches certain images, it raises an IndexError:

val_206 processing...
val_206 inference time: 0.16s
file 83
val_207 processing...
val_207 inference time: 0.15s
file 84
val_208 processing...
Traceback (most recent call last):
  File "./demo/centermask_demo.py", line 168, in <module>
    main()
  File "./demo/centermask_demo.py", line 158, in main
    composite = coco_demo.run_on_opencv_image(img)
  File "/media/wt/DATA/centermask/CenterMask/demo/predictor.py", line 224, in run_on_opencv_image
    predictions = self.compute_prediction(image)
  File "/media/wt/DATA/centermask/CenterMask/demo/predictor.py", line 262, in compute_prediction
    prediction = predictions[0]
IndexError: list index out of range

I debugged it with one specific image. When the compute_prediction() function runs [predictor.py(262)], predictions is empty, which leads to the IndexError. I don't know how to resolve it, or whether this problem causes the bad AP.

[image: 2]

These are the problems and findings I have so far. Looking forward to your reply. Thank you very much!

JerryIndus • Feb 16 '20 14:02

@JerryIndus The qualitative results look good.

I guess the problem results from the custom dataset settings or the evaluator.

The above error occurs when there are no detection results.

You can simply handle it by adding try ~ except.
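A minimal sketch of such a guard, assuming the failure is the `predictions[0]` access shown in the traceback. `safe_first_prediction` is a hypothetical helper, not a function in predictor.py, and the image names and outputs in the demo loop are stand-ins.

```python
def safe_first_prediction(predictions):
    """Return the first prediction, or None if the model returned nothing.

    Hypothetical helper: it mirrors the `predictions[0]` access from the
    traceback but tolerates an empty list.
    """
    try:
        return predictions[0]
    except IndexError:
        # No detections for this image; let the caller decide what to do.
        return None


if __name__ == "__main__":
    # Stand-in outputs for three images; the second has no detections.
    names = ["val_206", "val_207", "val_208"]
    fake_outputs = [["box_a"], [], ["box_b", "box_c"]]
    for name, predictions in zip(names, fake_outputs):
        prediction = safe_first_prediction(predictions)
        if prediction is None:
            print(f"{name}: no detections, skipping")
            continue
        print(f"{name}: {prediction}")
```

Catching only `IndexError` keeps other failures visible, and printing the skipped image names makes it easier to see later whether empty outputs are rare or systematic.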

youngwanLEE • Feb 18 '20 08:02

@youngwanLEE I debugged again over the past two days and found that the test data is read and printed correctly, and there is no problem with the evaluator. However, the predictions obtained via demo.py -> predictor.py -> compute_prediction() differ from the predictions obtained via inference.py -> compute_on_dataset(). This may explain why demo.py gives good visualized results while the AP is bad, but I really don't know what causes the difference. After all, both paths call the same functions and use the same weights. So I would still like to ask for your help. Looking forward to your reply. Thank you very much!

The screenshot illustrates this in more detail:

[image: screenshot 2020-02-27 20-38-49]

By the way, the "list index out of range" problem when running demo.py has been resolved by using try~except. Thanks for the tip.

JerryIndus • Feb 27 '20 12:02

I ran into a very similar situation. I tried running CenterMask on my own dataset, following the steps in maskrcnn-benchmark to modify the number of classes and trim the pretrained file. It trains just fine, but whenever I run inference or evaluation the code crashes on a certain image with CUDA errors (which are so uninformative that I couldn't debug them at all). I also tried a try/except clause to ignore the error, but after that the model seemed broken for some reason and simply didn't work.

Also, everything worked nicely without changing the number of classes, but I don't think that is the best way to do it, since it leaves unnecessary additional weights in the heads.

OK, it turned out that I had only changed the number of classes in ROI_BOX_HEAD and hadn't changed FCOS.NUM_CLASSES. I didn't realize these were different until I rechecked the default config. Still, PyTorch's exception handling and error messages may be partly to blame: it took me 3 days to spot and fix what should have been an obvious matrix size inconsistency.
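A mismatch like this can be caught before training starts. The following is only a sketch: it assumes the config can be read as a nested dict, uses the three key names mentioned in the first post of this thread, and `lookup` and `mismatched_num_classes` are hypothetical helpers rather than anything provided by maskrcnn-benchmark.

```python
# Key names taken from the first post in this thread.
NUM_CLASSES_KEYS = (
    "MODEL.ROI_BOX_HEAD.NUM_CLASSES",
    "MODEL.FCOS.NUM_CLASSES",
    "MODEL.RETINANET.NUM_CLASSES",
)


def lookup(cfg, dotted_key):
    """Follow a dotted key such as 'MODEL.FCOS.NUM_CLASSES' through nested dicts."""
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node


def mismatched_num_classes(cfg, expected):
    """Return {key: value} for every NUM_CLASSES entry that differs from expected."""
    mismatches = {}
    for key in NUM_CLASSES_KEYS:
        value = lookup(cfg, key)
        if value != expected:
            mismatches[key] = value
    return mismatches


if __name__ == "__main__":
    # Plain nested dict standing in for the real config object (an assumption).
    cfg = {
        "MODEL": {
            "ROI_BOX_HEAD": {"NUM_CLASSES": 2},
            "FCOS": {"NUM_CLASSES": 81},  # forgotten, still the default
            "RETINANET": {"NUM_CLASSES": 2},
        }
    }
    # One foreground class plus background, as in the first post.
    print(mismatched_num_classes(cfg, expected=2))
```

Running a check like this at startup turns a late, opaque CUDA error into an explicit message naming the key that was left at its default.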

hakillha • Mar 11 '20 03:03

@JerryIndus I have the same problem as you. All APs are smaller than 0.01. I was wondering whether I had done something wrong. Now that I have seen your post, I plan to retrain my model.

TengFeiHan0 • Mar 12 '20 08:03

Hey all,

I'm facing this error when trying to train CenterMask from scratch. Any idea?

My command line:

python -m torch.distributed.launch --nproc_per_node=1 tools/train_net.py --config-file "configs/centermask/centermask_V_19_eSE_FPN_lite_res600_ms_bs16_4x.yaml"

Everything goes fine until here:

loading annotations into memory...
Done (t=9.23s)
creating index...
index created!
loading annotations into memory...
Done (t=0.27s)
creating index...
index created!
2020-03-17 17:12:30,920 maskrcnn_benchmark.trainer INFO: Start training

After that I'm getting this error: IndexError: list index out of range

Thanks!

Auth0rM0rgan • Mar 17 '20 16:03

@Auth0rM0rgan In my opinion, you had better train your model on multiple GPUs (at least 2). As for your error, would you mind providing more information? We can't infer what went wrong from a single line.

TengFeiHan0 • Mar 18 '20 06:03

@TengFeiHan0 I have tried with 2 GPUs as well but still getting the same error. Here is the log file log.txt generated by the model and then I'm getting this error:

Traceback (most recent call last):
  File "tools/train_net.py", line 189, in <module>
    main()
  File "tools/train_net.py", line 182, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 88, in train
    arguments,
  File "/home/CenterMask/maskrcnn_benchmark/engine/trainer.py", line 71, in do_train
    for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
  File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataset.py", line 207, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/home/CenterMask/maskrcnn_benchmark/data/datasets/coco.py", line 91, in __getitem__
    target = target.clip_to_image(remove_empty=True)
  File "/home/CenterMask/maskrcnn_benchmark/structures/bounding_box.py", line 224, in clip_to_image
    return self[keep]
  File "/home/CenterMask/maskrcnn_benchmark/structures/bounding_box.py", line 209, in __getitem__
    bbox.add_field(k, v[item])
  File "/home/CenterMask/maskrcnn_benchmark/structures/segmentation_mask.py", line 513, in __getitem__
    selected_instances = self.instances.__getitem__(item)
  File "/home/CenterMask/maskrcnn_benchmark/structures/segmentation_mask.py", line 422, in __getitem__
    selected_polygons.append(self.polygons[i])
IndexError: list index out of range

@youngwanLEE, do you have any idea why I'm getting this error when I try to train the model from scratch? Thanks

Auth0rM0rgan • Mar 18 '20 12:03

@Auth0rM0rgan I remember seeing the same error; please check this issue. Anyway, I guess your PyTorch version is not the one the author suggested. If I'm right, please create a new virtual environment and install maskrcnn-benchmark again.

TengFeiHan0 • Mar 19 '20 02:03

@Auth0rM0rgan After checking your log, I found that your PyTorch version is 1.4.0 and your torchvision is also the latest. Please follow these instructions to set up a new virtual environment. By the way, when executing the line "conda install -c pytorch torchvision=0.2.1 cudatoolkit=9.0", please make sure that your current CUDA version does not conflict with this cudatoolkit.
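A quick way to notice this kind of version drift is to compare the installed version string with the one a known-good environment used. The sketch below is an illustration only: `version_tuple` and `is_newer_than` are hypothetical helpers, and the reference version is simply the one listed in the environment of the first post, not an official requirement.

```python
def version_tuple(version):
    """Parse the leading numeric components of a version string.

    '1.0.0.dev20190328' -> (1, 0, 0); '1.4.0' -> (1, 4, 0).
    Parsing stops at the first component that is not purely numeric.
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_newer_than(installed, reference):
    """True if the installed version is strictly newer than the reference."""
    return version_tuple(installed) > version_tuple(reference)


if __name__ == "__main__":
    # Version strings taken from this thread; in a real environment the
    # installed value would come from torch.__version__.
    installed = "1.4.0"
    reference = "1.0.0.dev20190328"
    if is_newer_than(installed, reference):
        print(f"torch {installed} is newer than {reference}; "
              "behaviour may differ from the reference environment")
```

A check like this does not say which version is correct; it only flags that the environment differs from the one in which the code was reported to work.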

TengFeiHan0 • Mar 19 '20 02:03