CenterMask
It's really difficult to get a good mAP when training on my own dataset
Problem Summary
First, I trained on the standard COCO dataset used in the paper and got a good mAP. Second, I prepared my own dataset in COCO format and named the files "train2014", "val2014", "instances_train2014.json" and "instances_val2014.json". Third, since my dataset contains only one category ('building'), I changed '_C.MODEL.ROI_BOX_HEAD.NUM_CLASSES', '_C.MODEL.FCOS.NUM_CLASSES' and '_C.MODEL.RETINANET.NUM_CLASSES' in defaults.py from 81 to 2. Then I trained on my own dataset, but got a bad mAP. It is worth mentioning that I have already visualized my dataset, and it performs well with Mask R-CNN. So I'd like to ask whether CenterMask can be used with other datasets, or whether I need to change anything else when training on my own dataset. [I noticed an FCOS issue that is similar to this problem: https://github.com/tianzhi0549/FCOS/issues/132, but it is also unresolved.]
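Since a custom COCO-format dataset is a common source of trouble, it may be worth sanity-checking the annotation JSON before training. The sketch below is a hypothetical helper (check_coco_annotations is not part of CenterMask or maskrcnn-benchmark) that flags annotations pointing at unknown images or categories, degenerate boxes, and polygons with fewer than 3 points. It also derives the NUM_CLASSES value, assuming the maskrcnn-benchmark convention that NUM_CLASSES includes background: COCO's 80 categories give the default of 81, and a single 'building' category gives 2.

```python
def check_coco_annotations(coco):
    """Run basic consistency checks on a COCO-format annotation dict.

    Returns (problems, num_classes): a list of human-readable problems (empty
    if none were found) and the NUM_CLASSES implied by the categories,
    i.e. foreground categories + 1 background class.
    """
    problems = []
    image_ids = {img["id"] for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}

    for ann in coco.get("annotations", []):
        ann_id = ann.get("id")
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann_id}: unknown image_id {ann['image_id']}")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann_id}: unknown category_id {ann['category_id']}")
        # COCO boxes are [x, y, width, height].
        _, _, w, h = ann["bbox"]
        if w <= 0 or h <= 0:
            problems.append(f"annotation {ann_id}: degenerate bbox {ann['bbox']}")
        seg = ann.get("segmentation")
        # Polygon masks are lists of flat [x1, y1, x2, y2, ...] lists; each polygon
        # needs at least 3 points (6 numbers). RLE masks (dicts) are not checked here.
        if isinstance(seg, list):
            if not seg:
                problems.append(f"annotation {ann_id}: empty segmentation")
            for poly in seg:
                if len(poly) < 6 or len(poly) % 2:
                    problems.append(f"annotation {ann_id}: invalid polygon of length {len(poly)}")

    num_classes = len(category_ids) + 1  # +1 for the background class
    return problems, num_classes
```

To use it, load instances_train2014.json with json.load and pass the resulting dict in; any reported problem is worth fixing before looking at the model itself.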
Environment
GPU: 4 titan xp (12GB)
Versions of relevant libraries:
[pip] numpy==1.16.0
[pip] torch==1.0.0.dev20190328
[pip] torchvision==0.2.2
[conda] pytorch-nightly 1.0.0.dev20190328 py3.7_cuda9.0.176_cudnn7.4.2_0
configs
loss
AP and AR
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.004
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.008
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.003
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.013
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.018
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.013
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.019
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.091
Maximum f-measures for classes: [0.04064810445178543]
Score thresholds for classes (used in demos for visualization purposes): [0.016668733209371567]
Loading and preparing results...
DONE (t=0.19s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type segm
DONE (t=13.13s).
Accumulating evaluation results...
DONE (t=0.23s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.003
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.005
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.003
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.002
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.013
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.003
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.011
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.013
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.011
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.011
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.086
Maximum f-measures for classes: [0.02170795306388527]
Score thresholds for classes (used in demos for visualization purposes): [0.33201679587364197]
2020-01-17 04:11:24,843 maskrcnn_benchmark.inference INFO: OrderedDict([('bbox', OrderedDict([('AP', 0.0010722595928040015), ('AP50', 0.003614397825546365), ('AP75', 0.0003211566688604318), ('APs', 0.0016450320259484557), ('APm', 0.0011716715215730757), ('APl', 0.007575648306486348)])), ('segm', OrderedDict([('AP', 0.0027830979629441433), ('AP50', 0.005436005785051814), ('AP75', 0.002610742186121677), ('APs', 0.00037027883278494873), ('APm', 0.002340194956846525), ('APl', 0.01333755135821637)]))])
@JerryIndus Did you visualize the result using demo.py?
I wonder whether the qualitative results are good.
If the visualized results are good, the problem comes from the evaluation step.
@youngwanLEE
I have already visualized the results using demo.py; you can see some of them in the figure below.
Some of the test results contain false detections and missed detections, but I don't think the AP and AR values should be this bad. That's so strange...
At the same time, I noticed that on some images the program raises an IndexError:
val_206 processing...
val_206 inference time: 0.16s
file 83
val_207 processing...
val_207 inference time: 0.15s
file 84
val_208 processing...
Traceback (most recent call last):
File "./demo/centermask_demo.py", line 168, in <module>
main()
File "./demo/centermask_demo.py", line 158, in main
composite = coco_demo.run_on_opencv_image(img)
File "/media/wt/DATA/centermask/CenterMask/demo/predictor.py", line 224, in run_on_opencv_image
predictions = self.compute_prediction(image)
File "/media/wt/DATA/centermask/CenterMask/demo/predictor.py", line 262, in compute_prediction
prediction = predictions[0]
IndexError: list index out of range
I debugged it on one specific image and found that when compute_prediction() runs (predictor.py, line 262), predictions is an empty list, which leads to the IndexError. I don't know how to resolve it, or whether this problem causes the bad AP.
These are the problems and findings I have so far. Looking forward to your reply. Thank you very much!
@JerryIndus The qualitative results look good.
I suspect the problem comes from a custom dataset setting or the evaluator.
The problem above occurs when there are no detection results.
You can simply handle it by wrapping the call in a try/except block.
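The try/except suggestion can be sketched as follows. This is a minimal illustration, not CenterMask's code: first_prediction_or_none and run_demo are hypothetical names, and predict stands in for whatever returns the per-image prediction list (compute_prediction() in demo/predictor.py). An empty list means nothing survived post-processing, so the image is skipped instead of crashing the whole demo.

```python
def first_prediction_or_none(predictions):
    """Return predictions[0], or None when the model produced no result.

    Mirrors the failing line `prediction = predictions[0]` in
    compute_prediction(), guarded with try/except IndexError.
    """
    try:
        return predictions[0]
    except IndexError:
        # Empty list: no detections survived post-processing for this image.
        return None


def run_demo(images, predict):
    """Apply a (hypothetical) `predict` callable to each named image,
    keeping only images that produced a prediction."""
    results = {}
    for name, image in images.items():
        prediction = first_prediction_or_none(predict(image))
        if prediction is None:
            print(f"{name}: no detections, skipped")
            continue
        results[name] = prediction
    return results
```

Catching only IndexError (rather than a bare except) keeps genuine bugs such as CUDA or shape errors visible.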
@youngwanLEE
I debugged again over the past two days and found that the test data is read and printed correctly, and there is no problem with the evaluator. However, the predictions obtained through demo.py -> predictor.py -> compute_prediction() differ from those obtained through inference.py -> compute_on_dataset(). This may explain why demo.py gives good visualized results while the AP is bad. But I really don't know what causes this, since both paths call the same functions with the same weights. So I'd like to ask you again. Looking forward to your reply. Thank you very much!
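One way to quantify how far the two prediction paths disagree is to match their boxes for the same image by IoU. The sketch below is a hypothetical diagnostic (box_iou and compare_predictions are illustrative names), assuming both sets of predictions have been converted to lists of (x1, y1, x2, y2) boxes in the same coordinate frame, for example the original image size. If the frames differ, a low match rate may only reflect different resizing rather than different model outputs.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def compare_predictions(boxes_demo, boxes_eval, iou_thresh=0.5):
    """Fraction of demo boxes that have a match (IoU >= iou_thresh) among eval boxes."""
    if not boxes_demo:
        # Both empty counts as full agreement; demo empty but eval not does not.
        return 1.0 if not boxes_eval else 0.0
    matched = sum(
        1 for d in boxes_demo
        if any(box_iou(d, e) >= iou_thresh for e in boxes_eval)
    )
    return matched / len(boxes_demo)
```

Running this per image and listing the images with the lowest agreement can narrow down where the two paths diverge.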
To illustrate in more detail:
By the way, the "list index out of range" error when running demo.py has been resolved with try/except. Thanks for the tip.
I ran into a very similar situation. I tried running CenterMask on my own dataset, following the maskrcnn-benchmark steps to change the number of classes and trim the pretrained weights file. Training works fine, but whenever I run inference or evaluation, the code crashes with a CUDA error on a particular image (the error was so uninformative that I couldn't debug it at all). I also tried wrapping it in a try/except clause to ignore the error, but after that the model seemed broken and simply didn't work.
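Trimming the pretrained file usually means dropping the layers whose output size depends on the number of classes, so a model built with the new NUM_CLASSES can load the rest. The sketch below works on a plain key-to-tensor dict; trim_class_specific_weights is a hypothetical helper, and the name patterns are assumptions based on maskrcnn-benchmark/FCOS naming as we understand it. Print the keys of your own checkpoint and adjust them before relying on this.

```python
# Hypothetical substrings of class-dependent layer names. Verify against the
# actual keys of your checkpoint before use.
CLASS_SPECIFIC_PATTERNS = (
    "cls_logits",                      # FCOS / RetinaNet classification conv
    "box.predictor.cls_score",         # ROI box head classifier
    "box.predictor.bbox_pred",         # ROI box head per-class regressor
    "mask.predictor.mask_fcn_logits",  # mask head per-class logits
)


def trim_class_specific_weights(state_dict, patterns=CLASS_SPECIFIC_PATTERNS):
    """Split a state dict into (kept, removed_keys) by dropping class-dependent entries.

    Keys containing any of `patterns` are removed so that a model built with a
    different NUM_CLASSES does not hit shape mismatches for those layers. The
    dropped layers keep the new model's initialization, assuming the checkpoint
    loader tolerates missing keys.
    """
    kept = {k: v for k, v in state_dict.items() if not any(p in k for p in patterns)}
    removed = sorted(k for k in state_dict if k not in kept)
    return kept, removed
```

With PyTorch, the input would come from torch.load on the checkpoint (assuming the weights sit under a "model" key, as in maskrcnn-benchmark checkpoints) and the trimmed dict would be written back with torch.save. Note that FCOS's box regressor predicts 4 values per location regardless of class count, so the patterns deliberately leave it alone.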
Also, everything worked fine without changing the number of classes, but I don't think that's the best approach, since it leaves unnecessary extra weights in the heads.
OK, it turned out that I had only changed the number of classes in ROI_BOX_HEAD and not in FCOS.NUM_CLASSES. I didn't realize these were separate settings until I rechecked the default config. Still, PyTorch's exception handling and error messages may be partly to blame: it took me 3 days to spot and fix what should have been an obvious matrix size inconsistency.
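Because the class count lives in several config entries, a quick consistency check before training can catch this kind of mismatch. The sketch below is a hypothetical helper (check_num_classes is not part of CenterMask) that walks nested dicts; a yacs CfgNode, which maskrcnn-benchmark uses for its config, subclasses dict, so it should accept one as well. The three keys are the ones named at the top of this thread, and the expected value assumes NUM_CLASSES counts background, as the default of 81 for COCO does.

```python
CLASS_COUNT_KEYS = (
    ("MODEL", "ROI_BOX_HEAD", "NUM_CLASSES"),
    ("MODEL", "FCOS", "NUM_CLASSES"),
    ("MODEL", "RETINANET", "NUM_CLASSES"),
)


def lookup(cfg, path):
    """Follow a tuple of keys through nested dicts."""
    node = cfg
    for key in path:
        node = node[key]
    return node


def check_num_classes(cfg, num_foreground):
    """Raise ValueError unless every NUM_CLASSES entry equals num_foreground + 1."""
    expected = num_foreground + 1  # background counts as a class (COCO: 80 + 1 = 81)
    mismatched = {
        ".".join(path): lookup(cfg, path)
        for path in CLASS_COUNT_KEYS
        if lookup(cfg, path) != expected
    }
    if mismatched:
        raise ValueError(f"expected NUM_CLASSES={expected}, got {mismatched}")
```

Calling check_num_classes(cfg, 1) right after the config is merged would have reported MODEL.FCOS.NUM_CLASSES still at 81 instead of failing later with an opaque size error.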
@JerryIndus I have the same problem as you. All APs are below 0.01. I suspect I did something wrong somewhere. Now that I've seen your post, I plan to retrain my model.
Hey all,
I'm facing this error when I try to train CenterMask from scratch. Any ideas?
My command line:
python -m torch.distributed.launch --nproc_per_node=1 tools/train_net.py --config-file "configs/centermask/centermask_V_19_eSE_FPN_lite_res600_ms_bs16_4x.yaml"
Everything runs fine up to this point:
loading annotations into memory...
Done (t=9.23s)
creating index...
index created!
loading annotations into memory...
Done (t=0.27s)
creating index...
index created!
2020-03-17 17:12:30,920 maskrcnn_benchmark.trainer INFO: Start training
After that I'm getting this error: IndexError: list index out of range
Thanks!
@Auth0rM0rgan In my opinion, you should train your model on multiple GPUs (at least 2). As for your error, would you mind providing more information? We can't tell what went wrong from a single line.
@TengFeiHan0 I have tried with 2 GPUs as well but still get the same error. Here is the log file (log.txt) generated by the model, after which I get this error:
Traceback (most recent call last):
File "tools/train_net.py", line 189, in <module>
main()
File "tools/train_net.py", line 182, in main
model = train(cfg, args.local_rank, args.distributed)
File "tools/train_net.py", line 88, in train
arguments,
File "/home/CenterMask/maskrcnn_benchmark/engine/trainer.py", line 71, in do_train
for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
return self._process_data(data)
File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
data.reraise()
File "/home/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataset.py", line 207, in __getitem__
return self.datasets[dataset_idx][sample_idx]
File "/home/CenterMask/maskrcnn_benchmark/data/datasets/coco.py", line 91, in __getitem__
target = target.clip_to_image(remove_empty=True)
File "/home/CenterMask/maskrcnn_benchmark/structures/bounding_box.py", line 224, in clip_to_image
return self[keep]
File "/home/CenterMask/maskrcnn_benchmark/structures/bounding_box.py", line 209, in __getitem__
bbox.add_field(k, v[item])
File "/home/CenterMask/maskrcnn_benchmark/structures/segmentation_mask.py", line 513, in __getitem__
selected_instances = self.instances.__getitem__(item)
File "/home/CenterMask/maskrcnn_benchmark/structures/segmentation_mask.py", line 422, in __getitem__
selected_polygons.append(self.polygons[i])
IndexError: list index out of range
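The last frames of this traceback fail inside segmentation_mask.py while clip_to_image applies its keep mask. One plausible mechanism, which we have not confirmed against the CenterMask source, is a version mismatch: since PyTorch 1.2, comparison operations return torch.bool tensors, and code written for older versions may iterate over such a mask as if it held integer indices, so True becomes index 1 and False index 0. The pure-Python sketch below reproduces that failure mode with hypothetical helpers; it is an illustration, not the library's code.

```python
def select_buggy(polygons, keep):
    """Treats each mask element as an index: True -> 1, False -> 0 (wrong)."""
    return [polygons[int(k)] for k in keep]


def select_fixed(polygons, keep):
    """Treats `keep` as a boolean mask and converts it to integer indices first."""
    if len(keep) != len(polygons):
        raise ValueError("mask length must match the number of polygons")
    indices = [i for i, k in enumerate(keep) if k]
    return [polygons[i] for i in indices]
```

With a single instance and keep == [True], select_buggy asks for polygons[1] and raises the same "list index out of range", while with several instances it silently returns the wrong polygons, which is arguably worse.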
@youngwanLEE , Do you have any idea why I'm getting this error when I want to train the model from scratch? Thanks
@Auth0rM0rgan I remember seeing the same error; please check this issue. My guess is that your PyTorch version differs from the one the author recommends. If so, please create a new virtual environment and reinstall maskrcnn-benchmark.
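Before retraining, it can help to confirm which versions the environment actually picks up. The sketch below is a hypothetical helper that parses version strings such as torch.__version__; the expected versions are taken from this thread (the PyTorch 1.0.0 nightly in the original report's environment and torchvision 0.2.1 from the conda command mentioned later), not from an official compatibility table.

```python
def parse_version(version):
    """Turn strings like '1.4.0', '1.0.0.dev20190328' or '1.4.0+cu92' into (major, minor, patch)."""
    core = version.split("+")[0]  # drop local build tags such as '+cu92'
    parts = []
    for piece in core.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def matches_expected(installed, expected_major_minor):
    """True if the installed version has the expected (major, minor)."""
    return parse_version(installed)[:2] == expected_major_minor
```

In the training environment you would pass torch.__version__ (for example matches_expected(torch.__version__, (1, 0))); torch.version.cuda also reports the CUDA version PyTorch was built against, which is worth comparing with the installed CUDA toolkit.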
@Auth0rM0rgan After checking your log, I found that your PyTorch version is 1.4.0 and your torchvision is also the latest release. Please follow these instructions to set up a new virtual environment. By the way, when running "conda install -c pytorch torchvision=0.2.1 cudatoolkit=9.0", make sure your installed CUDA version does not conflict with this cudatoolkit.