[DBNET]-Bad performance on long text detection

Open fatfishZhao opened this issue 2 years ago • 13 comments

Hi, thanks for your great work. I'm using the R50-DCN DBNet for Chinese text detection. I fine-tuned the pretrained model on about 10k images. At test time, long text lines are not detected; some examples are attached below. Can you explain this behaviour? How can I fix it? (four example images)

fatfishZhao avatar Jul 19 '21 11:07 fatfishZhao

Same issue. What is wrong with the original config?

0xCreo avatar Oct 09 '21 01:10 0xCreo

Solved. There are 2 problems.

The first is the shrink and unclip method from the original paper, which is not suitable for long text (the unclipped box is thinner than the ground truth), so I changed both methods based on my own understanding.
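To see why the unclipped box comes out thinner on long text, here is a rough arithmetic sketch. It assumes axis-aligned rectangles, the DB paper's shrink offset D = A(1 - r^2)/L with r = 0.4, and its unclip offset D' = A' * r' / L' with r' = 1.5. The function name `db_shrink_unclip` is made up for illustration and is not part of mmocr.

```python
def db_shrink_unclip(w, h, shrink_ratio=0.4, unclip_ratio=1.5):
    """Height of a w x h box after DB-style shrink followed by unclip.

    Simplification: the box is an axis-aligned rectangle, so an offset
    of d changes each side length by 2 * d.
    """
    # Shrink offset from the DB paper: D = A * (1 - r^2) / L
    area, perimeter = w * h, 2 * (w + h)
    d = area * (1 - shrink_ratio ** 2) / perimeter
    shrunk_w, shrunk_h = w - 2 * d, h - 2 * d

    # Unclip offset from the DB paper: D' = A' * r' / L'
    d_out = (shrunk_w * shrunk_h) * unclip_ratio / (2 * (shrunk_w + shrunk_h))
    return shrunk_h + 2 * d_out


for w, h in [(100, 100), (200, 50), (1000, 20)]:
    recovered = db_shrink_unclip(w, h)
    print(f"{w}x{h}: recovered height is {recovered / h:.2f} of the original")
```

Under these assumptions a square box gets back roughly its original height, while a 1000x20 box gets back less than half of it, which matches the "thinner than ground truth" observation.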

The second is a bug at https://github.com/open-mmlab/mmocr/blob/80741e147913d84bd148fed87320999ecdd89139/mmocr/models/textdet/postprocess/wrapper.py#L215: the value 0.01 is too large for long text. With this setting, the program gets two points rather than four. Setting the value to 0.002 works much better.
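A simplified geometric model of why 0.01 can collapse a long box to two points: assume the contour is a plain w x h rectangle, that epsilon is the ratio times the contour perimeter, and that a Douglas-Peucker-style simplification keeps a corner only if it lies farther than epsilon from the diagonal chord. The helper `corners_survive` is hypothetical, and real `cv2.approxPolyDP` output on pixel contours may differ.

```python
import math


def corners_survive(w, h, epsilon_ratio):
    """Rough check: do the off-diagonal corners of a w x h rectangle
    survive polygon simplification with epsilon = ratio * perimeter?"""
    epsilon = epsilon_ratio * 2 * (w + h)
    # Distance from a corner to the rectangle's diagonal
    corner_to_diagonal = w * h / math.hypot(w, h)
    return corner_to_diagonal > epsilon


for ratio in (0.01, 0.002):
    print(ratio, "long 1000x15:", corners_survive(1000, 15, ratio),
          "| short 100x20:", corners_survive(100, 20, ratio))
```

In this model the long box loses its corners at 0.01 but keeps them at 0.002, while the short box is fine either way. The contour being simplified is the shrunk text region, so it is even thinner than the ground-truth box.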

fatfishZhao avatar Nov 01 '21 09:11 fatfishZhao

Thanks for sharing your solution. Does the final performance look all good?

gaotongxiao avatar Nov 01 '21 09:11 gaotongxiao

Yes, much better.

fatfishZhao avatar Nov 01 '21 10:11 fatfishZhao

Actually, I'm confused by PPOCR's results. They also use DBNet, and the demo images in their repo look pretty good on long text. The only difference I found between the mmocr and PPOCR code is that PPOCR uses a bigger unclip ratio.

fatfishZhao avatar Nov 01 '21 10:11 fatfishZhao

It's probably because PPOCR uses much more private training data...

gaotongxiao avatar Nov 01 '21 11:11 gaotongxiao

I wrote a blog post about this issue; if anyone is interested, check it out: link

Sanster avatar Nov 03 '21 13:11 Sanster

Here is my replacement for the shrink and unclip methods: use A and L to calculate the font size, and set the shrink/unclip distance to a fixed ratio of the font size. (formula image) Note: r is the same when shrinking and unclipping.
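The formula image did not survive in this copy, so the following is only one possible reading and not necessarily the formula posted above. It assumes the text box is roughly a rectangle, takes the font size to be its short side solved from A and L, and sets the offset distance to r times that. The names `estimate_font_size` and `offset_distance` are hypothetical.

```python
import math


def estimate_font_size(area, perimeter):
    """Short side of a rectangle with the given area A and perimeter L.

    For a rectangle, w + h = L / 2 and w * h = A, so h is the smaller
    root of x^2 - (L / 2) * x + A = 0.
    """
    half = perimeter / 2
    # Clamp at zero so non-rectangular polygons do not raise a math error
    disc = max(half * half - 4 * area, 0.0)
    return (half - math.sqrt(disc)) / 2


def offset_distance(area, perimeter, r):
    """Assumed rule: offset distance is a fixed ratio r of the font size."""
    return r * estimate_font_size(area, perimeter)


# A 1000x20 text line: A = 20000, L = 2040
print(estimate_font_size(20000, 2040))
print(offset_distance(20000, 2040, r=0.2))
```

With this reading the offset depends only on the text height, not on the line length, which is the property the comment describes.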

fatfishZhao avatar Nov 05 '21 04:11 fatfishZhao

I wrote a simple test for your formula, but it doesn't work well across different h/w ratios. Did I make a mistake here? (image)

0xCreo avatar Nov 05 '21 07:11 0xCreo

I think so. It works well in my project.

fatfishZhao avatar Nov 08 '21 04:11 fatfishZhao

Hello, thank you for sharing your method. I tried it in my project; it works well for long text, but I found it unclips too much on short text. Do you have the same problem? How did you fix it?

viviayi avatar Apr 11 '22 07:04 viviayi

Hi, what is your r setting? You can try an r smaller than the paper's 0.4, such as 0.2.

fatfishZhao avatar May 07 '22 03:05 fatfishZhao

@Sanster @fatfishZhao @viviayi @gaotongxiao I'm also seeing bad performance on long text detection. (three example images)

It is my config:

model = dict(
    type='DBNet',
    backbone=dict(
        type='mmdet.ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=-1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=False,
        style='pytorch',
        dcn=dict(type='DCNv2', deform_groups=1, fallback_on_stride=False),
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50'),
        stage_with_dcn=(False, True, True, True)),
    neck=dict(
        type='FPNC',
        in_channels=[256, 512, 1024, 2048],
        lateral_channels=256,
        asf_cfg=dict(attention_type='ScaleChannelSpatial')),
    det_head=dict(
        type='DBHead',
        in_channels=256,
        module_loss=dict(type='DBModuleLoss'),
        postprocessor=dict(
            type='DBPostprocessor',
            text_repr_type='quad',
            epsilon_ratio=0.002)),
    data_preprocessor=dict(
        type='TextDetDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True,
        pad_size_divisor=32))
train_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    dict(
        type='LoadOCRAnnotations',
        with_bbox=True,
        with_polygon=True,
        with_label=True),
    dict(
        type='TorchVisionWrapper',
        op='ColorJitter',
        brightness=0.12549019607843137,
        saturation=0.5),
    dict(
        type='ImgAugWrapper',
        args=[['Fliplr', 0.5], {
            'cls': 'Affine',
            'rotate': [-10, 10]
        }, ['Resize', [0.5, 3.0]]]),
    dict(type='RandomCrop', min_side_ratio=0.1),
    dict(type='Resize', scale=(1024, 1024), keep_ratio=True),
    dict(type='Pad', size=(1024, 1024)),
    dict(
        type='PackTextDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
test_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    dict(type='Resize', scale=(1024, 1024), keep_ratio=True),
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True),
    dict(
        type='PackTextDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape', 'scale_factor',
                   'instances'))
]
default_scope = 'mmocr'
env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
randomness = dict(seed=None)
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=5),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=20),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    sync_buffer=dict(type='SyncBuffersHook'),
    visualization=dict(
        type='VisualizationHook',
        interval=1,
        enable=False,
        show=False,
        draw_gt=False,
        draw_pred=False))
log_level = 'INFO'
log_processor = dict(type='LogProcessor', window_size=10, by_epoch=True)
load_from = 'https://download.openmmlab.com/mmocr/textdet/dbnetpp/tmp_1.0_pretrain/dbnetpp_r50dcnv2_fpnc_100k_iter_synthtext-20220502-352fec8a.pth'
resume = False
val_evaluator = dict(type='HmeanIOUMetric')
test_evaluator = dict(type='HmeanIOUMetric')
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='TextDetLocalVisualizer',
    name='visualizer',
    vis_backends=[dict(type='LocalVisBackend')])
icdar2015_textdet_data_root = '/home/aipf/work/团队共享目录/zhixin/datasets/icdar2015'
icdar2015_textdet_train = dict(
    type='OCRDataset',
    data_root='/home/aipf/work/团队共享目录/zhixin/datasets/icdar2015',
    ann_file='textdet_train.json',
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=None)
icdar2015_textdet_test = dict(
    type='OCRDataset',
    data_root='/home/aipf/work/团队共享目录/zhixin/datasets/icdar2015',
    ann_file='textdet_test.json',
    test_mode=True,
    pipeline=None)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.0035, momentum=0.9, weight_decay=0.0001))
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=1200, val_interval=20)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
param_scheduler = [dict(type='PolyLR', power=0.9, eta_min=1e-07, end=200)]
train_list = [
    dict(
        type='OCRDataset',
        data_root='/home/aipf/work/团队共享目录/zhixin/datasets/icdar2015',
        ann_file='textdet_train.json',
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=None)
]
test_list = [
    dict(
        type='OCRDataset',
        data_root='/home/aipf/work/团队共享目录/zhixin/datasets/icdar2015',
        ann_file='textdet_test.json',
        test_mode=True,
        pipeline=None)
]
train_dataloader = dict(
    batch_size=8,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type='ConcatDataset',
        datasets=[
            dict(
                type='OCRDataset',
                data_root='/home/aipf/work/团队共享目录/zhixin/datasets/icdar2015',
                ann_file='textdet_train.json',
                filter_cfg=dict(filter_empty_gt=True, min_size=32),
                pipeline=None)
        ],
        pipeline=[
            dict(
                type='LoadImageFromFile',
                color_type='color_ignore_orientation'),
            dict(
                type='LoadOCRAnnotations',
                with_bbox=True,
                with_polygon=True,
                with_label=True),
            dict(
                type='TorchVisionWrapper',
                op='ColorJitter',
                brightness=0.12549019607843137,
                saturation=0.5),
            dict(
                type='ImgAugWrapper',
                args=[['Fliplr', 0.5], {
                    'cls': 'Affine',
                    'rotate': [-10, 10]
                }, ['Resize', [0.5, 3.0]]]),
            dict(type='RandomCrop', min_side_ratio=0.1),
            dict(type='Resize', scale=(1024, 1024), keep_ratio=True),
            dict(type='Pad', size=(1024, 1024)),
            dict(
                type='PackTextDetInputs',
                meta_keys=('img_path', 'ori_shape', 'img_shape'))
        ]))
val_dataloader = dict(
    batch_size=8,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='ConcatDataset',
        datasets=[
            dict(
                type='OCRDataset',
                data_root='/home/aipf/work/团队共享目录/zhixin/datasets/icdar2015',
                ann_file='textdet_test.json',
                test_mode=True,
                pipeline=None)
        ],
        pipeline=[
            dict(
                type='LoadImageFromFile',
                color_type='color_ignore_orientation'),
            dict(type='Resize', scale=(1024, 1024), keep_ratio=True),
            dict(
                type='LoadOCRAnnotations',
                with_polygon=True,
                with_bbox=True,
                with_label=True),
            dict(
                type='PackTextDetInputs',
                meta_keys=('img_path', 'ori_shape', 'img_shape',
                           'scale_factor', 'instances'))
        ]))
test_dataloader = dict(
    batch_size=8,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type='ConcatDataset',
        datasets=[
            dict(
                type='OCRDataset',
                data_root='/home/aipf/work/团队共享目录/zhixin/datasets/icdar2015',
                ann_file='textdet_test.json',
                test_mode=True,
                pipeline=None)
        ],
        pipeline=[
            dict(
                type='LoadImageFromFile',
                color_type='color_ignore_orientation'),
            dict(type='Resize', scale=(1024, 1024), keep_ratio=True),
            dict(
                type='LoadOCRAnnotations',
                with_polygon=True,
                with_bbox=True,
                with_label=True),
            dict(
                type='PackTextDetInputs',
                meta_keys=('img_path', 'ori_shape', 'img_shape',
                           'scale_factor', 'instances'))
        ]))
launcher = 'none'
work_dir = 'output/dbpp0417_dcnv2'

qiuzhixin9527 avatar Apr 20 '23 08:04 qiuzhixin9527