
score_per_joint for Top-down approach

EugeneKok97 opened this issue 3 years ago · 7 comments

Hello, I'm trying to detect two keypoints of different categories in an image that contains only one target object. However, in most cases only one of them is present. Previously, I was able to train the associative embedding detector to detect both keypoints and filter out the incorrect one based on its confidence score by including 'score_per_joint' in my test_cfg (sketched below). However, I couldn't do the same with a top-down detector: the confidence scores seem to be the same for all joints.
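For reference, this is roughly how I had enabled it for the bottom-up model (a sketch from memory; other test_cfg entries omitted):

```python
# Sketch of the associative embedding test_cfg I used previously (from memory;
# only the relevant key is shown). With score_per_joint=True the decoder keeps
# one confidence value per joint instead of a single pooled instance score.
test_cfg = dict(
    score_per_joint=True,
    # ... other bottom-up test settings (flip_test, NMS kernel, etc.)
)
```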

I'm fairly new to pose estimation, please let me know if I'm missing something. Thank you!

EugeneKok97 · Sep 15 '22 06:09

Hi, thanks for using MMPose. Top-down models should also output per-joint scores. Could you please provide more details, such as which top-down model you are using and how you obtained the confidence scores?

ly015 · Sep 15 '22 09:09

@ly015 Hi, thank you for your swift reply. I'm using HRNet as the backbone and TopDownSimpleHead as the keypoint head. The bounding box is taken as the size of the entire image.

The confidence score is taken from the third element (index 2) of each returned keypoint, obtained using inference_top_down_pose_model(pose_model, img, person_results=None, bbox_thr=None, format='xywh', dataset='TopDownCocoDataset', dataset_info=None, return_heatmap=False, outputs=None).

Returned output: [{'bbox': array([ 0, 0, 96, 96]), 'keypoints': array([[76.75 , 34.250004 , 0.8377565 ], [79.25 , 36.750004 , 0.80142593]], dtype=float32)}]

The two keypoints have nearly the same position and very similar confidence scores.
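For completeness, this is roughly how I read the scores out of the result (the threshold is just an illustrative cutoff; pose_model and img are assumed to be set up beforehand):

```python
from mmpose.apis import inference_top_down_pose_model

# pose_model and img are assumed to be initialised/loaded beforehand.
pose_results, _ = inference_top_down_pose_model(
    pose_model, img, person_results=None, bbox_thr=None,
    format='xywh', dataset='TopDownCocoDataset')

for result in pose_results:
    # each row of 'keypoints' is (x, y, score)
    for joint_id, (x, y, score) in enumerate(result['keypoints']):
        if score > 0.5:  # hypothetical per-joint cutoff
            print(f'joint {joint_id}: ({x:.1f}, {y:.1f}) score={score:.3f}')
```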

Config and dataset_info:

```python
log_level = 'INFO'
load_from = None
resume_from = None
dist_params = dict(backend='nccl')
workflow = [('train', 1)]
checkpoint_config = dict(interval=50)
evaluation = dict(interval=20, metric='mAP', key_indicator='AP')
optimizer = dict(type='Adam', lr=0.0005)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[170, 200])
total_epochs = 300
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
channel_cfg = dict(
    num_output_channels=2,
    dataset_joints=2,
    dataset_channel=[[0, 1]],
    inference_channel=[0, 1])
model = dict(
    type='TopDown',
    pretrained=
    'https://download.openmmlab.com/mmpose/pretrain_models/hrnet_w32-36af842e.pth',
    backbone=dict(
        type='HRNet',
        in_channels=3,
        extra=dict(
            stage1=dict(
                num_modules=1,
                num_branches=1,
                block='BOTTLENECK',
                num_blocks=(4, ),
                num_channels=(64, )),
            stage2=dict(
                num_modules=1,
                num_branches=2,
                block='BASIC',
                num_blocks=(4, 4),
                num_channels=(32, 64)),
            stage3=dict(
                num_modules=4,
                num_branches=3,
                block='BASIC',
                num_blocks=(4, 4, 4),
                num_channels=(32, 64, 128)),
            stage4=dict(
                num_modules=3,
                num_branches=4,
                block='BASIC',
                num_blocks=(4, 4, 4, 4),
                num_channels=(32, 64, 128, 256)))),
    keypoint_head=dict(
        type='TopDownSimpleHead',
        in_channels=32,
        out_channels=2,
        num_deconv_layers=0,
        extra=dict(final_conv_kernel=1),
        loss_keypoint=dict(type='JointsMSELoss', use_target_weight=True)),
    train_cfg=dict(),
    test_cfg=dict(
        flip_test=True,
        post_process='default',
        shift_heatmap=True,
        modulate_kernel=11))
data_cfg = dict(
    image_size=[96, 96],
    heatmap_size=[24, 24],
    num_output_channels=2,
    num_joints=2,
    dataset_channel=[[0, 1]],
    inference_channel=[0, 1],
    soft_nms=False,
    nms_thr=1.0,
    oks_thr=0.9,
    vis_thr=0.2,
    use_gt_bbox=True,
    det_bbox_thr=0.0,
    bbox_file=
    'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownRandomFlip', flip_prob=0.5),
    dict(
        type='TopDownHalfBodyTransform',
        num_joints_half_body=8,
        prob_half_body=0.3),
    dict(
        type='TopDownGetRandomScaleRotation', rot_factor=40, scale_factor=0.5),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(type='TopDownGenerateTarget', sigma=2),
    dict(
        type='Collect',
        keys=['img', 'target', 'target_weight'],
        meta_keys=[
            'image_file', 'joints_3d', 'joints_3d_visible', 'center', 'scale',
            'rotation', 'bbox_score', 'flip_pairs'
        ])
]
val_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'image_file', 'center', 'scale', 'rotation', 'bbox_score',
            'flip_pairs'
        ])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='TopDownAffine'),
    dict(type='ToTensor'),
    dict(
        type='NormalizeTensor',
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'image_file', 'center', 'scale', 'rotation', 'bbox_score',
            'flip_pairs'
        ])
]
data_root = '/home/lmga-titanx/mmpose/data/testing_set'
data = dict(
    samples_per_gpu=16,
    workers_per_gpu=2,
    train=dict(
        type='TopDownCocoDataset',
        ann_file=
        '/home/lmga-titanx/mmpose/data/testing_set/annotations/person_keypoints_train.json',
        img_prefix='/home/lmga-titanx/mmpose/data/testing_set/images/',
        data_cfg=dict(
            image_size=[96, 96],
            heatmap_size=[24, 24],
            num_output_channels=2,
            num_joints=2,
            dataset_channel=[[0, 1]],
            inference_channel=[0, 1],
            soft_nms=False,
            nms_thr=1.0,
            oks_thr=0.9,
            vis_thr=0.2,
            use_gt_bbox=True,
            det_bbox_thr=0.0,
            bbox_file=
            'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
        ),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='TopDownRandomFlip', flip_prob=0.5),
            dict(
                type='TopDownHalfBodyTransform',
                num_joints_half_body=8,
                prob_half_body=0.3),
            dict(
                type='TopDownGetRandomScaleRotation',
                rot_factor=40,
                scale_factor=0.5),
            dict(type='TopDownAffine'),
            dict(type='ToTensor'),
            dict(
                type='NormalizeTensor',
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
            dict(type='TopDownGenerateTarget', sigma=2),
            dict(
                type='Collect',
                keys=['img', 'target', 'target_weight'],
                meta_keys=[
                    'image_file', 'joints_3d', 'joints_3d_visible', 'center',
                    'scale', 'rotation', 'bbox_score', 'flip_pairs'
                ])
        ]),
    val=dict(
        type='TopDownCocoDataset',
        ann_file=
        '/home/lmga-titanx/mmpose/data/testing_set/annotations/person_keypoints_valid.json',
        img_prefix='/home/lmga-titanx/mmpose/data/testing_set/images/',
        data_cfg=dict(
            image_size=[96, 96],
            heatmap_size=[24, 24],
            num_output_channels=2,
            num_joints=2,
            dataset_channel=[[0, 1]],
            inference_channel=[0, 1],
            soft_nms=False,
            nms_thr=1.0,
            oks_thr=0.9,
            vis_thr=0.2,
            use_gt_bbox=True,
            det_bbox_thr=0.0,
            bbox_file=
            'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
        ),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='TopDownAffine'),
            dict(type='ToTensor'),
            dict(
                type='NormalizeTensor',
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
            dict(
                type='Collect',
                keys=['img'],
                meta_keys=[
                    'image_file', 'center', 'scale', 'rotation', 'bbox_score',
                    'flip_pairs'
                ])
        ]),
    test=dict(
        type='TopDownCocoDataset',
        ann_file=
        '/home/lmga-titanx/mmpose/data/testing_set/annotations/person_keypoints_test.json',
        img_prefix='/home/lmga-titanx/mmpose/data/testing_set/images/',
        data_cfg=dict(
            image_size=[96, 96],
            heatmap_size=[24, 24],
            num_output_channels=2,
            num_joints=2,
            dataset_channel=[[0, 1]],
            inference_channel=[0, 1],
            soft_nms=False,
            nms_thr=1.0,
            oks_thr=0.9,
            vis_thr=0.2,
            use_gt_bbox=True,
            det_bbox_thr=0.0,
            bbox_file=
            'data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json'
        ),
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='TopDownAffine'),
            dict(type='ToTensor'),
            dict(
                type='NormalizeTensor',
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
            dict(
                type='Collect',
                keys=['img'],
                meta_keys=[
                    'image_file', 'center', 'scale', 'rotation', 'bbox_score',
                    'flip_pairs'
                ])
        ]))
work_dir = '/home/lmga-titanx/mmpose/work_dirs/topdown_hrnet_w32_coco_96x96'
gpu_ids = [0]
```

EugeneKok97 · Sep 16 '22 03:09

Is the original image of size [96, 96], the same as the model input? Please note that the bbox should match the actual image size (rather than the model input size) to take the entire image as input.
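For example, something along these lines (a sketch; image_path and pose_model are placeholders):

```python
import cv2
from mmpose.apis import inference_top_down_pose_model

img = cv2.imread(image_path)  # image_path is a placeholder
h, w = img.shape[:2]

# The bbox is given in xywh and should cover the actual image,
# whatever the model input size is.
person_results = [{'bbox': [0, 0, w, h]}]
pose_results, _ = inference_top_down_pose_model(
    pose_model, img, person_results=person_results,
    format='xywh', dataset='TopDownCocoDataset')
```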

ly015 · Sep 16 '22 04:09

@ly015 Yes, the size of the original image is [96, 96], and the bounding box annotated in my COCO JSON file is [0, 0, 96, 96].
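For reference, one of my annotation entries looks roughly like this (a sketch; the coordinate values are illustrative):

```python
# Sketch of one COCO-style annotation for a 96x96 image (values illustrative).
# A visibility flag of 0 marks a keypoint that is absent from the image.
annotation = {
    'image_id': 1,
    'category_id': 1,
    'bbox': [0, 0, 96, 96],    # xywh, matching the full image
    'num_keypoints': 1,
    'keypoints': [49, 52, 2,   # keypoint 0: x, y, visibility
                  0, 0, 0],    # keypoint 1: not present
}
```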

EugeneKok97 · Sep 16 '22 04:09

Are these two types of keypoints visually similar to each other? And does the predicted keypoint location seem reasonable?

I also noted a few details: (1) if you are using TopDownCocoDataset, please make sure that flip_pairs is properly defined for your dataset (see the sketch below), since the original flip-pair definition for COCO may lead to incorrect results on your data; (2) TopDownHalfBodyTransform in the training pipeline is unnecessary.
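With a dataset_info-based setup, the flip pairs are derived from the 'swap' fields, which should be left empty for keypoints that do not exchange under a horizontal flip (a hypothetical two-keypoint example; all names are illustrative):

```python
# Hypothetical dataset_info for a custom two-keypoint dataset. Flip pairs are
# derived from the 'swap' fields; leaving them empty means the two channels
# are NOT exchanged during flip_test.
dataset_info = dict(
    dataset_name='my_two_keypoints',
    paper_info=dict(),
    keypoint_info={
        0: dict(name='kpt_a', id=0, color=[255, 0, 0], type='', swap=''),
        1: dict(name='kpt_b', id=1, color=[0, 255, 0], type='', swap=''),
    },
    skeleton_info=dict(),
    joint_weights=[1.0, 1.0],
    sigmas=[0.05, 0.05])
```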

ly015 · Sep 16 '22 05:09

The problem I'm trying to solve is detecting the stem and calyx of an apple. They are visually similar but distinguishable based on the orientation of the apple. The bottom-up method was able to solve this problem, with a significant difference between the confidence scores of the two predictions. Here, the predicted location is reasonable for one keypoint, while the other keypoint tends to appear at a similar, incorrect position.

Thanks for the tips! Regarding the flip pairs, I'm currently setting them as [0, 1]. I've also tried setting flip_test=False in test_cfg, but both give the same problem.

Here is an example of my prediction result (attached image: 2018_13):

pose_results: [{'bbox': array([ 0, 0, 96, 96]), 'keypoints': array([[49.249996 , 51.749996 , 0.94243985], [49.249996 , 54.249996 , 0.850441 ]], dtype=float32)}]

EugeneKok97 · Sep 19 '22 03:09

> Is the original image of size [96, 96], the same as the model input? Please note that the bbox should match the actual image size (rather than the model input size) to take the entire image as input.

Hi, I am also using the hrnet_w32_coco_384x288 config for training, with original image size image_size=[720, 1184] and heatmap_size=[180, 296], but I get the following error: "The size of tensor a (180) must match the size of tensor b (184) at non-singleton dimension". How should I solve it?

Outstanding · Dec 02 '22 06:12