VidHOI
Evaluation only considering single-label?
Hi.
I have a question about using the vidor_eval.ipynb script to generate mAP. The script seems to support only the single-label case. If a ground-truth human-object pair has multiple interactions, for example gt <human1, (watch, next to), obj2>, only <human1, watch, obj2> can be matched to a prediction. This gt pair <human1, obj2> is then added to gt_bbox_pair_matched and cannot be matched to other predictions.
Thank you
Hi @nizhf thanks for your interest in our work! I think our vidor_eval.ipynb indeed supports multi-label evaluation. We loop through all the predicted HOI triplets, and when there's a match, we append the specific triplet_class to gt_bbox_pair_matched. Note that it's possible that there're more than one triplet with the same subject and object in the predicted HOI triplets.
I think what you append to gt_bbox_pair_matched is the index of the gt_pair. In gt_bbox_pair_matched.add(max_gt_id), the max_gt_id is set as max_gt_id = k, and k is from this line for k, gt_bbox_pair_id in enumerate(gt_bbox_pair_ids), which is the index of the gt_bbox_pair_id, but not a triplet.
Let me clarify: the idea of evaluation is:
for each predicted HOI triplet
    for each ground truth HOI triplet (k)
        if there's a match
            set is_match to True
            record k or update with the maximum overlapping object-pair boxes
    if there's a match
        add the matched, predicted HOI triplet into the true positive set
    else
        add into the false positive set
As the ground truth HOI triplets are multi-label, the predictions can also match them.
Thank you for detailed clarification.
What confuses me is for each ground truth HOI triplet (k). In the evaluation script, it refers to for k, gt_bbox_pair_id in enumerate(gt_bbox_pair_ids), and gt_bbox_pair_ids = result['gt_bbox_pair_ids'].
I checked the result JSON file, and gt_bbox_pair_ids is, for example, 'gt_bbox_pair_ids': [[0, 1], [1, 0]]. If I understand correctly, these are indices into gt_boxes. So maybe the loop here is really only for each ground-truth pair (k)? The ground truth HOI triplet is obtained by gt_rel_cls = result['gt_action_labels'][k][j]. If there is a match, the ground-truth pair k is added to gt_bbox_pair_matched. This pair then cannot be matched to any other predicted triplet.
Just a detailed example:
Assume we have two predicted HOIs: <human1, watch, obj2> and <human1, next_to, obj2>. The gt_bbox_pair_ids is [[0, 1]]. The gt_action_labels has 1.0 for watch and next_to.
We first process prediction <human1, watch, obj2>. We have k=0 and j=index_of_watch. Then we have result['gt_action_labels'][k][j]=1.0. This is a match, we add <human1, watch, obj2> to tp and k=0 to gt_bbox_pair_matched.
Then we process prediction <human1, next_to, obj2>. We have k=0 and j=index_of_next_to. We also have result['gt_action_labels'][k][j]=1.0. There should be a match, but we check that k=0 is already in gt_bbox_pair_matched, so <human1, next_to, obj2> is falsely added to fp.
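To make the example concrete, here is a minimal sketch of the bookkeeping as I understand it. It is not the code from vidor_eval.ipynb: the function evaluate, the flag track_by_class, and the class indices WATCH and NEXT_TO are made up for illustration, and the box matching (IoU check) is assumed to have already succeeded for every prediction.

```python
def evaluate(predictions, gt_action_labels, track_by_class):
    """Greedy TP/FP assignment for predicted HOI triplets.

    predictions: list of (k, j) tuples, where k is the index of the
        ground-truth pair the prediction overlaps with (hypothetical
        shortcut: the IoU matching is skipped) and j is the predicted
        action class index.
    gt_action_labels: gt_action_labels[k][j] == 1.0 if pair k has action j.
    track_by_class: if False, remember only the pair index k once matched;
        if True, remember the (k, j) combination instead.
    """
    matched = set()
    tp, fp = [], []
    for k, j in predictions:
        is_match = gt_action_labels[k][j] == 1.0
        key = (k, j) if track_by_class else k
        if is_match and key not in matched:
            matched.add(key)
            tp.append((k, j))
        else:
            fp.append((k, j))
    return tp, fp


# Hypothetical class indices for the two actions in the example.
WATCH, NEXT_TO = 0, 1

# One ground-truth pair <human1, obj2> labeled with both actions.
gt_action_labels = [[1.0, 1.0]]

# Two predictions on the same pair: watch first, then next_to.
predictions = [(0, WATCH), (0, NEXT_TO)]

tp_pair, fp_pair = evaluate(predictions, gt_action_labels, track_by_class=False)
tp_cls, fp_cls = evaluate(predictions, gt_action_labels, track_by_class=True)

print("pair-level tracking:  tp =", tp_pair, "fp =", fp_pair)
print("class-level tracking: tp =", tp_cls, "fp =", fp_cls)
```

With pair-level tracking the second prediction lands in fp even though its label is present in the ground truth. Tracking the (pair, class) combination instead lets both predictions count as true positives. Whether this matches the intended behavior of the script is exactly what I am asking.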
I hope I described my understanding of the vidor_eval.ipynb script clearly.