visdial-bert
Validation-only option
Hi,
I'm having difficulties evaluating the model on the validation set.
I changed the evaluate.py script as follows:
- changed the split parameter to 'val'
- changed the ranks_json construction to append one entry per round:

  for r_idx in range(10):
      ranks_json.append(
          {
              "image_id": batch["image_id"][i].item(),
              "round_id": int(r_idx) + 1,
              "ranks": [rank.item() for rank in ranks[i][r_idx][:]],
          }
      )

- added default rankings for the missing validation samples (see the sketch after this list), i.e.:
535946 1
535946 2
535946 3
535946 4
535946 5
535946 6
535946 7
535946 8
535946 9
535946 10
195186 1
195186 2
195186 3
195186 4
195186 5
195186 6
195186 7
195186 8
195186 9
195186 10
193304 1
193304 2
193304 3
193304 4
193304 5
193304 6
193304 7
193304 8
193304 9
193304 10
261307 1
261307 2
261307 3
261307 4
261307 5
261307 6
261307 7
261307 8
261307 9
261307 10
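For reference, a minimal sketch of how such placeholder entries could be appended to the ranks_json list built in evaluate.py. The image IDs are the missing validation samples listed above; the 1..100 default ranking is an arbitrary assumption, since the actual default values used are not shown.

# Sketch: append a default ranking for every round of the validation images
# that the dataloader skipped, so the submission file covers the full split.
MISSING_VAL_IMAGE_IDS = [535946, 195186, 193304, 261307]

ranks_json = []  # or reuse the list already built in evaluate.py
for image_id in MISSING_VAL_IMAGE_IDS:
    for round_id in range(1, 11):
        ranks_json.append(
            {
                "image_id": image_id,
                "round_id": round_id,
                "ranks": list(range(1, 101)),  # arbitrary default ranking (assumption)
            }
        )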
Still, the submitted JSON file gets low scores on EvalAI. Can you please help with a validation-only feature?
Interesting, will take a look at this. I think the issue is that for the validation samples, the GT option is at index 0. Submitting to EvalAI might require some re-arranging of the options. I think changing the val dataloader to avoid doing this would be less confusing. I can try to push this change in a day or two.
Let me know if this was indeed the issue.
Thanks, I will look into it next week. (Happy holidays!)
Hello, I have encountered the same difficulty. I found that the function scores_to_ranks in visdial_metrics.py does not support scores that contain multiple rounds, and therefore calculates incorrect ranks for the validation split. Hence, in addition to idansc's changes, I have modified scores_to_ranks to support multiple rounds:
def scores_to_ranks(scores: torch.Tensor):
    """Convert model output scores into ranks."""
    batch_size, num_rounds, num_options = scores.size()
    # sort in descending order - largest score gets highest rank
    sorted_ranks, ranked_idx = scores.sort(-1, descending=True)
    # i-th position in ranked_idx specifies which score shall take this
    # position but we want i-th position to have rank of score at that
    # position, do this conversion
    ranks = ranked_idx.clone().fill_(0)
    ranks = ranks.view(batch_size, num_rounds, num_options)
    for b in range(batch_size):
        for i in range(num_rounds):
            for j in range(num_options):
                ranks[b][i][ranked_idx[b, i, j]] = j
    # convert from 0-99 ranks to 1-100 ranks
    ranks += 1
    return ranks
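For reference, a quick sanity check of this multi-round variant (a sketch only; the (batch_size, num_rounds, num_options) shape is taken from the function signature above, and the function is assumed to be in scope, e.g. imported from visdial_metrics):

import torch

# Assumes the modified scores_to_ranks above is in scope.
scores = torch.randn(2, 10, 100)   # 2 dialogs, 10 rounds, 100 answer options
ranks = scores_to_ranks(scores)    # same shape, values in 1..100

assert ranks.shape == scores.shape
# The highest-scoring option in each round should receive rank 1.
assert torch.all(ranks.gather(-1, scores.argmax(-1, keepdim=True)) == 1)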
Still, EvalAI scores are very low. I saw you mentioned that GTs are different in the validation split but I do not understand why they are relevant for evaluation. Could you please clarify this point and help with inference on the validation split?
Hi Daniel,
For the validation split, for convenience, the GT index of the correct option is set to 0 (https://github.com/vmurahari3/visdial-bert/blob/87e264794c45cc5c8c1ea243ad9d2b4d94a44faf/dataloader/dataloader_visdial.py#L269). Therefore, for the validation samples, when generating the EvalAI file, the original order has to be restored:

ranks = sample['ranks'].copy()
gt_answer_rank = ranks.pop(0)
ranks.insert(gt_index, gt_answer_rank)
I think it would be a good idea for me to add support for EvalAI evaluation on the validation split. I can push that in a couple of days.
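For concreteness, a minimal sketch of how this restoration could be wrapped and plugged into the per-round ranks_json loop from the first comment; restore_option_order is a hypothetical helper, and gt_index stands for however the dataloader exposes the original position of the GT option:

# Sketch only: restore the original option order before writing the EvalAI file.
# `ranks_per_round` is one round's ranks with the GT option at index 0 (the
# dataloader's val convention); `gt_index` is the GT option's original position.
def restore_option_order(ranks_per_round, gt_index):
    ranks = list(ranks_per_round)
    gt_answer_rank = ranks.pop(0)          # rank assigned to the GT option
    ranks.insert(gt_index, gt_answer_rank) # put it back where EvalAI expects it
    return ranks

# Used inside the per-round loop from the first comment:
# ranks_json.append({
#     "image_id": batch["image_id"][i].item(),
#     "round_id": int(r_idx) + 1,
#     "ranks": restore_option_order([r.item() for r in ranks[i][r_idx]], gt_index),
# })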
Hi, I validated that restoring the order with ranks.insert(gt_index, gt_answer_rank) was indeed the issue. Thanks for the support! I'd leave the issue open in case you are working on an evaluation option, but feel free to close :)
Hi authors,
Did you ever update the code for evaluating the val set? I still get some errors when testing on val. What should I change?
Thanks!
Best, Lu
Meanwhile, you can use the code here: Two-step ranks ensemble.ipynb (see Box 5 function align_bert_model)
It takes a working output (also available in the project above) and aligns this project's output accordingly.
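In case the notebook is not at hand, a rough sketch of the kind of alignment it performs might look like the following; align_to_reference and the file layout are assumptions, and the actual align_bert_model function may differ:

import json

def align_to_reference(bert_ranks_path, reference_path, out_path):
    """Sketch: reorder/patch this project's ranks so every (image_id, round_id)
    pair expected by the EvalAI val split is present, using a known-good
    submission file as the reference skeleton."""
    with open(bert_ranks_path) as f:
        bert_entries = {(e["image_id"], e["round_id"]): e for e in json.load(f)}
    with open(reference_path) as f:
        reference = json.load(f)

    aligned = []
    for ref in reference:
        key = (ref["image_id"], ref["round_id"])
        # Fall back to the reference entry when this project produced no ranks
        # for that image/round (e.g. the missing validation samples above).
        aligned.append(bert_entries.get(key, ref))

    with open(out_path, "w") as f:
        json.dump(aligned, f)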
Thanks so much. BTW, I found that the val data (all rounds) and the test data (single round) are different. Shall we change the code for val set evaluation?