
Validation-only option

idansc opened this issue 4 years ago · 9 comments

Hi,
I'm having difficulties evaluating the model on the validation set. I changed the evaluate.py script as follows:

  • changed the split parameter to 'val'

  • changed the ranks_json construction to append one entry per round (a fuller sketch with placeholder data follows after this list):

        for r_idx in range(10):
            ranks_json.append(
                {
                    "image_id": batch["image_id"][i].item(),
                    "round_id": int(r_idx) + 1,
                    "ranks": [rank.item() for rank in ranks[i][r_idx][:]],
                }
            )

  • I added default rankings for the missing validation samples, i.e.:

535946 1
535946 2
535946 3
535946 4
535946 5
535946 6
535946 7
535946 8
535946 9
535946 10
195186 1
195186 2
195186 3
195186 4
195186 5
195186 6
195186 7
195186 8
195186 9
195186 10
193304 1
193304 2
193304 3
193304 4
193304 5
193304 6
193304 7
193304 8
193304 9
193304 10
261307 1
261307 2
261307 3
261307 4
261307 5
261307 6
261307 7
261307 8
261307 9
261307 10
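
A fuller, self-contained sketch of the modified loop (placeholder scores and image ids; the inline rank computation stands in for scores_to_ranks from visdial_metrics.py, and this is an illustration rather than the exact evaluate.py code):

import json
import torch

# Sketch with placeholder data, not the actual evaluate.py code: shows the shape
# of the per-round entries appended for the 'val' split.
batch_size, num_rounds, num_options = 2, 10, 100
image_ids = [185565, 284024]                      # placeholder image ids
scores = torch.rand(batch_size, num_rounds, num_options)
# placeholder rank computation; in the repo this comes from scores_to_ranks
ranks = scores.argsort(dim=-1, descending=True).argsort(dim=-1) + 1

ranks_json = []
for i in range(batch_size):
    for r_idx in range(num_rounds):               # all 10 rounds are scored on val
        ranks_json.append(
            {
                "image_id": image_ids[i],
                "round_id": r_idx + 1,
                "ranks": [int(r) for r in ranks[i][r_idx]],
            }
        )

with open("val_ranks.json", "w") as f:
    json.dump(ranks_json, f)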

Still, the submitted JSON file gets low scores on EvalAI. Can you please help with a validation-only feature?

idansc · Nov 22 '20 13:11

Interesting, I will take a look at this. I think the issue is that, for the validation samples, the GT option is at index 0, so submitting to EvalAI might require some re-arranging of the options. Changing the val dataloader to avoid doing this would be less confusing; I can try to push that change in a day or two.
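
To illustrate (a hypothetical example, not the actual dataloader code):

# Hypothetical illustration, not the actual dataloader code: for val samples the
# GT answer is moved to option index 0, so ranks computed over this re-ordered
# list no longer line up with the official option order that EvalAI expects.
options = ["yes", "no", "maybe", "two of them"]   # made-up answer options
gt_index = 2                                      # GT position in the official order
options.insert(0, options.pop(gt_index))          # "maybe" (the GT) is now at index 0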

Let me know if this was indeed the issue.

vmurahari3 · Nov 24 '20 20:11

Thanks, I will look into it next week. (Happy holidays!)

idansc · Nov 28 '20 21:11

Hello, I have encountered the same difficulty. I found that the function scores_to_ranks in visdial_metrics.py does not support scores that contain multiple rounds and therefore computes incorrect ranks for the validation split. Hence, in addition to idansc's changes, I have modified scores_to_ranks to support multiple rounds:

import torch


def scores_to_ranks(scores: torch.Tensor):
    """Convert model output scores into ranks."""
    batch_size, num_rounds, num_options = scores.size()

    # sort in descending order - largest score gets the best rank (rank 1)
    sorted_scores, ranked_idx = scores.sort(-1, descending=True)

    # ranked_idx[b, i, j] gives the option index holding the j-th highest score;
    # invert that mapping so ranks[b, i, k] is the rank of option k
    ranks = ranked_idx.clone().fill_(0)
    ranks = ranks.view(batch_size, num_rounds, num_options)
    for b in range(batch_size):
        for i in range(num_rounds):
            for j in range(num_options):
                ranks[b][i][ranked_idx[b, i, j]] = j
    # convert from 0-99 ranks to 1-100 ranks
    ranks += 1
    return ranks
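
As an aside, an equivalent vectorized form (only a sketch for reference, not part of the patch I applied) would be:

import torch

def scores_to_ranks_vectorized(scores: torch.Tensor) -> torch.Tensor:
    """Sketch: same result as the loop version above, without the triple loop."""
    # The argsort of the descending argsort is the inverse permutation, i.e. each
    # option's 0-based rank; adding 1 gives the 1-100 ranks expected by EvalAI.
    ranked_idx = scores.argsort(dim=-1, descending=True)
    return ranked_idx.argsort(dim=-1) + 1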

Still, the EvalAI scores are very low. I saw you mentioned that the GT option is placed differently in the validation split, but I do not understand why that is relevant for evaluation. Could you please clarify this point and help with inference on the validation split?

yakobyd · Dec 16 '20 20:12

Hi Daniel,

For the validation split, for convenience, the GT index of the correct option is set to 0 (https://github.com/vmurahari3/visdial-bert/blob/87e264794c45cc5c8c1ea243ad9d2b4d94a44faf/dataloader/dataloader_visdial.py#L269). Therefore, when generating the EvalAI file for the validation samples, the original option order has to be restored:

ranks = sample['ranks'].copy()
gt_answer_rank = ranks.pop(0)           # rank assigned to the GT option (stored at index 0 on val)
ranks.insert(gt_index, gt_answer_rank)  # put it back at the GT option's original position
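
As a small self-contained sketch (the helper name and variables are just for illustration, not the repository's actual code), the restoration could be wrapped up as:

def restore_option_order(round_ranks, gt_index):
    """Sketch: map ranks computed with the GT at index 0 back to the official order."""
    round_ranks = list(round_ranks)
    gt_answer_rank = round_ranks.pop(0)          # GT option sits at index 0 on val
    round_ranks.insert(gt_index, gt_answer_rank)
    return round_ranks

# e.g. with 5 options where the GT originally sat at position 3:
# restore_option_order([1, 4, 2, 5, 3], gt_index=3) -> [4, 2, 5, 1, 3]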

vmurahari3 · Dec 22 '20 04:12

I think it would be a good idea for me to add support for EvalAI evaluation on the validation split. I can push that in a couple of days.

vmurahari3 · Dec 22 '20 04:12

ranks.insert(gt_index, gt_answer_rank)

Hi, I validated this, and it was indeed the issue. Thanks for the support! I'll leave the issue open in case you are working on an evaluation option, but feel free to close :)

idansc · Jan 23 '21 08:01

Hi authors,

Did you ever update the code for evaluating on the val set? I still get some errors when testing on val. What should I change?

Thanks!

Best, Lu

yulu0724 · Aug 10 '21 06:08

Meanwhile, you can use the code here: Two-step ranks ensemble.ipynb (see Box 5 function align_bert_model)

It takes a working output (also available in the project above) and aligns this project's output accordingly.

idansc · Aug 10 '21 08:08

Thanks so much. BTW, I found that the val data (all rounds) and the test data (a single round) are different. Should we change the code for val set evaluation?
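
For reference, the difference shows up directly in the submission JSON (an illustration with made-up ids, ranks, and round numbers, based on the EvalAI VisDial format): on val every image has one entry per round (round_id 1-10), while on test only the single evaluated round is submitted.

# Illustration only: ids, ranks, and the evaluated round are made up.
example_image_id = 185565
dummy_ranks = list(range(1, 101))              # placeholder ranks for 100 options

# val submission: one entry per round for every image
val_entries = [
    {"image_id": example_image_id, "round_id": r + 1, "ranks": dummy_ranks}
    for r in range(10)
]

# test submission: a single entry per image, for the round being evaluated
test_entries = [
    {"image_id": example_image_id, "round_id": 7, "ranks": dummy_ranks}
]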

yulu0724 · Aug 11 '21 02:08