visdial-bert
Results from the test server don't match the reported results.
Hello, I used download_preprocessed.sh to download all the pretrained state dicts. Then I ran
python evaluate.py -n_gpus 8 -start_path ./checkpoints-release/basemodel_dense -save_name bestmodel_plus_dense
to generate the submission text file, which I submitted to the test server. The server returned {"test-std": {"MRR (x 100)": 13.4486998706508, "R@1": 4.075, "R@5": 18.825, "R@10": 34.1, "Mean": 23.001, "NDCG (x 100)": 30.34857283511257}}, which doesn't match the reported results.
I made sure the state dict was loaded correctly. I also ran
python evaluate.py -n_gpus 8 -start_path ./checkpoints-release/basemodel_dense_nsp -save_name basemodel_dense_nsp
to generate a second submission file and submitted it to the test server. The server returned {"test-std": {"MRR (x 100)": 9.081483141263188, "R@1": 2.125, "R@5": 10.5, "R@10": 20.7, "Mean": 31.2615, "NDCG (x 100)": 18.93568245023515}}, which doesn't match the reported results either.
What do you think might be the cause of this? Thank you very much.
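For anyone trying to reproduce this, one way to double-check that a checkpoint loaded completely is to compare the parameter names the model expects with the names stored in the checkpoint. The sketch below is only an illustration: compare_state_dict_keys is a hypothetical helper, not part of this repo, and the key names in the toy example are made up. It works on plain lists of names and strips the "module." prefix that torch.nn.DataParallel adds to every key.

```python
def compare_state_dict_keys(model_keys, checkpoint_keys, prefix="module."):
    """Report parameter names that differ between a model and a checkpoint.

    Hypothetical helper for illustration. `prefix` is stripped from both
    sides because checkpoints saved from a torch.nn.DataParallel wrapper
    carry a "module." prefix on every key.
    """
    def strip(key):
        if prefix and key.startswith(prefix):
            return key[len(prefix):]
        return key

    model = {strip(k) for k in model_keys}
    ckpt = {strip(k) for k in checkpoint_keys}
    return {
        # Parameters the model has but the checkpoint does not provide.
        "missing_in_checkpoint": sorted(model - ckpt),
        # Entries in the checkpoint that the model has no slot for.
        "unexpected_in_checkpoint": sorted(ckpt - model),
    }


# Toy example with made-up key names (not the real model's parameters).
report = compare_state_dict_keys(
    ["bert.embeddings.weight", "cls.bias"],
    ["module.bert.embeddings.weight", "module.cls.weight"],
)
print(report)
```

With PyTorch, the two arguments would typically be model.state_dict().keys() and the keys of the dict returned by torch.load(path, map_location="cpu"). Whether the released checkpoints store the weights at the top level or under a nested key is an assumption to verify. If both lists in the report are empty, key mismatches can be ruled out as the cause.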