
Performance difference between evaluate.py and piqa_evaluate.py

Open · jhyuklee opened this issue on Oct 04 '18 · 2 comments

The performances of the two evaluation scripts differ as follows:

$ python evaluate.py $SQUAD_DEV_PATH /tmp/piqa/pred.json 
{"exact_match": 52.81929990539262, "f1": 63.28879733489547}
$ python piqa_evaluate.py $SQUAD_DEV_PATH /tmp/piqa/context_emb/ /tmp/piqa/question_emb/
{"exact_match": 52.28949858088931, "f1": 62.72236634535493}

The difference is about 0.5~0.6 points, and the tested model is LSTM+SA+ELMo.

jhyuklee · Oct 04 '18 07:10

$ python evaluate.py $SQUAD_DEV_PATH /tmp/piqa/pred.json
{"exact_match": 53.207190160832546, "f1": 63.382281758599724}

$ python piqa_evaluate.py $SQUAD_DEV_PATH /tmp/piqa/context_emb/ /tmp/piqa/question_emb/
{"exact_match": 53.39640491958373, "f1": 63.51748187339812}

Sometimes the piqa_evaluate.py score goes up instead.

jhyuklee · Oct 05 '18 03:10

I have found that the prediction JSON files are quite different, too. More than 600 out of 10,570 predictions differ between the (original) pred.json and the piqa-version pred.json. One possible cause is the different test-time behavior (outer product of start and end probabilities vs. inner product of phrase and query vectors). The outer product of start and end probabilities (which was designed for efficient learning and testing) could result in a different ranking compared to the inner product.
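
To make the check reproducible, here is a rough snippet for counting differing predictions; the path of the piqa-version prediction file is hypothetical, since only the embedding directories are shown above:

```python
import json

# Load both prediction files (the second path is a placeholder for wherever
# the piqa-version predictions were written).
with open('/tmp/piqa/pred.json') as f:
    orig_preds = json.load(f)
with open('/tmp/piqa/pred_piqa.json') as f:  # hypothetical path
    piqa_preds = json.load(f)

# Count question ids whose predicted answer strings differ.
num_diff = sum(1 for qid in orig_preds if orig_preds[qid] != piqa_preds.get(qid))
print(f'{num_diff} / {len(orig_preds)} predictions differ')
```

And a toy sketch of the two scoring schemes being contrasted (random stand-in numbers, not PIQA's actual tensors or code), just to show where the rankings come from in each case:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 8, 4

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Scheme 1 (evaluate.py-style prediction): score each span (i, j) by the
# outer product of start and end probabilities, keeping only i <= j.
start_probs = softmax(rng.normal(size=seq_len))
end_probs = softmax(rng.normal(size=seq_len))
span_scores = np.triu(np.outer(start_probs, end_probs))
best_outer = np.unravel_index(span_scores.argmax(), span_scores.shape)

# Scheme 2 (piqa_evaluate.py-style): score each span by the inner product of
# its phrase vector with the question vector, as done with the dumped
# context_emb / question_emb files. Random vectors stand in for the dumps.
phrase_vecs = {(i, j): rng.normal(size=dim)
               for i in range(seq_len) for j in range(i, seq_len)}
question_vec = rng.normal(size=dim)
best_inner = max(phrase_vecs, key=lambda span: float(phrase_vecs[span] @ question_vec))

print('outer-product top span:', tuple(int(x) for x in best_outer))
print('inner-product top span:', best_inner)
```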

jhyuklee · Oct 05 '18 14:10