
Performance difference between evaluate.py and piqa_evaluate.py

Open · jhyuklee opened this issue on Oct 04 '18 · 2 comments

The performances of the two evaluation scripts differ as follows:

$ python evaluate.py $SQUAD_DEV_PATH /tmp/piqa/pred.json 
{"exact_match": 52.81929990539262, "f1": 63.28879733489547}
$ python piqa_evaluate.py $SQUAD_DEV_PATH /tmp/piqa/context_emb/ /tmp/piqa/question_emb/
{"exact_match": 52.28949858088931, "f1": 62.72236634535493}

The difference is about 0.5~0.6 points, and the tested model is LSTM+SA+ELMo.

jhyuklee · Oct 04 '18 07:10

$ python evaluate.py $SQUAD_DEV_PATH /tmp/piqa/pred.json
{"exact_match": 53.207190160832546, "f1": 63.382281758599724}

$ python piqa_evaluate.py $SQUAD_DEV_PATH /tmp/piqa/context_emb/ /tmp/piqa/question_emb/
{"exact_match": 53.39640491958373, "f1": 63.51748187339812}

Sometimes the piqa_evaluate.py score goes up instead.

jhyuklee · Oct 05 '18 03:10

I have found that the prediction JSON files are quite different, too. More than 600 out of 10,570 predictions differ between the (original) pred.json and the piqa-version pred.json. One possible cause is the different test-time behavior (outer product of start and end probabilities vs. inner product of phrase and query vectors). The outer product of start and end probabilities (which was designed for efficient learning and testing) could result in a different ranking compared to the inner product.
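
To make the check reproducible, here is a rough snippet for counting differing predictions; the path of the piqa-version prediction file is hypothetical, since only the embedding directories are shown above:

```python
import json

# Load both prediction files (the second path is a placeholder for wherever
# the piqa-version predictions were written).
with open('/tmp/piqa/pred.json') as f:
    orig_preds = json.load(f)
with open('/tmp/piqa/pred_piqa.json') as f:  # hypothetical path
    piqa_preds = json.load(f)

# Count question ids whose predicted answer strings differ.
num_diff = sum(1 for qid in orig_preds if orig_preds[qid] != piqa_preds.get(qid))
print(f'{num_diff} / {len(orig_preds)} predictions differ')
```

And a toy sketch of the two scoring schemes being contrasted (random stand-in numbers, not PIQA's actual tensors or code), just to show where the rankings come from in each case:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 8, 4

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Scheme 1 (evaluate.py-style prediction): score each span (i, j) by the
# outer product of start and end probabilities, keeping only i <= j.
start_probs = softmax(rng.normal(size=seq_len))
end_probs = softmax(rng.normal(size=seq_len))
span_scores = np.triu(np.outer(start_probs, end_probs))
best_outer = np.unravel_index(span_scores.argmax(), span_scores.shape)

# Scheme 2 (piqa_evaluate.py-style): score each span by the inner product of
# its phrase vector with the question vector, as done with the dumped
# context_emb / question_emb files. Random vectors stand in for the dumps.
phrase_vecs = {(i, j): rng.normal(size=dim)
               for i in range(seq_len) for j in range(i, seq_len)}
question_vec = rng.normal(size=dim)
best_inner = max(phrase_vecs, key=lambda span: float(phrase_vecs[span] @ question_vec))

print('outer-product top span:', tuple(int(x) for x in best_outer))
print('inner-product top span:', best_inner)
```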

jhyuklee · Oct 05 '18 14:10