alexa-with-dstc9-track1-dataset Human evaluation results from Google Sheet not reproducible?

Open nils-hde opened this issue 1 year ago • 0 comments

I am wondering how exactly the human evaluation scores in this sheet were computed: https://docs.google.com/spreadsheets/d/1THEh9MRPWQCC1v4DH5WTw0Gq8TyV9zncWWUL08drtUY/edit#gid=452616194

For reference, here is what we end up with (rightmost column) when taking the results from the current master branch (note also that Team 7 is missing entirely): https://docs.google.com/spreadsheets/d/1oEtzLyouTR-numPKS4WtMPSQTD6m9IzutXbtGwNNY5A/edit#gid=452616194

Both the absolute values and the rankings differ. We compute the average over all generated responses and then multiply it by the Detection F1-Score as provided in the paper.
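
To make the computation described above concrete, here is a minimal sketch. It assumes the per-response human scores are plain numbers; the function name `weighted_human_score` and the toy values are hypothetical and are not taken from the repository, the sheet, or the paper.

```python
from statistics import mean


def weighted_human_score(response_scores, detection_f1):
    """Average the per-response human scores, then scale by the detection F1.

    Hypothetical helper: mirrors the procedure described in this issue,
    not necessarily the one used to produce the original sheet.
    """
    if not response_scores:
        raise ValueError("no response scores given")
    return mean(response_scores) * detection_f1


# Toy values for illustration only (not real evaluation data).
toy_scores = [4.0, 3.0, 5.0]
toy_f1 = 0.5
print(weighted_human_score(toy_scores, toy_f1))  # prints 2.0
```

If the original sheet averaged over a different set of responses (for example only those where detection was correct), the result would differ from this sketch, which may explain part of the gap.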

Aug 11 '23 09:08 nils-hde
