alexa-with-dstc9-track1-dataset
Human evaluation results from Google Sheet not reproducible?
I am wondering how exactly the human evaluation scores in this sheet were computed: https://docs.google.com/spreadsheets/d/1THEh9MRPWQCC1v4DH5WTw0Gq8TyV9zncWWUL08drtUY/edit#gid=452616194
For reference, here is what we end up with (rightmost column) when taking the results from the current master branch; note that Team 7 is missing entirely: https://docs.google.com/spreadsheets/d/1oEtzLyouTR-numPKS4WtMPSQTD6m9IzutXbtGwNNY5A/edit#gid=452616194
Both the absolute values and the rankings differ. We compute the average over all generated responses and then multiply it by the detection F1 score as provided in the paper.
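To make the computation unambiguous, here is a minimal sketch of what we do. The function name `final_human_score` and the sample ratings are hypothetical and only illustrate the arithmetic; they are not taken from the repository's scripts or from either sheet.

```python
from statistics import mean


def final_human_score(response_scores, detection_f1):
    """Average the per-response human ratings, then scale by detection F1.

    response_scores: human ratings, one per generated response (assumed format).
    detection_f1: the team's detection F1 score as reported in the paper.
    """
    if not response_scores:
        raise ValueError("no rated responses for this team")
    return mean(response_scores) * detection_f1


# Made-up ratings for illustration only, not real evaluation data.
example_ratings = [4.0, 3.0, 5.0]
example_result = final_human_score(example_ratings, detection_f1=0.5)
print(example_result)
```

If the sheet instead averages only over a subset of responses (for example, only correctly detected turns), or applies the F1 weighting differently, that could explain the mismatch.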