[RAFT] How to calculate evaluation metrics?
Hello, thank you for sharing this great project!
I would like to ask about the evaluation metric calculation. The figures in the paper report the final accuracy, but I couldn't find any explanation of how it is computed in the main text. In the source code (eval.py), it seems that the script only takes documents and questions as input and calls the model to generate answers, without computing any metric.
Could you please clarify how the accuracy is calculated? This is very important to me. Thank you!
@ShishirPatil @tianjunz @kaiwen129 ^
Hi @JIAWENee, thank you for your interest in the project! The evaluation metrics we report are those defined by each dataset. For example, with HotPotQA, we evaluate using string match against the ground-truth answer field. Hope that helps!
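For anyone else looking for a starting point, here is a minimal sketch of a string-match accuracy computation over the generated answers. This is only an illustration, not the repo's actual evaluation code: the `predictions`/`gold_answers` inputs and the normalization rules (lowercasing, stripping punctuation and articles) are assumptions in the style of HotPotQA-like exact-match scoring, so adapt them to the output format eval.py actually produces.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (a common normalization for exact-match QA scoring; assumed here)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(predictions, gold_answers) -> float:
    """Fraction of predictions whose normalized text equals the
    normalized ground-truth answer."""
    assert len(predictions) == len(gold_answers)
    if not gold_answers:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(gold)
        for pred, gold in zip(predictions, gold_answers)
    )
    return correct / len(gold_answers)


# Hypothetical usage on answers saved by eval.py:
# preds = ["Paris", "the Eiffel Tower"]
# golds = ["Paris", "Eiffel Tower"]
# print(exact_match_accuracy(preds, golds))  # -> 1.0
```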