HellaSwag numbers?
Great work on this project! Table 20 of the LLaMA-2 paper reports that LLaMA-2 gets 77.2 accuracy on HellaSwag. The LLaMA-2 paper isn't clear on whether this is zero-shot, but Table 20 of the Falcon paper confirms that it is zero-shot. However, Table 25 of the Wanda paper reports that LLaMA-2 Dense gets 57.17 accuracy on HellaSwag.
This seems like a large gap. Could you help me to understand the gap? E.g. are there multiple metrics, or multiple versions of the dataset, or something else that could cause a gap like this?
Hi, thanks for the question. I think this is related to the choice of metric: the EleutherAI evaluation harness reports two metrics, acc and acc_norm. In the paper, we report the acc metric. However, based on the log file from our experiments, it seems that for LLaMA-2-7B, the acc_norm number on HellaSwag is 76.00.
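To illustrate how the two metrics can disagree, here is a minimal sketch. It assumes that acc picks the candidate ending with the highest raw log-likelihood, while acc_norm first divides each log-likelihood by the ending's length in characters, which is my understanding of how the harness scores multiple-choice tasks. The function name `score_example` and all the numbers are made up for illustration; they are not taken from our logs.

```python
def score_example(loglikelihoods, endings, gold):
    """Return (acc, acc_norm) for a single multiple-choice example.

    loglikelihoods[i] is the model's total log-likelihood of endings[i];
    gold is the index of the correct ending.
    """
    indices = range(len(endings))
    # acc: choose the ending with the highest raw log-likelihood.
    pred = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm (assumed): normalize by ending length in characters first,
    # so longer endings are not penalized just for being long.
    pred_norm = max(indices, key=lambda i: loglikelihoods[i] / len(endings[i]))
    return float(pred == gold), float(pred_norm == gold)


# Hypothetical example: the correct ending (index 1) is much longer,
# so its raw log-likelihood is lower even though it is better per character.
endings = ["stops.", "keeps riding the bike down the hill."]
loglikelihoods = [-6.0, -18.0]
acc, acc_norm = score_example(loglikelihoods, endings, gold=1)
print(acc, acc_norm)
```

In this toy case the raw score favors the short wrong ending, while the length-normalized score favors the long correct one. HellaSwag endings vary a lot in length, which would be consistent with a large gap between the two numbers.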
Thank you! Do you also have an acc_norm number for your experiments?