apps Problem in ground-truth solutions

Open nicoladainese96 opened this issue 9 months ago • 0 comments

Hi, I'm running into a problem when evaluating the ground-truth solutions. As a preliminary check for a pipeline that will process the whole APPS benchmark with an LLM, I take one random solution per problem when any are available and submit an empty solution otherwise. For the competition problems in the test split, only 311 of the 1000 problems have solutions, so I should get a strict accuracy of 31.1%, given that the solutions for the other 689 are left empty. However, I get the following results:

Test Case Average (average accuracy over problems) = 0.27318586602648753
Strict Accuracy (all test cases passed / total problems) = 0.263

Here's a screenshot of the last part of the evaluation script. Is it possible that certain solutions are only partially correct?
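To make the gap concrete, here is a rough sketch of the arithmetic behind the numbers above. It assumes that an empty solution passes no test cases and contributes 0 to both metrics; the variable names are only illustrative and are not taken from the evaluation script.

```python
# Numbers reported in this issue (competition problems, test split).
total_problems = 1000
problems_with_solutions = 311
strict_accuracy = 0.263
test_case_average = 0.27318586602648753

# Assumption: an empty solution passes no test cases, so it scores 0
# on both metrics. Under that assumption, the best possible strict
# accuracy is the fraction of problems that have a solution at all.
expected_strict = problems_with_solutions / total_problems

# Problems whose sampled solution passed every test case.
fully_passing = round(strict_accuracy * total_problems)

# Sampled ground-truth solutions that failed at least one test case.
failing_solutions = problems_with_solutions - fully_passing

# The test case average is a mean of per-problem accuracies. Summing it
# back up and removing the fully passing problems leaves the partial
# credit earned by the failing solutions.
partial_credit = test_case_average * total_problems - fully_passing
mean_accuracy_of_failing = partial_credit / failing_solutions

print(f"expected strict accuracy: {expected_strict:.3f}")
print(f"solutions failing at least one test: {failing_solutions}")
print(f"mean test-case accuracy of those: {mean_accuracy_of_failing:.3f}")
```

If that assumption holds, 48 of the 311 sampled solutions fail at least one test case, and the test case average being above the strict accuracy suggests that some of them still pass part of their tests.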

Thank you in advance for any help!

Screenshot 2024-04-29 at 14 35 30

Apr 29 '24 11:04 nicoladainese96
