human-eval
pass@k on filtered samples
Hi,
Thank you for the great work!
I have 2 questions about the computation of the pass@k metric after applying filtering on the APPS benchmark.
- Will the `total` array in the code snippet below contain the number of filtered samples per problem, i.e. the samples that passed the example test cases from the problem statement, so that each entry is <= N_original_samples (= 1000)? https://github.com/openai/human-eval/blob/312c5e5532f0e0470bf47f77a6243e02a61da530/human_eval/evaluation.py#L85
- When the number of filtered samples is less than k (= [1, 5]), how do you compute the pass@k metric? For example, when N_filtered_samples = 1 and k = 5, can we assume execution results of 4 failures plus 1 pass or failure (depending on the final unit test results of that filtered sample)?
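
To make the second question concrete, here is a minimal Python sketch of the two behaviours I am trying to tell apart. `pass_at_k` follows the unbiased estimator 1 - C(n-c, k) / C(n, k) as I read it in the linked file, including its early return of 1.0 when n - c < k. `pass_at_k_padded` is a hypothetical helper of my own (not part of human-eval) that pads the filtered samples with failures up to k, which is the assumption described above. Neither is meant as a statement of how the APPS numbers were actually computed.

```python
from math import prod


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of samples evaluated for one problem
    c: number of those samples that passed the unit tests
    """
    # Fewer than k failing samples: every size-k draw contains a pass.
    # Note that this branch is also taken whenever n < k.
    if n - c < k:
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))


def pass_at_k_padded(n: int, c: int, k: int) -> float:
    """Hypothetical variant: pad with failures so that at least k samples exist."""
    return pass_at_k(max(n, k), c, k)


# One filtered sample that fails the final unit tests, k = 5:
print(pass_at_k(1, 0, 5))         # 1.0, because n - c = 1 < k
print(pass_at_k_padded(1, 0, 5))  # 0.0, treated as 5 failures

# One filtered sample that passes the final unit tests, k = 5:
print(pass_at_k_padded(1, 1, 5))  # 1.0, treated as 4 failures and 1 pass
```

If the first function is applied directly to the filtered counts, a problem with a single failing filtered sample would score 1.0 at k = 5, which is why I am asking which convention was used.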