lm-evaluation-harness
Physics GRE task added
This PR adds the Physics GRE dataset released in the Public Inflection Benchmark by @InflectionAI. Please refer here for the details.
It resolves issue #1554.
Most welcome. Thanks for your prompt feedback.
- Yes, we have tested Mistral for this task. The sample result can be found here and sample output here.
- Indeed, the `..._maj1` is already here. It is called `score-first`, similar to the `gsm8k-cot-self-consistency` task.
- a. Inflection has not released the data preprocessing pipeline yet. b. It seems there should be one and only one correct answer in Physics GRE tests. Reference
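For reference, the difference between a `score-first` filter and self-consistency majority voting can be sketched as follows. This is a minimal standalone sketch; `score_first` and `majority_vote` are hypothetical helper names, not the harness's actual filter classes:

```python
from collections import Counter

def score_first(answers):
    """Take the first sampled answer only (the maj1 analogue)."""
    return answers[0]

def majority_vote(answers):
    """Self-consistency: take the most common answer across samples."""
    return Counter(answers).most_common(1)[0][0]

samples = ["C", "B", "C", "C", "A"]
print(score_first(samples))    # "C"
print(majority_vote(samples))  # "C"
```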
Thanks a lot for getting back.
- I am not sure how we can detect a model **attempting** to select more than one answer using the regex.
- The greedy generation is a great suggestion.
- It takes about 15 V100 hours to complete all the tasks for `Mixtral-8x7B-Instruct-v0.1` 4-bit. For `Mistral-7B-Instruct-v0.2`, it is 6.5 hours.
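One possible heuristic for the detection question above: restrict the regex to the final answer line and count the distinct option letters it names, flagging completions that name more than one. This is only a sketch under assumed A–E options, not the task's actual filter:

```python
import re

def extract_choices(text):
    """Collect all standalone option letters (A-E) on the final line
    of the completion. Hypothetical filter, for illustration only."""
    last_line = text.strip().splitlines()[-1]
    return re.findall(r"\b([A-E])\b", last_line)

def is_multi_answer(text):
    """Flag completions that appear to select more than one option."""
    return len(set(extract_choices(text))) > 1

print(is_multi_answer("The answer is C"))        # False
print(is_multi_answer("The answer is A, B, C"))  # True
```

The obvious caveat is that a chain-of-thought completion may legitimately mention several letters while reasoning, which is why the sketch only inspects the final line.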
@haileyschoelkopf what is

> I think the link you shared means that a model attempting to select more than one answer would be penalized, not that multiple answers can't both be counted.

based on? If you're referring to the line about the scoring function taking both correct and incorrect answers into account, you're misunderstanding. `correct - 0.25 * incorrect` is an adjustment common to standardized testing that makes random guessing score 0 in expectation.
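To spell out the arithmetic: with five options, a random guess is correct with probability 1/5, so the expected adjusted score is (1/5)·1 + (4/5)·(−0.25) = 0. A quick simulation confirms this (the `adjusted_score` helper is hypothetical, not harness code):

```python
import random

def adjusted_score(correct, incorrect):
    """GRE-style adjustment: +1 per correct answer, -0.25 per
    incorrect answer, so random guessing scores 0 in expectation."""
    return correct - 0.25 * incorrect

random.seed(0)
options = "ABCDE"
trials = 100_000
correct = sum(random.choice(options) == "C" for _ in range(trials))
score = adjusted_score(correct, trials - correct)
print(score / trials)  # close to 0
```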
To make things concrete: let the correct answer be `C`. Now the model may predict:

1. `A` (wrong)
2. `A, B, C` (wrong)
3. `C, A, B` (wrong)
4. `C` (correct)

The evaluator will judge 1 and 4 as expected. But even though 3 is wrong, the regex parsing/filtering will extract the answer as `C`. Thus the model will be judged correct even though it is incorrect. If this is an issue, it should be noted for all MCQ tasks.
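The failure mode in case 3 can be reproduced with a minimal first-match extraction (a hypothetical pattern for illustration; not necessarily the task's actual regex):

```python
import re

def first_match(text):
    """Take the first standalone option letter (A-E) in the answer,
    as a simple first-match MCQ filter would."""
    m = re.search(r"\b([A-E])\b", text)
    return m.group(1) if m else None

gold = "C"
for pred in ["A", "A, B, C", "C, A, B", "C"]:
    extracted = first_match(pred)
    print(pred, "->", extracted, "correct" if extracted == gold else "wrong")
```

Here `"C, A, B"` extracts to `"C"` and is marked correct even though the model selected three options, which is exactly the concern above.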
No, I am referring to the fact that this text from the link:

> Your score will be determined by the number of questions you answer correctly. Questions you answer incorrectly or for which you mark no answer or more than one answer are counted as incorrect. Nothing is subtracted from a score if you answer a question incorrectly. Therefore, to maximize your score it is better for you to guess at an answer than not to respond at all.

doesn't imply that, say, Question 5 might permit both an answer of solely "A" and an answer of solely "B" to be correct. So we should check whether any such questions are permitted by the test, since this link doesn't expressly confirm/deny this.
I don't think so. I think this is just saying that missing, malformed, and incorrect answers are all treated the same way. I don't read this as implying that some questions have multiple correct answers and that in such cases you should only answer with one of them.