
Physics GRE task added

Open ShayekhBinIslam opened this issue 10 months ago • 7 comments

This PR adds the Physics GRE dataset released in the Public Inflection Benchmark by @InflectionAI. Please refer here for the details.

It resolves issue #1554.

ShayekhBinIslam avatar Apr 01 '24 11:04 ShayekhBinIslam

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Apr 01 '24 11:04 CLAassistant

Most welcome. Thanks for your prompt feedback.

  1. Yes, we have tested Mistral for this task. The sample result can be found here and sample output here.
  2. Indeed, the ..._maj1 variant is already here. It is called score-first, similar to the gsm8k-cot-self-consistency task.
  3. a. Inflection has not released the data preprocessing pipeline yet. b. Each Physics GRE question appears to have one and only one correct answer. Reference

ShayekhBinIslam avatar Apr 02 '24 18:04 ShayekhBinIslam

Thanks a lot for getting back.

  1. I am not sure how we can use the regex to detect a model **attempting** to select more than one answer.
  2. The greedy generation is a great suggestion.
  3. It takes about 15 V100 hours to complete all the tasks for Mixtral-8x7B-Instruct-v0.1 4-bit. For Mistral-7B-Instruct-v0.2, it is 6.5 hours.

ShayekhBinIslam avatar Apr 14 '24 15:04 ShayekhBinIslam

@haileyschoelkopf what is

> I think the link you shared means that a model attempting to select more than one answer would be penalized, not that multiple answers can't both be counted.

based on? If you're referring to the line about the scoring function taking both correct and incorrect answers into account, you're misunderstanding. correct - 0.25 [incorrect] is an adjustment common to standardized testing that makes random guessing score 0 in expectation.
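
As a quick check of the "0 in expectation" claim, here is the arithmetic under two assumptions that are not stated in the thread: each question has five answer choices, and the guess is uniformly random. A correct guess earns 1 point and an incorrect one costs 0.25:

$$
\mathbb{E}[\text{score per question}] = \frac{1}{5}\cdot 1 + \frac{4}{5}\cdot(-0.25) = 0.2 - 0.2 = 0
$$

With a different number of choices, the 0.25 penalty would no longer cancel exactly.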

StellaAthena avatar Apr 18 '24 12:04 StellaAthena

To make things concrete: let the correct answer be C. Now the model may predict:

  1. A (wrong)
  2. A, B, C (wrong)
  3. C, A, B (wrong)
  4. C (correct)

The evaluator will judge cases 1 and 4 as expected.

But even though case 3 is wrong, the regex parsing/filtering will extract C as the answer. The model will therefore be judged correct even though its response is incorrect. If this is an issue, it applies to any MCQ task and should be noted for all of them.
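
A minimal sketch of the difference, in Python. The pattern `CHOICE` and the helpers `first_match` and `strict_single` are hypothetical names made up for illustration; they are not the filter this PR or lm-evaluation-harness actually uses, and the choice range A-E is an assumption.

```python
import re

# Hypothetical pattern: a standalone capital letter A-E.
# The task's real extraction regex may differ.
CHOICE = re.compile(r"\b([A-E])\b")


def first_match(response):
    """Return the first choice letter found, or None (first-match behaviour)."""
    match = CHOICE.search(response)
    return match.group(1) if match else None


def strict_single(response):
    """Return the choice letter only if exactly one distinct letter appears."""
    distinct = set(CHOICE.findall(response))
    return distinct.pop() if len(distinct) == 1 else None


gold = "C"
predictions = ["A", "A, B, C", "C, A, B", "C"]  # the four cases listed above

# First-match extraction accepts case 3 ("C, A, B") as correct.
first_match_verdicts = [first_match(p) == gold for p in predictions]

# The strict variant rejects any response naming more than one choice.
strict_verdicts = [strict_single(p) == gold for p in predictions]

print(first_match_verdicts)
print(strict_verdicts)
```

This only works on bare answer strings. In free-form generations a pattern this loose would also pick up ordinary words such as the article "A", so a real filter would need to anchor on the answer phrase first and then apply the multiple-choice check to what follows.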

ShayekhBinIslam avatar Apr 18 '24 12:04 ShayekhBinIslam

> based on? If you're referring to the line about the scoring function taking both correct and incorrect answers into account, you're misunderstanding. correct - 0.25 [incorrect] is an adjustment common to standardized testing that makes random guessing score 0 in expectation.

No, I am referring to the fact that this text from the link:

> Your score will be determined by the number of questions you answer correctly. Questions you answer incorrectly or for which you mark no answer or more than one answer are counted as incorrect. Nothing is subtracted from a score if you answer a question incorrectly. Therefore, to maximize your score it is better for you to guess at an answer than not to respond at all.

Doesn't imply that, say, Question 5 might permit both an answer of solely "A" and an answer of solely "B" to be correct. So we should check whether any such questions are permitted by the test, since this link doesn't expressly confirm or deny this.

haileyschoelkopf avatar Apr 18 '24 13:04 haileyschoelkopf

> Your score will be determined by the number of questions you answer correctly. Questions you answer incorrectly or for which you mark no answer or more than one answer are counted as incorrect. Nothing is subtracted from a score if you answer a question incorrectly. Therefore, to maximize your score it is better for you to guess at an answer than not to respond at all.
>
> Doesn't imply that, say, Question 5 might permit both an answer of solely "A" and an answer of solely "B" to be correct. So we should check whether any such questions are permitted by the test, since this link doesn't expressly confirm/deny this

I don't think so. I think this is just saying that missing, malformed, and incorrect answers are all treated the same way. I don't read this as implying that some questions have multiple correct answers and that in such cases you should only answer with one of them.

StellaAthena avatar Apr 18 '24 15:04 StellaAthena