MultiMedQA

Open tmabraham opened this issue 8 months ago • 3 comments

This PR implements the MultiMedQA suite of tasks:

  1. Adds the MedQA 4-options task
  2. Adds the MedMCQA task
  3. Adds a MultiMedQA group that includes the two tasks above, plus PubMedQA and six medical MMLU subtasks

Note that MultiMedQA technically also includes long-form answer tasks (LiveQA, MedicationQA, HealthSearchQA). However, these tasks are scored by expert human evaluation rather than an automatic metric, so they are left out here. Other papers that evaluate on MultiMedQA also focus only on the multiple-choice QA tasks.
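
For reference, here is a minimal sketch of how the group might be launched from the command line. It only assembles the argument list and does not run anything. The `lm_eval` entry point and the `--model`, `--model_args`, `--tasks`, and `--num_fewshot` flags are the harness CLI as I understand it; the group name `multimedqa`, the Hugging Face model IDs, and `tensor_parallel_size=4` are assumptions for illustration and are not fixed by this PR.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, backend="hf",
                          num_fewshot=0, tensor_parallel_size=None):
    """Assemble an lm_eval argument list (nothing is executed here)."""
    model_args = [f"pretrained={pretrained}"]
    if tensor_parallel_size is not None:
        # Assumed to be relevant only for the vLLM backend.
        model_args.append(f"tensor_parallel_size={tensor_parallel_size}")
    return [
        "lm_eval",
        "--model", backend,
        "--model_args", ",".join(model_args),
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
    ]


# Hypothetical model IDs and group name, for illustration only.
small = build_lm_eval_command("meta-llama/Llama-2-7b-hf", ["multimedqa"])
large = build_lm_eval_command("meta-llama/Llama-2-70b-hf", ["multimedqa"],
                              backend="vllm", tensor_parallel_size=4)

print(shlex.join(small))
print(shlex.join(large))
```

The printed strings can be pasted into a shell, or the lists can be handed to `subprocess.run` directly.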

Benchmark for Llama-2-7b:

|         Tasks          |Version|Filter|n-shot| Metric |Value |   |Stderr|
|------------------------|-------|------|-----:|--------|-----:|---|-----:|
|stem                    |N/A    |none  |     0|acc     |0.3803|±  |0.0944|
|                        |       |none  |     0|acc_norm|0.3382|±  |0.0001|
| - medmcqa              |Yaml   |none  |     0|acc     |0.3438|±  |0.0073|
|                        |       |none  |     0|acc_norm|0.3438|±  |0.0073|
| - medqa_4options       |Yaml   |none  |     0|acc     |0.3197|±  |0.0131|
|                        |       |none  |     0|acc_norm|0.3197|±  |0.0131|
| - anatomy              |Yaml   |none  |     0|acc     |0.4296|±  |0.0428|
| - clinical_knowledge   |Yaml   |none  |     0|acc     |0.4377|±  |0.0305|
| - college_biology      |Yaml   |none  |     0|acc     |0.4444|±  |0.0416|
| - college_medicine     |Yaml   |none  |     0|acc     |0.4277|±  |0.0377|
| - medical_genetics     |Yaml   |none  |     0|acc     |0.4700|±  |0.0502|
| - professional_medicine|Yaml   |none  |     0|acc     |0.4338|±  |0.0301|
| - pubmedqa             |Yaml   |none  |     0|acc     |0.7140|±  |0.0202|
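
As a sanity check on the Stderr column, the sketch below back-calculates the number of test questions implied by a reported accuracy and standard error. It assumes each question is scored 0 or 1 and that the reported value is the sample standard deviation (with the `n - 1` correction) divided by `sqrt(n)`; that formula is my assumption about how the numbers were produced, not something stated in this PR.

```python
import math


def binomial_stderr(acc, n):
    """Standard error of the mean of n 0/1 scores, using the n - 1 variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def implied_sample_size(acc, stderr):
    """Invert binomial_stderr to estimate n from a reported acc and stderr."""
    return acc * (1.0 - acc) / stderr ** 2 + 1.0


# (acc, stderr) pairs copied from the Llama-2-7b table above.
reported = {
    "medical_genetics": (0.4700, 0.0502),
    "pubmedqa": (0.7140, 0.0202),
}

for task, (acc, stderr) in reported.items():
    n = implied_sample_size(acc, stderr)
    print(f"{task}: implied n is about {n:.0f}")
```

Under that assumption, medical_genetics comes out at roughly 100 questions and pubmedqa at roughly 500, which explains why the smaller MMLU subtasks carry much wider error bars than medmcqa. The reported values are rounded to four decimals, so the estimates are only approximate.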

Using vLLM, we can easily run the benchmarks on larger models, like Llama-2-70b:

|         Tasks          |Version|Filter|n-shot| Metric |Value |   |Stderr|
|------------------------|-------|------|-----:|--------|-----:|---|-----:|
|stem                    |N/A    |none  |     0|acc     |0.5493|±  |0.0966|
|                        |       |none  |     0|acc_norm|0.4993|±  |0.0008|
| - medmcqa              |Yaml   |none  |     0|acc     |0.4808|±  |0.0077|
|                        |       |none  |     0|acc_norm|0.4808|±  |0.0077|
| - medqa_4options       |Yaml   |none  |     0|acc     |0.5601|±  |0.0139|
|                        |       |none  |     0|acc_norm|0.5601|±  |0.0139|
| - anatomy              |Yaml   |none  |     0|acc     |0.5556|±  |0.0429|
| - clinical_knowledge   |Yaml   |none  |     0|acc     |0.7019|±  |0.0282|
| - college_biology      |Yaml   |none  |     0|acc     |0.7847|±  |0.0344|
| - college_medicine     |Yaml   |none  |     0|acc     |0.6879|±  |0.0353|
| - medical_genetics     |Yaml   |none  |     0|acc     |0.7200|±  |0.0451|
| - professional_medicine|Yaml   |none  |     0|acc     |0.7684|±  |0.0256|
| - pubmedqa             |Yaml   |none  |     0|acc     |0.7440|±  |0.0195|
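
To put the two tables side by side, here is a rough sketch that turns each 7b vs. 70b gap into a z-score using the reported standard errors. It treats the two runs as independent samples, which is a simplifying assumption: both models answer the same questions, so a paired test would be more appropriate. Read the output as a rough guide only.

```python
import math


def z_score(acc_a, stderr_a, acc_b, stderr_b):
    """Difference in accuracy divided by the combined standard error."""
    return (acc_b - acc_a) / math.sqrt(stderr_a ** 2 + stderr_b ** 2)


# ((acc, stderr) for Llama-2-7b, (acc, stderr) for Llama-2-70b), from the tables above.
rows = {
    "medmcqa": ((0.3438, 0.0073), (0.4808, 0.0077)),
    "medqa_4options": ((0.3197, 0.0131), (0.5601, 0.0139)),
    "pubmedqa": ((0.7140, 0.0202), (0.7440, 0.0195)),
}

for task, ((acc_7b, se_7b), (acc_70b, se_70b)) in rows.items():
    z = z_score(acc_7b, se_7b, acc_70b, se_70b)
    print(f"{task}: delta = {acc_70b - acc_7b:+.4f}, z = {z:.2f}")
```

By this measure the medmcqa and medqa_4options gains are far outside the noise, while the pubmedqa gain is only about one combined standard error.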

(It would be nice if instead of saying stem it said multimedqa but I can't seem to figure that out... not a huge problem though, just an aesthetic issue)

Work done in collaboration with @jbdel and @katielink at @MedARC-AI.

tmabraham, Dec 22 '23 03:12

The `stem` label seems to be a bug. It's an alias for the mmlu_stem sub-group. Will make a fix. Functionally, though, it won't change the evals, and the results JSON should reflect the task name correctly.

lintangsutawika, Dec 22 '23 05:12

@lintangsutawika I didn't know the tasks have their own README... is there some sort of template or example of that?

tmabraham, Dec 23 '23 08:12

Yup, you can check here

https://github.com/EleutherAI/lm-evaluation-harness/blob/main/templates/new_yaml_task/README.md

lintangsutawika, Dec 23 '23 09:12

sorry for the delay, let me know if this README looks fine

tmabraham, Jan 11 '24 02:01

Looks great!

lintangsutawika, Jan 11 '24 03:01