lm-evaluation-harness MultiMedQA

MultiMedQA

Open tmabraham opened this issue 8 months ago • 3 comments

This PR implements the MultiMedQA suite of tasks:

- Adds the MedQA 4-options task
- Adds the MedMCQA task
- Adds a MultiMedQA group that includes the two tasks above, plus PubMedQA and a selection of MMLU tasks

Note that MultiMedQA also technically includes long-form answer tasks (LiveQA, MedicationQA, HealthSearchQA). However, these tasks are scored by expert human evaluation, so they are left out here. Other papers also focus only on the multiple-choice QA tasks when evaluating on MultiMedQA.

Benchmark for Llama-2-7b:

|         Tasks          |Version|Filter|n-shot| Metric |Value |   |Stderr|
|------------------------|-------|------|-----:|--------|-----:|---|-----:|
|stem                    |N/A    |none  |     0|acc     |0.3803|±  |0.0944|
|                        |       |none  |     0|acc_norm|0.3382|±  |0.0001|
| - medmcqa              |Yaml   |none  |     0|acc     |0.3438|±  |0.0073|
|                        |       |none  |     0|acc_norm|0.3438|±  |0.0073|
| - medqa_4options       |Yaml   |none  |     0|acc     |0.3197|±  |0.0131|
|                        |       |none  |     0|acc_norm|0.3197|±  |0.0131|
| - anatomy              |Yaml   |none  |     0|acc     |0.4296|±  |0.0428|
| - clinical_knowledge   |Yaml   |none  |     0|acc     |0.4377|±  |0.0305|
| - college_biology      |Yaml   |none  |     0|acc     |0.4444|±  |0.0416|
| - college_medicine     |Yaml   |none  |     0|acc     |0.4277|±  |0.0377|
| - medical_genetics     |Yaml   |none  |     0|acc     |0.4700|±  |0.0502|
| - professional_medicine|Yaml   |none  |     0|acc     |0.4338|±  |0.0301|
| - pubmedqa             |Yaml   |none  |     0|acc     |0.7140|±  |0.0202|
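
As a sanity check on the Stderr column, the per-task values are consistent with the standard error of a mean over per-question 0/1 scores. The sketch below is not the harness's own code, and the test-set sizes (100 questions for medical_genetics, 135 for anatomy) are assumptions that are not stated in this thread.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary (0/1) scores whose mean is `acc`.

    Uses the sample variance (n - 1 in the denominator).
    """
    sample_var = acc * (1.0 - acc) * n / (n - 1)
    return math.sqrt(sample_var / n)


# Test-set sizes below are assumed, not taken from the thread.
print(round(accuracy_stderr(0.4700, 100), 4))  # medical_genetics -> 0.0502
print(round(accuracy_stderr(0.4296, 135), 4))  # anatomy -> 0.0428
```

Under those assumed sizes, both values match the table rows for the 7b model. The group-level `stem` row aggregates several tasks and is not covered by this formula.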

Using vLLM, we can easily run the benchmarks on larger models, like Llama-2-70b:

|         Tasks          |Version|Filter|n-shot| Metric |Value |   |Stderr|
|------------------------|-------|------|-----:|--------|-----:|---|-----:|
|stem                    |N/A    |none  |     0|acc     |0.5493|±  |0.0966|
|                        |       |none  |     0|acc_norm|0.4993|±  |0.0008|
| - medmcqa              |Yaml   |none  |     0|acc     |0.4808|±  |0.0077|
|                        |       |none  |     0|acc_norm|0.4808|±  |0.0077|
| - medqa_4options       |Yaml   |none  |     0|acc     |0.5601|±  |0.0139|
|                        |       |none  |     0|acc_norm|0.5601|±  |0.0139|
| - anatomy              |Yaml   |none  |     0|acc     |0.5556|±  |0.0429|
| - clinical_knowledge   |Yaml   |none  |     0|acc     |0.7019|±  |0.0282|
| - college_biology      |Yaml   |none  |     0|acc     |0.7847|±  |0.0344|
| - college_medicine     |Yaml   |none  |     0|acc     |0.6879|±  |0.0353|
| - medical_genetics     |Yaml   |none  |     0|acc     |0.7200|±  |0.0451|
| - professional_medicine|Yaml   |none  |     0|acc     |0.7684|±  |0.0256|
| - pubmedqa             |Yaml   |none  |     0|acc     |0.7440|±  |0.0195|
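
The exact command behind the 70b run is not given in this thread. As a rough sketch, a vLLM-backed invocation of the harness CLI could be assembled as shown below. The Hugging Face model ID, the tensor-parallel size of 8, and the `build_lm_eval_command` helper are all illustrative assumptions, and the `multimedqa` group name is taken from the note that follows.

```python
import shlex
from typing import List


def build_lm_eval_command(
    pretrained: str,
    tasks: str,
    tensor_parallel_size: int = 1,
    num_fewshot: int = 0,
) -> List[str]:
    """Assemble an `lm_eval` command line that uses the vLLM backend.

    Hypothetical helper for illustration; it only builds the argument list
    and does not launch anything.
    """
    model_args = (
        f"pretrained={pretrained},"
        f"tensor_parallel_size={tensor_parallel_size}"
    )
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--tasks", tasks,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "auto",
    ]


# Model ID and GPU count are assumptions; adjust to the actual setup.
cmd = build_lm_eval_command(
    "meta-llama/Llama-2-70b-hf", "multimedqa", tensor_parallel_size=8
)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be passed to `subprocess.run` on a machine where the harness and vLLM are installed.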

(It would be nice if the group row said `multimedqa` instead of `stem`, but I can't seem to figure that out... not a huge problem though, just an aesthetic issue.)

Work done in collaboration with @jbdel and @katielink at @MedARC-AI.

Dec 22 '23 03:12 tmabraham

The `stem` label seems to be a bug: it's an alias for the `mmlu_stem` sub-group. I'll make a fix. Functionally, though, it won't change the evals, and the result JSON should reflect the task name correctly.

Dec 22 '23 05:12 lintangsutawika

@lintangsutawika I didn't know the tasks have their own README... is there some sort of template or example of that?

Dec 23 '23 08:12 tmabraham

Yup, you can check here

https://github.com/EleutherAI/lm-evaluation-harness/blob/main/templates/new_yaml_task/README.md

Dec 23 '23 09:12 lintangsutawika

Sorry for the delay, let me know if this README looks fine.

Jan 11 '24 02:01 tmabraham

Looks great!

Jan 11 '24 03:01 lintangsutawika
