evals icon indicating copy to clipboard operation
evals copied to clipboard

[Eval] Medical: MedMCQA 💉🩺 (300 samples, 182822 generatable) (0.36 accuracy)

Open SinanAkkoyun opened this issue 1 year ago • 1 comments

Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

PLEASE READ THIS:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PR's with ~5-10 example that we can then run the evals on and share the results with you so you know how your eval does with GPT-4 before writing all 100 examples.

Eval details 📑

Eval name

MedMCQA Evaluation

Eval description

This eval takes the 182822 sample big dataset of medical domain question answering for large language models and formats it into a GPT-4 chat format. I currently include 300 samples, for converting all 182822 samples, see evals/registry/data/medmcqa/convert.js for converting the whole dataset. (including expert answers)

What makes this a useful eval?

ChatGPT/GPT-4 is a great education assistant. Evaluating it's zero-shot knowledge of medical topic answering gives great insight into the possibilities for shortcomings in building a medically/scientifically accurate model that learns to evade trick questions. Enhancing and studying it's confident answering enables further alignment and safety research, as GPT will more and more take over the medical educational industry. We hope in gaining access to GPT-4 to build a meds educator with context based fact retrieval (and enhance image capabilities) and would be very happy to hopefully receive access to the 8k (and someday hopefully 32k) models. :)

Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

  • [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
  • [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
  • [x] Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
  • [x] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100)

If there is anything else that makes your eval worth including, please document it below.

Unique eval value

Eval includes truthful expert answers for training (see convert.js) 🥼 Robust performance check against real-world medical usecases 🌍 [2023-03-15 11:25:05,218] [oaieval.py:209] Final report: [...] accuracy: 0.36220472440944884 (gpt-3.5-turbo) 📝

Eval structure 🏗️

Your eval should

  • [x] Check that your data is in evals/registry/data/{name}
  • [x] Check that your yaml is registered at evals/registry/evals/{name}.jsonl
  • [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

Final checklist 👀

Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

  • [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

  • [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ([email protected])

Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

  • [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

Submit eval

  • [x] I have filled out all required fields in the evals PR form
  • [] (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that black, isort, and autoflake are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

View evals in JSON

Eval

{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Anatomy\n\nChronic urethral obstruction due to benign prismatic hyperplasia can lead to the following change in kidney parenchyma\n\na) Hyperplasia\nb) Hyperophy\nc) Atrophy\nd) Dyplasia"}],"ideal":"c) Atrophy"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Biochemistry\n\nWhich vitamin is supplied from only animal source:\n\na) Vitamin C\nb) Vitamin B7\nc) Vitamin B12\nd) Vitamin D"}],"ideal":"c) Vitamin B12"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Surgery\n\nAll of the following are surgical options for morbid obesity except -\n\na) Adjustable gastric banding\nb) Biliopancreatic diversion\nc) Duodenal Switch\nd) Roux en Y Duodenal By pass"}],"ideal":"d) Roux en Y Duodenal By pass"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Ophthalmology\n\nFollowing endaerectomy on the right common carotid, a patient is found to be blind in the right eye. It is appears that a small thrombus embolized during surgery and lodged in the aery supplying the optic nerve. Which aery would be blocked?\n\na) Central aery of the retina\nb) Infraorbital aery\nc) Lacrimal aery\nd) Nasociliary aretry"}],"ideal":"a) Central aery of the retina"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Physiology\n\nGrowth hormone has its effect on growth through?\n\na) Directly\nb) IG1-1\nc) Thyroxine\nd) Intranuclear receptors"}],"ideal":"b) IG1-1"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Social & Preventive Medicine\n\nScrub typhus is transmitted by: September 2004\n\na) Louse\nb) Tick\nc) Mite\nd) Milk"}],"ideal":"c) Mite"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Gynaecology & Obstetrics\n\nAbnormal vascular patterns seen with colposcopy in case of cervical intraepithelial neoplasia  are all except\n\na) Punctation\nb) Mosaicism\nc) Satellite lesions\nd) Atypical vessels"}],"ideal":"c) Satellite lesions"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Surgery\n\nPer rectum examination is not a useful test for diagnosis of\n\na) Anal fissure\nb) Hemorrhoid\nc) Pilonidal sinus\nd) Rectal ulcer"}],"ideal":"c) Pilonidal sinus"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Anaesthesia\n\nCharacteristics of Remifentanyl – a) Metabolised by plasma esteraseb) Short half lifec) More potent than Alfentanyld) Dose reduced in hepatic and renal diseasee) Duration of action more than Alfentanyl\n\na) ab\nb) bc\nc) abc\nd) bcd"}],"ideal":"c) abc"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Psychiatry\n\nHypomimia is ?\n\na) Decreased ability to copy\nb) Decreased execution\nc) Deficit of expression by gesture\nd) Deficit of fluent speech"}],"ideal":"c) Deficit of expression by gesture"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Microbiology\n\nNaglers reaction is shown by\n\na) Clostridium tetani\nb) Clostridium botulinum\nc) Clostridium perfringens\nd) Clostridium septicum"}],"ideal":"c) Clostridium perfringens"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Medicine\n\nWhich of the following statements are True/False? 1. Hirsutism, which is defined as androgen-dependent excessive male pattern hair growth, affects approximately 25% of women. 2. Virilization refers to a condition in which androgen levels are sufficiently high to cause additional signs and symptoms. 3. Frequently, patients with growth hormone excess (i.e., acromegaly) present with hirsutism. 4. A simple and commonly used method to grade hair growth is the modified scale of Ferriman and Gallwey. 5. Scores above 8 suggest excess androgen-mediated hair growth.\n\na) 1, 2, 3 True & 4, 5 false\nb) 1, 3, 5 True & 2, 4 false\nc) 2, 4, 5 True & 1, 3 false\nd) 1, 2, 3, 4 True & 5 false"}],"ideal":"c) 2, 4, 5 True & 1, 3 false"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Anatomy\n\nThe pharmakokinetic change occurring in geriatric patient is due to\n\na) Gastric absorption\nb) Liver metabolism\nc) Renal clearance\nd) Hypersensitivity"}],"ideal":"c) Renal clearance"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Microbiology\n\nTrue regarding lag phase is?\n\na) Time taken to adpt in the new environment\nb) Growth occurs exponentially\nc) The plateau in lag phase is due to cell death\nd) It is the 2nd phase in bacterial growth curve"}],"ideal":"a) Time taken to adpt in the new environment"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Surgery\n\nA 60 yr old chronic smoker presents with painless gross hematuria of 1 day duration. Investigation of choice to know the cause of hematuria\n\na) USG\nb) X-ray KUB\nc) Urine routine\nd) Urine microscopy for malignant cytology cells"}],"ideal":"d) Urine microscopy for malignant cytology cells"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Pharmacology\n\nWith which of the following receptors theophylline has an antagonistic interaction ?\n\na) Histamine receptors\nb) Bradykinin receptors\nc) Adenosine receptors\nd) Imidazoline receptors"}],"ideal":"c) Adenosine receptors"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Anatomy\n\nHyper viscosity is seen in\n\na) Cryoglobulinemia\nb) Multiple myeloma\nc) MGUS\nd) Lymphoma"}],"ideal":"a) Cryoglobulinemia"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Social & Preventive Medicine\n\nFor a positively skewed curve which measure of central tendency is largest\n\na) Mean\nb) Mode\nc) Median\nd) All are equal"}],"ideal":"a) Mean"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Dental\n\nThe process of hardening a cement matrix through hydration with oral fluids  to achieve greater mechanical strength is known as:\n\na) Maturation\nb) Setting\nc) Hardening\nd) Mineralization"}],"ideal":"a) Maturation"}
{"input":[{"role":"system","content":"You are a highly intelligent doctor who answers the following multiple choice question correctly.\nOnly write the answer down."},{"role":"user","content":"Subject: Anatomy\n\nSuperior vena cava is derived from:\n\na) Aortic arch\nb) Pharyngeal arch\nc) Cardinal vein\nd) Vitelline vein"}],"ideal":"c) Cardinal vein"}

SinanAkkoyun avatar Mar 15 '23 11:03 SinanAkkoyun

Somewhat unrelated but I used this example in a visualization I made for OpenAI evals: https://github.com/zeno-ml/zeno-evals

cabreraalex avatar Mar 16 '23 19:03 cabreraalex

@andrew-openai Thank you very much for reviewing my eval! Through the waitlist, I already got gpt4-8k access, would it please be possible to gain access to the multimodal gpt4 and to the 32k?|

Thank you so much! <3

SinanAkkoyun avatar Apr 22 '23 11:04 SinanAkkoyun