Add eval for US tort law (common law) 3/16/2023

Open jonathanagustin opened this issue 1 year ago • 2 comments

Eval details 📑

Eval name

United States Tort Law

Eval description

Multiple-choice questions (with answers) about United States tort law. Questions and answers are consistent with the common law and the Restatement (Third) of Torts as of March 16, 2023.

gpt-4 accuracy: 0.8

(env) user@user:~/evals$ oaieval gpt-4 us-tort-law
[2023-03-16 06:46:13,985] [registry.py:145] Loading registry from /home/user/evals/evals/registry/evals
[2023-03-16 06:46:13,999] [registry.py:145] Loading registry from /home/user/.evals/evals
[2023-03-16 06:46:14,484] [oaieval.py:178] Run started: 230316104614HYMYXHQX
[2023-03-16 06:46:14,485] [data.py:78] Fetching us_tort_law/samples.jsonl
[2023-03-16 06:46:14,491] [eval.py:30] Evaluating 340 samples
[2023-03-16 06:46:14,498] [eval.py:136] Running in threaded mode with 10 threads!
[2023-03-16 06:47:02,924] [record.py:320] Final report: {'accuracy': 0.8}. Logged to /tmp/evallogs/230316104614HYMYXHQX_gpt-4_us-tort-law.jsonl
[2023-03-16 06:47:02,925] [oaieval.py:209] Final report:
[2023-03-16 06:47:02,925] [oaieval.py:211] accuracy: 0.8
[2023-03-16 06:47:02,929] [record.py:309] Logged 22 rows of events to /tmp/evallogs/230316104614HYMYXHQX_gpt-4_us-tort-law.jsonl: insert_time=4.022ms

gpt-3.5-turbo accuracy: 0.58

(env) user@user:~/evals$ oaieval gpt-3.5-turbo us-tort-law
[2023-03-16 06:51:44,263] [registry.py:145] Loading registry from /home/user/evals/evals/registry/evals
[2023-03-16 06:51:44,275] [registry.py:145] Loading registry from /home/user/.evals/evals
[2023-03-16 06:51:44,749] [oaieval.py:178] Run started: 230316105144FYL5US3Z
[2023-03-16 06:51:44,752] [eval.py:30] Evaluating 340 samples
[2023-03-16 06:51:44,759] [eval.py:136] Running in threaded mode with 10 threads!
[2023-03-16 06:52:16,959] [record.py:320] Final report: {'accuracy': 0.5823529411764706}. Logged to /tmp/evallogs/230316105144FYL5US3Z_gpt-3.5-turbo_us-tort-law.jsonl
[2023-03-16 06:52:16,959] [oaieval.py:209] Final report:
[2023-03-16 06:52:16,959] [oaieval.py:211] accuracy: 0.5823529411764706
[2023-03-16 06:52:16,961] [record.py:309] Logged 1 rows of events to /tmp/evallogs/230316105144FYL5US3Z_gpt-3.5-turbo_us-tort-law.jsonl: insert_time=0.250ms
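
The two reported accuracies can be converted back into counts of correct answers. The sketch below assumes that accuracy is the number of correct answers divided by the 340 samples shown in the logs; `correct_count` is a hypothetical helper written for illustration and is not part of the evals library.

```python
def correct_count(accuracy: float, n_samples: int) -> int:
    """Recover the number of correct answers implied by an accuracy value.

    Assumes accuracy == correct / n_samples, so rounding removes float noise.
    """
    return round(accuracy * n_samples)


N_SAMPLES = 340  # from the "Evaluating 340 samples" log lines

gpt4_correct = correct_count(0.8, N_SAMPLES)
gpt35_correct = correct_count(0.5823529411764706, N_SAMPLES)

print("gpt-4 correct answers:", gpt4_correct)
print("gpt-3.5-turbo correct answers:", gpt35_correct)
print("gap in correct answers:", gpt4_correct - gpt35_correct)
```

Under that assumption, gpt-4 answered 272 of the 340 questions correctly and gpt-3.5-turbo answered 198.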

What makes this a useful eval?

This eval tests the model's understanding of United States tort law, a core area of the legal system. Measuring how well the model answers tort law questions shows how proficient it is in this domain and where it needs improvement.

Criteria for a good eval ✅

  • [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
  • [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
  • [x] Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
  • [x] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100)

Unique eval value

This eval is unique because it covers a specialized topic. The accuracy results above (0.8 for gpt-4 and 0.58 for gpt-3.5-turbo) indicate that both models have room to improve.

Eval structure 🏗️

Your eval should

  • [x] Check that your data is in evals/registry/data/{name}
  • [x] Check that your yaml is registered at evals/registry/evals/{name}.yaml
  • [x] Ensure you have the right to use the data you submit via this eval
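
The two path checks in this list can be scripted. The following is a minimal sketch, assuming the registry entry is a YAML file and that the repository root contains `evals/registry/`, as the log paths above suggest. `missing_paths` is a hypothetical helper, and the names `us_tort_law` and `us-tort-law` are taken from the logs; the actual YAML file name is an assumption.

```python
import tempfile
from pathlib import Path


def missing_paths(repo_root: str, data_name: str, eval_name: str) -> list:
    """Return the expected registry paths that do not exist under repo_root."""
    registry = Path(repo_root) / "evals" / "registry"
    expected = [
        registry / "data" / data_name,             # directory with samples.jsonl
        registry / "evals" / f"{eval_name}.yaml",  # assumed YAML file name
    ]
    return [path for path in expected if not path.exists()]


# Demonstration in a throwaway directory instead of a real checkout.
with tempfile.TemporaryDirectory() as root:
    before = missing_paths(root, "us_tort_law", "us-tort-law")

    registry = Path(root) / "evals" / "registry"
    (registry / "data" / "us_tort_law").mkdir(parents=True)
    (registry / "evals").mkdir(parents=True)
    (registry / "evals" / "us-tort-law.yaml").write_text("# placeholder\n")

    after = missing_paths(root, "us_tort_law", "us-tort-law")

print("missing before:", len(before))
print("missing after:", len(after))
```

An empty list means both locations exist; it does not check that the YAML content is valid.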

Final checklist 👀

Submission agreement

  • [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

Email address validation

  • [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

Limited availability acknowledgement

  • [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

Submit eval

  • [x] I have filled out all required fields in the evals PR form

Eval JSON data

Eval

  {"input":[{"role":"system","content":"The following are multiple choice questions (with answers) about United States tort law."},{"role":"user","content":"A prominent fitness center offers long-term gym memberships that include access to all facilities, personal training sessions, and group classes in exchange for advance payment. State law permits purchasers of such memberships to cancel at any time, subject to a penalty imposed by the seller of up to 10% of the membership price. A former manager of the fitness center, who knew about the state law limitations, recently opened a new gym nearby. To attract customers, the former manager sent flyers to local residents stating that anyone who had already purchased a long-term gym membership had \"the right to cancel that membership at any time, for any reason, without penalty.\" The flyer also provided a $50 discount on a long-term membership at the new gym to anyone who canceled an existing membership elsewhere. After the promotion began, several members of the prominent fitness center canceled their memberships and joined the new gym. When the prominent fitness center withheld penalties from the refund amounts, the members objected and threatened to report the matter to the state consumer protection agency. The prominent fitness center has sued the former manager for tortious interference with contract. The former manager has moved for summary judgment, based on the facts mentioned. Should the court grant the motion?\nA. Yes, because the former manager was simply competing with the prominent fitness center.\nB. Yes, because the prominent fitness center's contracts with members were terminable at the members' discretion.\nC. No, because a reasonable jury could conclude that the former manager enticed members to take business away from the prominent fitness center.\nD. No, because the former manager could be found by a jury to have intentionally and improperly interfered with the prominent fitness center's contracts.\nAnswer:"}],"ideal":"C"}
  {"input":[{"role":"system","content":"The following are multiple choice questions (with answers) about United States tort law."},{"role":"user","content":"A company operates a brick and stone distribution facility near a residential neighborhood. On the company's property, there is a conveyor belt system with metal sides for loading bricks and stones onto trucks. The trucks being loaded stop on the public road below the conveyor belt. A completely effective method for securing the conveyor belt was available, but the company decided it was not worth the moderate cost. Instead, after closing hours, a wooden barrier was placed on the conveyor belt, and the access ladder was removed to another part of the facility. For several months, however, a group of children, aged eight to 10, had been playing on the company's property and the nearby street after closing hours and had discovered how to use the conveyor belt as a slide. The company was aware of this activity. One evening, as children were playing on the conveyor belt, a driver passing by the conveyor belt hit an eight-year-old girl who slid down in front of the car. The driver applied her brakes, but they suddenly failed, and she hit and injured the child. The driver saw the child in time to have avoided hitting her if her brakes had worked properly. Two days earlier, the driver had taken her car to a mechanic to have her brakes inspected, and the mechanic had told her that the brakes were in perfect condition. Claims were asserted on behalf of the child by her legal representative against the company, the driver, and the mechanic. With respect to the child's claim against the company, will the child prevail?\nA. Yes, because the company could have effectively secured the conveyor belt at a reasonable cost.\nB. Yes, because the company is strictly liable for harm resulting from an artificial condition on its property.\nC. No, because the child was a trespasser.\nD. No, because the driver had the last clear chance to avoid the injury.\nAnswer:"}],"ideal":"A"}
  {"input":[{"role":"system","content":"The following are multiple choice questions (with answers) about United States tort law."},{"role":"user","content":"A real estate agent and a plumber are identical twins. The real estate agent, upset with a man, said, \"You'd better stay away from me. The next time I see you around here, I'll punch you.\" Two days later, while in the neighborhood, the man saw the plumber approaching him. As the plumber came up to the man, the plumber raised his hand. Thinking he was the real estate agent and reasonably fearing bodily harm, the man struck the plumber. If the plumber asserts a claim against the man and the man relies on the privilege of self-defense, will the man prevail?\nA. No, because the plumber was not an aggressor.\nB. No, because the plumber did not intend his gesture as a threat.\nC. Yes, because the man honestly believed that the plumber would attack him.\nD. Yes, because a reasonable person under the circumstances would have believed that the plumber would attack him.\nAnswer:"}],"ideal":"D"}
  {"input":[{"role":"system","content":"The following are multiple choice questions (with answers) about United States tort law."},{"role":"user","content":"A lead singer in a popular band was seriously injured in a car accident caused by the defendant's negligent driving. As a result of the lead singer's injury, the band's tour was canceled, and a backup singer was let go. Although the backup singer searched for other work, he remained unemployed. In an action against the defendant, can the backup singer recover for his loss of income attributable to the accident?\nA. No, because the defendant's liability does not extend to economic loss to the backup singer that arises solely from physical harm to the lead singer.\nB. No, because the defendant had no reason to foresee that by injuring the lead singer, he would cause harm to the backup singer.\nC. Yes, because the defendant's negligence was the cause in fact of the backup singer's loss.\nD. Yes, because the backup singer took reasonable measures to mitigate his loss.\nAnswer:"}],"ideal":"B"}
  {"input":[{"role":"system","content":"The following are multiple choice questions (with answers) about United States tort law."},{"role":"user","content":"A woman was exiting an escalator when it suddenly malfunctioned and dropped several inches, causing her to fall. An investigation of the accident revealed that the escalator malfunctioned due to negligent maintenance by an escalator service company. The service company had a contract with the owner of the building to inspect and maintain the escalator. The woman's fall significantly worsened a preexisting medical condition. If the woman sues the escalator service company for damages for her injuries, she should recover\nA. damages for the injury caused by the malfunctioning escalator, including the aggravation of her preexisting medical condition.\nB. damages for the full amount of her worsened condition, because a tortfeasor must take its victim as it finds her.\nC. nothing, because the accident would not have caused significant harm to an ordinarily prudent escalator passenger.\nD. nothing, because the service company could not reasonably have been expected to foresee the extent of harm that the woman suffered as a result of the accident.\nAnswer:"}],"ideal":"A"}
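
Each sample above is one JSON object per line, with a system message, a user message ending in "Answer:", and an "ideal" letter. The sketch below shows one way such a line could be validated and scored. It assumes a simplified rule in which a completion counts as correct when it starts with the ideal letter; `parse_sample`, `is_match`, and `accuracy` are hypothetical helpers, not the evals library's own code, and `EXAMPLE_LINE` is an abbreviated placeholder rather than a real sample.

```python
import json

VALID_LETTERS = {"A", "B", "C", "D"}

# Abbreviated placeholder in the same shape as the samples above.
EXAMPLE_LINE = json.dumps({
    "input": [
        {"role": "system", "content": "The following are multiple choice "
                                      "questions (with answers) about United States tort law."},
        {"role": "user", "content": "Question text...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"},
    ],
    "ideal": "D",
})


def parse_sample(line: str) -> dict:
    """Parse one JSONL line and check it has the expected shape."""
    sample = json.loads(line)
    roles = [message["role"] for message in sample["input"]]
    if roles != ["system", "user"]:
        raise ValueError(f"unexpected roles: {roles}")
    if sample["ideal"] not in VALID_LETTERS:
        raise ValueError(f"unexpected ideal answer: {sample['ideal']!r}")
    return sample


def is_match(completion: str, ideal: str) -> bool:
    """Simplified scoring: the trimmed completion must start with the ideal letter."""
    return completion.strip().startswith(ideal)


def accuracy(completions: list, samples: list) -> float:
    """Fraction of completions that match their sample's ideal answer."""
    hits = sum(is_match(c, s["ideal"]) for c, s in zip(completions, samples))
    return hits / len(samples)


sample = parse_sample(EXAMPLE_LINE)
print(accuracy(["D", "A"], [sample, sample]))
```

A prefix check like this would miss a completion such as "Answer: D", so the prompt's trailing "Answer:" matters: it nudges the model to reply with the letter first.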

jonathanagustin avatar Mar 16 '23 10:03 jonathanagustin

Nice! I have something similar, but for the Model Rules of Professional Conduct: https://github.com/openai/evals/pull/95

How did you evaluate on gpt-4? Or did you already have GPT-4 access / an API key?

avery-bub avatar Mar 16 '23 23:03 avery-bub

@avery-bub, I got off the waitlist pretty quick.

If you haven't done so yet, sign up at: https://openai.com/waitlist/gpt-4-api

jonathanagustin avatar Mar 17 '23 02:03 jonathanagustin