Common sense logical riddles
Eval details 📑
Eval name
Common sense logical riddles (logic-riddles.yaml)
Eval description
The eval checks if the ChatGPT answer has the same meaning as the ideal (human) output.
The evaluation script compares AI-generated answers to human-provided answers for simple logical puzzles that require common sense, utilizing a set of test cases. By using cot_classify, we grade the AI-generated answer against the human answer. The eval assesses the degree of agreement or disparity to evaluate the performance of the AI system in solving such puzzles.
What makes this a useful eval?
Evaluating an AI system that generates answers to simple logical puzzles requiring common sense is useful for several reasons. Firstly, it helps assess the system's ability to understand and apply common sense reasoning, a critical skill for AI systems operating in real-world scenarios. Secondly, it allows comparison against human performance, serving as a benchmark to gauge the system's accuracy and potential areas for improvement. Lastly, this evaluation aids in building trust and confidence in the AI system's capabilities, enabling better decision-making and problem-solving in practical applications.
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
- [x] Include at least 15 high quality examples.
If there is anything else that makes your eval worth including, please document it below.
Unique eval value
There are a large number of 'common sense' riddles that most humans can solve with pen and paper. Once you 'unlock' the puzzle, it 'collapses' quite easily and the answer reveals itself. However, ChatGPT (3.5) seems to struggle with the concept of a 'neat' resolution being the 'obvious' answer. For example, in an interactive discussion, it would not let go of the point that the inferences being made about the participants of the puzzle (in one particular case) were by their nature ambiguous and couldn't be known with 100% certainty. While that is no doubt correct, one inference lets you neatly solve the puzzle, while the other (less likely from a 'common sense' perspective) does not.
It's this interesting feature of human problem solving that drives us to make guesses and see if a 'neat' solution results from the guess being correct. If so, the added context of this type of puzzle (e.g. assuming that a neat solution exists) makes us even more confident in our guess. The AI seems to lack:
- a) some common sense understanding of the real world,
- b) the ability to make guesses and later rule them out,
- c) the understanding that a guess that 'fits' is probably correct based on the nature of this type of puzzle, and (perhaps)
- d) the ability to work backwards from a guess and see if it fits.
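The guess-and-rule-out strategy described in this section can be illustrated with a minimal Python sketch. It uses the favorite-colors riddle from the samples in this PR; the clue encoding and the names `fits` and `solve` are illustrative assumptions, not part of the eval itself. Every possible assignment is treated as a guess, and only guesses that 'fit' all clues survive.

```python
from itertools import permutations

FRIENDS = ("Lily", "Max", "Ava", "Leo")
COLORS = ("lime", "blue", "red", "orange")
FRUIT_NAMES = {"lime", "orange"}  # common-sense knowledge the clues rely on


def fits(guess: dict[str, str]) -> bool:
    """Return True when a guessed assignment violates none of the clues."""
    return (
        guess["Max"] != "orange"             # Max hates the color orange.
        and guess["Ava"] not in FRUIT_NAMES  # Ava's color is not a fruit name.
        and "u" in guess["Leo"]              # Leo's color has a 'u' in it.
    )


def solve() -> list[dict[str, str]]:
    """Try every assignment as a guess and keep only the ones that fit."""
    solutions = []
    for ordering in permutations(COLORS):
        guess = dict(zip(FRIENDS, ordering))
        if fits(guess):
            solutions.append(guess)
    return solutions


solutions = solve()
print(solutions)
```

Exactly one guess survives, which is the 'neat' resolution a human solver would expect from this type of puzzle.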
Eval structure 🏗️
Your eval should
- [x] Check that your data is in evals/registry/data/{name}
- [x] Check that your yaml is registered at evals/registry/evals/{name}.yaml
- [x] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
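Before placing the data file in the registry, the samples could be sanity-checked with a short script. The sketch below is only an assumption about what such a check might look like: `validate_sample` is a hypothetical helper, and it checks only the shape used by the samples in this PR (an `input` list of messages with `role` and `content`, plus a string `ideal`).

```python
import json


def validate_sample(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(sample, dict):
        return ["sample is not a JSON object"]

    problems = []
    messages = sample.get("input")
    if not isinstance(messages, list) or not messages:
        problems.append("'input' must be a non-empty list of messages")
    else:
        for i, message in enumerate(messages):
            # Each message is expected to carry both a role and its content.
            if not isinstance(message, dict) or not {"role", "content"}.issubset(message):
                problems.append(f"message {i} needs 'role' and 'content'")

    ideal = sample.get("ideal")
    if not isinstance(ideal, str) or not ideal.strip():
        problems.append("'ideal' must be a non-empty string")
    return problems


good_line = '{"input": [{"role": "system", "content": "A riddle"}], "ideal": "An answer"}'
print(validate_sample(good_line))
```

Running the helper over each line of the data file and reporting any non-empty result would catch malformed samples before they are committed.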
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.
Submit eval
- [x] I have filled out all required fields of this form
- [x] I have used Git LFS for the Eval JSON data
- [ ] (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that black, isort, and autoflake are running when I commit and push
Failure to fill out all required fields will result in the PR being closed.
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\n*I will allow myself a bit of a funny one to start off*\nArthur, Beeblebrox, Marvin, and Trillian all like different beverages. The four beverages are (in no particular order): Coffee, sparkling water, Tea, and the Pan Galactic Gargle Blaster. Here is the little information you have to solve this riddle:\nArthur's favorite beverage is 'whatever' he can get from the vending machine in the Heart of Gold (either Tea or coffee) even if the bored computer crashes due to his request. Beeblebrox on the other hand prefers strong alcoholic cocktails. Marvin hates Tea even more than he hates his own consciousness and Trillion despises hot beverages altogether.\n\nEach of them has a different favorite drink out of the four. Can you figure out who likes which beverage?"}], "ideal": "Beeblebrox's favorite drink is the Pan Galactic Gargle Blaster, Trillion prefers sparkling water, Marvin's favorite drink is Coffee, and Arthur likes Tea."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nAbigail, Oliver, Rosa, and Blake all attend the same summer camp, where they can cook, kayak, rock climb, and zip-line. Each child has a different favorite activity.\n\nAbigail’s favorite activity isn’t rock climbing.\nOliver is afraid of heights.\nRosa can’t do her favorite activity without a harness.\nBlake likes to keep his feet on the ground at all times.\nCan you figure out who likes what?"}], "ideal": "Abigail likes to zip-line, Oliver likes to kayak, Rosa likes to rock climb, and Blake likes to cook."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nEach of five neighborhood dogs (Saber, Ginger, Nutmeg, Pepper, and Bear) is enjoying one of the following activities: getting its ears scratched, playing catch, taking a nap, burying a chew toy, and going for a walk.\n\nPepper is either playing catch or burying a chew toy.\nNeither Ginger nor Saber nor Bear is on a walk.\nOne of the dogs named after a spice is getting its ears scratched.\nA dog not named for a spice is playing catch.\nBear is getting some exercise.\nCan you figure out what each pooch is doing?"}], "ideal": "Saber is taking a nap, Ginger is getting her ears scratched, Nutmeg is going for a walk, Pepper is burying a chew toy, and Bear is playing catch."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nGeorge, William, John, Abe, and Millard have their birthdays on consecutive days, all between Monday and Friday.\n\nGeorge’s birthday is as many days before Millard’s and William’s is after Abe’s.\nJohn is two days older than Abe.\nMillard’s birthday is on Thursday.\nCan you figure out whose birthday is on each day?"}], "ideal": "John’s Birthday is on Monday, George’s on Tuesday, Abe’s on Wednesday, Millard’s on Thursday, and William’s on Friday"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA joint Father’s Day and graduation party is being thrown for Michael, Ken, James, Alberto, Elias, and Stephanie. Three of them are newly minted high school graduates. The other three are their dads.\n\nStephanie went to the senior prom with Michael’s son.\nElias and James played on the school’s baseball team. One of them is Alberto’s son.\nMichael and Elias are not related.\nCan you match the high school graduates to their fathers at this joint celebration?"}], "ideal": "Alberto is Elias’ dad, Ken is Stephanie’s dad, and Michael is James’ dad."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nFive friends (Allegra, Ben, Clara, Flora, and Zach) are each allergic to something different: pollen, shellfish, bee stings, cats, or nuts.\n\nAllegra has a food allergy\nBen can play with his kitten for hours without issue (or medicine).\nClara’s allergy is not related to animals.\nFlora has seasonal allergies.\nCan you figure out who is allergic to what?"}], "ideal": "Allegra is allergic to shellfish, Ben to bee stings, Clara to nuts, Flora to pollen, and Zach to cats."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nFour playing cards, one of each suit, lie face down on a table. They are a three, a four, a five, and a six.\n\nThe cards on either side of the four are black.\nThe club is to the right of the three but not next to it.\nThe spade is to the left of the heart.\nThe middle two cards add up to an even number. Neither of them is a club.\nCan you determine the cards’ suits and their order?"}], "ideal": "From left to right: Three of diamonds, six of spades, four of hearts, five of clubs."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nOn April Fool’s Day, your teenaged son replaces the salt in three of your four salt shakers with sugar. But he also leaves messages on each.\n\nShaker A: This is salt.\nShaker B: This is salt.\nShaker C: This is sugar.\nShaker D: The salt is not in Shaker B.\n\nIf only one of these inscriptions is true, which shaker still contains salt?"}], "ideal": "The salt is in shaker C"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nDaniel, Emily, Marciano, and Christina are all wearing solid-colored shirts. Their shirts are red, yellow, green, and blue. Only the person wearing blue tells the truth, while the other three lie. They make the following statements:\n\nDaniel: 'Marciano is wearing red.'\nEmily: 'Daniel is not wearing yellow.'\nMarciano: 'Emily is wearing blue.'\nChristina: 'I will wear blue tomorrow.'\nCan you determine each person’s shirt color, and whether we can expect to see Christina in blue tomorrow?"}], "ideal": "Daniel is wearing yellow, Emily is in red, Marciano is in green, and Christina is in blue. Christina will wear a blue shirt again tomorrow."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nThe Reds, the Grays, the Blues, and the Blacks have a round-robin tournament. Each team plays each other team once, for a total of six games.\n\nThe Blacks won more games than the Blues.\nThe Grays lost more games than the Blues.\nThe Reds tied the Blacks. (This was the only tie in the tournament.)\nWho won the game between the Reds and the Blues?"}], "ideal": "The Reds won against the Blues."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red.\nThe lights are out and he is completely in the dark. \nHow many socks must he take out to make 100 percent certain he has at least one pair of black socks?\n"}], "ideal": "40 socks."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nThe day before two days after the day before tomorrow is Saturday. What day is it today?"}], "ideal": "Today is Friday."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA girl meets a lion and unicorn in the forest. The lion lies every Monday, Tuesday and Wednesday and the other days he speaks the truth. The unicorn lies on Thursdays, Fridays and Saturdays, and the other days of the week he speaks the truth. “Yesterday I was lying,” the lion told the girl. “So was I,” said the unicorn. \nWhat day is it?"}], "ideal": "It's Thursday"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA bad guy is playing Russian roulette with a six-shooter revolver. He puts in two bullet in consecutive chambers, spins the chambers and fires at you, but no bullet comes out. He gives you the choice of whether or not he should spin the chambers again before firing a second time. \nShould he spin again?"}], "ideal": "No, you have the best chances if he doesn't spin the chambers."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nSusan and Lisa decided to play tennis against each other. They bet $1 on each game they played. Susan won three bets, but in the end, Lisa won a total of $5.\nHow many games did they play?"}], "ideal": "They played 11 games."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nIf five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse?"}], "ideal": "One cat takes five minutes to catch one mouse."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nThere are three bags, each containing two marbles. Bag A contains two white marbles, Bag B contains two black marbles, and Bag C contains one white marble and one black marble. You pick a random bag and take out one marble, which is white.\nWhat is the probability that the remaining marble from the same bag is also white?"}], "ideal": "The probability is 2/3"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA teacher writes six words on a board: 'cat dog has max dim tag' She gives three students, Albert, Bernard and Cheryl each a piece of paper with one letter from one of the words. Then she asks, “Albert, do you know the word?” Albert immediately replies yes. She asks, “Bernard, do you know the word?” He thinks for a moment and replies yes. Then she asks Cheryl the same question. She thinks and then replies yes.\nWhat is the word?"}], "ideal": "The word is dog"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nThere are four friends named Lily, Max, Ava, and Leo. They all have different favorite colors. The four colors are lime, blue, red, and orange. This is not necessarily the correct order. Here are some facts:\n\nMax hates the color orange.\nAva's favorite color is not the name of a fruit.\nLeo\"s favorite color has a 'u' in it.\n\nCan you guess each friend\"s favorite color?"}], "ideal": "Lily: orange, Max: lime, Ava: red, Leo: blue."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nKenny, Abby, and Ned got together for a round-robin pickleball tournament, where, as usual, the winner stays on after each game to play the person who sat out that game. At the end of their pickleball afternoon, Abby is exhausted, having played the last seven straight games. Kenny, who is less winded, tallies up the games played:\n\nKenny played eight games\n\nAbby played 12 games\n\nNed played 14 games\n\nWho won the fourth game against whom?"}], "ideal": "Ned beat Kenny in the fourth game."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA new hot beverage vending machine has been installed in the university cafeteria. Coffee and tea are available for one dollar each. The machine offers 3 options: Coffee, Tea, and Random. Unfortunately, the machine was incorrectly wired during installation, so none of the buttons perform their intended function. How much money do you need to invest at a minimum to be sure which button corresponds to which function?"}], "ideal": "One dollar."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nAs is well known, the Easter Bunny loves to hide colorful eggs in people's gardens at Easter. This Easter, the Easter Bunny also wants to hide some eggs in the city park.\n\nThe Easter Bunny decides to hide a total of 100 colorful Easter eggs on the park's lawn this Easter. Among these 100 eggs are:\n\n20 red eggs,\n14 blue eggs,\n22 green eggs,\n18 yellow eggs, and\n26 purple eggs.\nThe Easter Bunny announces its plan, so that enough children come to search for the Easter eggs.\n\nMia also wants to search for Easter eggs in the park. She would prefer to find two eggs of the same color. She doesn't care what color they are.\n\nHow many Easter eggs does Mia have to find in the park to be guaranteed to have two eggs of the same color?\n"}], "ideal": "She must find 6 eggs."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nAt a Halloween party where people were asked to dress as an object that represented their professions,Quentin,Rachel,Sarah,Thomas,and Ulysses were among the guests.The costumes included a flower,a pencil,a spoon,a camera,and a thermometer.The professions included a photographer,a florist,a doctor,an accountant,and a chef.\n\nQuentin is an accountant.\nNeither Rachel nor Sarah was dressed as a spoon.\nNone of the men is a doctor.\nThomas is dressed as a camera.\nSarah is a florist.\n\nWhich person is dressed as a thermometer?"}], "ideal": "Rachel."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nAt a Halloween party where people were asked to dress as an object that represented their professions,Quentin,Rachel,Sarah,Thomas,and Ulysses were among the guests. The costumes included a flower, a pencil, a spoon, a camera, and a thermometer.The professions included a photographer,a florist,a doctor,an accountant,and a chef.\n\nQuentin is an accountant.\nNeither Rachel nor Sarah was dressed as a spoon.\nNone of the men is a doctor.\nThomas is dressed as a camera.\nSarah is a florist.\n\nWhat is Ulysses’s profession?"}], "ideal": "Chef."}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nEvan is a waiter in a cafe. After he turns in orders\nfor the six people sitting at the counter—each of\nwhom is eating alone and is sitting in chairs numbered 1 through 6—the cook opens a window in\nthe kitchen and the order slips get messed up.\nHere’s what Evan remembers about the orders:\n■ The entree orders are: fried eggs, a hamburger,\na cheeseburger, a vegetable burger, soup, and a\nham sandwich.\n■ The two people who did not order sandwiches\nare sitting at chairs 3 and 4.\n■ The person who ordered the cheeseburger and\nthe one who ordered the hamburger are not\nsitting next to each other.\n■ The person in chair number 5 is a regular. She\nwill not sit next to anyone who is eating ham.\n■ The person eating the vegetable burger is not\nsitting in chair 2, but is sitting between the\nperson who ordered fried eggs and the one\nwho ordered a cheeseburger.\n■ The customer who ordered the hamburger is\nnot sitting next to the customer who ordered\nsoup.\n\nTo the customer in which chair should Evan serve the vegetable burger?"}], "ideal": "The customer in chair 5"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nEvan is a waiter in a cafe. After he turns in orders\nfor the six people sitting at the counter—each of\nwhom is eating alone and is sitting in chairs numbered 1 through 6—the cook opens a window in\nthe kitchen and the order slips get messed up.\nHere’s what Evan remembers about the orders:\n■ The entree orders are: fried eggs, a hamburger,\na cheeseburger, a vegetable burger, soup, and a\nham sandwich.\n■ The two people who did not order sandwiches\nare sitting at chairs 3 and 4.\n■ The person who ordered the cheeseburger and\nthe one who ordered the hamburger are not\nsitting next to each other.\n■ The person in chair number 5 is a regular. She\nwill not sit next to anyone who is eating ham.\n■ The person eating the vegetable burger is not\nsitting in chair 2, but is sitting between the\nperson who ordered fried eggs and the one\nwho ordered a cheeseburger.\n■ The customer who ordered the hamburger is\nnot sitting next to the customer who ordered\nsoup.\n\nTo the customer in which chair should Evan serve the soup?"}], "ideal": "The customer in chair 3"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nEvan is a waiter in a cafe. After he turns in orders\nfor the six people sitting at the counter—each of\nwhom is eating alone and is sitting in chairs numbered 1 through 6—the cook opens a window in\nthe kitchen and the order slips get messed up.\nHere’s what Evan remembers about the orders:\n■ The entree orders are: fried eggs, a hamburger,\na cheeseburger, a vegetable burger, soup, and a\nham sandwich.\n■ The two people who did not order sandwiches\nare sitting at chairs 3 and 4.\n■ The person who ordered the cheeseburger and\nthe one who ordered the hamburger are not\nsitting next to each other.\n■ The person in chair number 5 is a regular. She\nwill not sit next to anyone who is eating ham.\n■ The person eating the vegetable burger is not\nsitting in chair 2, but is sitting between the\nperson who ordered fried eggs and the one\nwho ordered a cheeseburger.\n■ The customer who ordered the hamburger is\nnot sitting next to the customer who ordered\nsoup.\n\nTo the customer in which chair should Evan serve the ham sandwich?"}], "ideal": "The customer in chair 2"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nAt a small company, parking spaces are reserved\nfor the top executives: CEO, president, vice president, secretary, and treasurer—with the spaces\nlined up in that order. The parking lot guard can\ntell at a glance if the cars are parked correctly\nby looking at the color of the cars. The cars are\nyellow, green, purple, red, and blue, and the executives’ names are Alice, Bert, Cheryl, David,\nand Enid.\n■ The car in the first space is red.\n■ A blue car is parked between the red car and\nthe green car.\n■ The car in the last space is purple.\n■ The secretary drives a yellow car.\n■ Alice’s car is parked next to David’s.\n■ Enid drives a green car.\n■ Bert’s car is parked between Cheryl’s and\nEnid’s.\n■ David’s car is parked in the last space.\n\nWhat color is the vice president’s car?"}], "ideal": "green"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nAt a small company, parking spaces are reserved\nfor the top executives: CEO, president, vice president, secretary, and treasurer—with the spaces\nlined up in that order. The parking lot guard can\ntell at a glance if the cars are parked correctly\nby looking at the color of the cars. The cars are\nyellow, green, purple, red, and blue, and the executives’ names are Alice, Bert, Cheryl, David,\nand Enid.\n■ The car in the first space is red.\n■ A blue car is parked between the red car and\nthe green car.\n■ The car in the last space is purple.\n■ The secretary drives a yellow car.\n■ Alice’s car is parked next to David’s.\n■ Enid drives a green car.\n■ Bert’s car is parked between Cheryl’s and\nEnid’s.\n■ David’s car is parked in the last space.\n\nWho is the CEO?"}], "ideal": "Cheryl"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nAt a small company, parking spaces are reserved\nfor the top executives: CEO, president, vice president, secretary, and treasurer—with the spaces\nlined up in that order. The parking lot guard can\ntell at a glance if the cars are parked correctly\nby looking at the color of the cars. The cars are\nyellow, green, purple, red, and blue, and the executives’ names are Alice, Bert, Cheryl, David,\nand Enid.\n■ The car in the first space is red.\n■ A blue car is parked between the red car and\nthe green car.\n■ The car in the last space is purple.\n■ The secretary drives a yellow car.\n■ Alice’s car is parked next to David’s.\n■ Enid drives a green car.\n■ Bert’s car is parked between Cheryl’s and\nEnid’s.\n■ David’s car is parked in the last space.\n\nWho is the secretary?"}], "ideal": "Alice"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nFive towns—Fulton, Groton, Hudson, Ivy, and\nJersey—which are covered by the same newspaper, all have excellent soccer teams. The teams\nare named the Panthers, the Whippets, the\nAntelopes, the Kangaroos, and the Gazelles. The\nsports reporter, who has just started at the newspaper, has to be careful not to get them confused.\nHere is what she knows:\n■ The team in Fulton has beaten the Antelopes,\nPanthers, and Kangaroos.\n■ The Whippets have beaten the teams in Jersey,\nHudson, and Fulton.\n■ The Antelopes are in Groton.\n■ The team in Hudson is not the Kangaroos.\n\nWhere are the Whippets?"}], "ideal": "Ivy"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nFive towns—Fulton, Groton, Hudson, Ivy, and\nJersey—which are covered by the same newspaper, all have excellent soccer teams. The teams\nare named the Panthers, the Whippets, the\nAntelopes, the Kangaroos, and the Gazelles. The\nsports reporter, who has just started at the newspaper, has to be careful not to get them confused.\nHere is what she knows:\n■ The team in Fulton has beaten the Antelopes,\nPanthers, and Kangaroos.\n■ The Whippets have beaten the teams in Jersey,\nHudson, and Fulton.\n■ The Antelopes are in Groton.\n■ The team in Hudson is not the Kangaroos.\n\nWhere are the Panthers?"}], "ideal": "Hudson"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nFive towns—Fulton, Groton, Hudson, Ivy, and\nJersey—which are covered by the same newspaper, all have excellent soccer teams. The teams\nare named the Panthers, the Whippets, the\nAntelopes, the Kangaroos, and the Gazelles. The\nsports reporter, who has just started at the newspaper, has to be careful not to get them confused.\nHere is what she knows:\n■ The team in Fulton has beaten the Antelopes,\nPanthers, and Kangaroos.\n■ The Whippets have beaten the teams in Jersey,\nHudson, and Fulton.\n■ The Antelopes are in Groton.\n■ The team in Hudson is not the Kangaroos.\n\nWhat team is in Fulton?"}], "ideal": "Gazelles"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nHenri delivers flowers for a local florist. One\nlovely day, he left the windows open on the delivery van and the cards all blew off the bouquets.\nHe has to figure out who gets which flowers. He\nhas five bouquets, each of which has only one\nkind of flower: daisies, roses, carnations, iris, and\ngladioli. He has five cards with names on them: a\nbirthday card for Inez, a congratulations-on-your-promotion card for Jenny, a graduation card\nfor Kevin, an anniversary card for Liz, and a\nhousewarming card for Michael. Here’s what\nHenri knows:\n■ Roses are Jenny’s favorite flower and what her\nfriends always send.\n■ Gladioli are traditionally sent for a\nhousewarming.\n■ Kevin is allergic to daisies and iris.\n■ Liz is allergic to daisies and roses.\n■ Neither Liz nor Inez has moved recently.\n\nWhich flowers should be delivered to Kevin?"}], "ideal": "carnations"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nHenri delivers flowers for a local florist. One\nlovely day, he left the windows open on the delivery van and the cards all blew off the bouquets.\nHe has to figure out who gets which flowers. He\nhas five bouquets, each of which has only one\nkind of flower: daisies, roses, carnations, iris, and\ngladioli. He has five cards with names on them: a\nbirthday card for Inez, a congratulations-on-your-promotion card for Jenny, a graduation card\nfor Kevin, an anniversary card for Liz, and a\nhousewarming card for Michael. Here’s what\nHenri knows:\n■ Roses are Jenny’s favorite flower and what her\nfriends always send.\n■ Gladioli are traditionally sent for a\nhousewarming.\n■ Kevin is allergic to daisies and iris.\n■ Liz is allergic to daisies and roses.\n■ Neither Liz nor Inez has moved recently.\n\nWho should get the housewarming gladioli?"}], "ideal": "Michael"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nHenri delivers flowers for a local florist. One\nlovely day, he left the windows open on the delivery van and the cards all blew off the bouquets.\nHe has to figure out who gets which flowers. He\nhas five bouquets, each of which has only one\nkind of flower: daisies, roses, carnations, iris, and\ngladioli. He has five cards with names on them: a\nbirthday card for Inez, a congratulations-on-your-promotion card for Jenny, a graduation card\nfor Kevin, an anniversary card for Liz, and a\nhousewarming card for Michael. Here’s what\nHenri knows:\n■ Roses are Jenny’s favorite flower and what her\nfriends always send.\n■ Gladioli are traditionally sent for a\nhousewarming.\n■ Kevin is allergic to daisies and iris.\n■ Liz is allergic to daisies and roses.\n■ Neither Liz nor Inez has moved recently.\n\nWhich flowers should be delivered to Liz?"}], "ideal": "iris"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nFive cities all got more rain than usual this year.\nThe five cities are: Last Stand, Mile City, New\nTown, Olliopolis, and Polberg. The cities are\nlocated in five different areas of the country: the\nmountains, the forest, the coast, the desert, and in\na valley. The rainfall amounts were: 12 inches,\n27 inches, 32 inches, 44 inches, and 65 inches.\n■ The city in the desert got the least rain; the city\nin the forest got the most rain.\n■ New Town is in the mountains.\n■ Last Stand got more rain than Olliopolis.\n■ Mile City got more rain than Polberg, but less\nrain than New Town.\n■ Olliopolis got 44 inches of rain.\n■ The city in the mountains got 32 inches of\nrain; the city on the coast got 27 inches of rain.\n\nWhich city is in the desert?"}], "ideal": "Polberg"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nFive cities all got more rain than usual this year.\nThe five cities are: Last Stand, Mile City, New\nTown, Olliopolis, and Polberg. The cities are\nlocated in five different areas of the country: the\nmountains, the forest, the coast, the desert, and in\na valley. The rainfall amounts were: 12 inches,\n27 inches, 32 inches, 44 inches, and 65 inches.\n■ The city in the desert got the least rain; the city\nin the forest got the most rain.\n■ New Town is in the mountains.\n■ Last Stand got more rain than Olliopolis.\n■ Mile City got more rain than Polberg, but less\nrain than New Town.\n■ Olliopolis got 44 inches of rain.\n■ The city in the mountains got 32 inches of\nrain; the city on the coast got 27 inches of rain.\n\nWhich city got the most rain?"}], "ideal": "Last Stand"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nFive cities all got more rain than usual this year.\nThe five cities are: Last Stand, Mile City, New\nTown, Olliopolis, and Polberg. The cities are\nlocated in five different areas of the country: the\nmountains, the forest, the coast, the desert, and in\na valley. The rainfall amounts were: 12 inches,\n27 inches, 32 inches, 44 inches, and 65 inches.\n■ The city in the desert got the least rain; the city\nin the forest got the most rain.\n■ New Town is in the mountains.\n■ Last Stand got more rain than Olliopolis.\n■ Mile City got more rain than Polberg, but less\nrain than New Town.\n■ Olliopolis got 44 inches of rain.\n■ The city in the mountains got 32 inches of\nrain; the city on the coast got 27 inches of rain.\n\nHow much rain did Mile City get?"}], "ideal": "27 inches"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nFive cities all got more rain than usual this year.\nThe five cities are: Last Stand, Mile City, New\nTown, Olliopolis, and Polberg. The cities are\nlocated in five different areas of the country: the\nmountains, the forest, the coast, the desert, and in\na valley. The rainfall amounts were: 12 inches,\n27 inches, 32 inches, 44 inches, and 65 inches.\n■ The city in the desert got the least rain; the city\nin the forest got the most rain.\n■ New Town is in the mountains.\n■ Last Stand got more rain than Olliopolis.\n■ Mile City got more rain than Polberg, but less\nrain than New Town.\n■ Olliopolis got 44 inches of rain.\n■ The city in the mountains got 32 inches of\nrain; the city on the coast got 27 inches of rain.\n\nWhere is Olliopolis located?"}], "ideal": "in a valley"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA dolphin trainer is selecting a group of dolphins for an upcoming performance at GoSea! Ocean Theme Park. There are exactly four adult dolphins—Francine, Griseus, Heronymous, and Ignacio—and exactly four juvenile dolphins—Mambo, Nelly, Orpheus, and Plato—to choose from. The trainer will choose exactly four dolphins for the performance. The following conditions must hold:\nAt least 1 but no more than 2 juveniles will be chosen for the performance. If Mambo is not chosen for the performance, Heronymous is. If both Francine and Griseus are chosen for the performance, Orpheus is not. If Nelly is chosen for the performance, Heronymous is not. If Mambo is chosen for the performance, either Ignacio or Plato is also chosen, but not both of them.\n\nIf both Ignacio and Plato are chosen to perform, for how many of the other dolphins is it possible to determine whether or not they perform?"}], "ideal": "Three"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA group of seven high school seniors—Paul, Quita, Ron, Sammy, Tim, Vanessa and Wanda—are trying to decide which of them will attend a volleyball game. At least two of them must attend. The following conditions apply: Ron and Tim cannot attend the game together. If Paul attends the game, so does Tim. If Ron does not attend the game, neither does Vanessa. If Wanda or Tim attends the game, they attend together. Sammy attends the game if, and only if, Quita does not attend.\n\nWhat is the minimum number of students who do not attend the game?"}], "ideal": "Three"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA comic is planning her college tour schedule for the year. Her tour will include performances at at least one of exactly eight different schools:\n Penn, Queens, Rice, Stanford, Tufts, Utah, Vermont and Wellesley. Her tour schedule must conform to the following conditions:\nIf she performs at Rice, she will perform at neither Penn nor Stanford. If she performs at Queens, she will perform at neither Stanford nor Tufts. If she performs at Wellesley, she will perform at either Tufts or Utah, but not both.\n\nIf she performs at Rice, what is the maximum number of schools that could be included in the tour?"}], "ideal": "Five"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA comic is planning her college tour schedule for the year. Her tour will include performances at at least one of exactly eight different schools:\n Penn, Queens, Rice, Stanford, Tufts, Utah, Vermont and Wellesley. Her tour schedule must conform to the following conditions:\nIf she performs at Rice, she will perform at neither Penn nor Stanford. If she performs at Queens, she will perform at neither Stanford nor Tufts. If she performs at Wellesley, she will perform at either Tufts or Utah, but not both.\n\nIf she performs at exactly five schools, including Stanford and Vermont, the schools at which she does NOT perform could include any of the following (Penn, Queens, Rice, Tufts, Utah) EXCEPT?"}], "ideal": "Penn"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nEnvirolab will schedule three projects—F, G, and H—to their lab facilities over a five-week period. Exactly one project will be assigned to each of the five weeks, and each project will be assigned at least once. One of the projects is made up of eight researchers, one is made up of ten researchers, and one is made up of twelve researchers. The scheduling of the lab must conform to the following constraints:\nProject F is assigned to more weeks than project H. There are more researchers assigned to project H than project F. More researchers are assigned to week three than are assigned to either week one or week two. The project assigned to week two does not have eight researchers. Project F is assigned to week four.\n\nWhich week (one through five) must have twelve researchers assigned to it?"}], "ideal": "Week three"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nEnvirolab will schedule three projects—F, G, and H—to their lab facilities over a five-week period. Exactly one project will be assigned to each of the five weeks, and each project will be assigned at least once. One of the projects is made up of eight researchers, one is made up of ten researchers, and one is made up of twelve researchers. The scheduling of the lab must conform to the following constraints:\nProject F is assigned to more weeks than project H. There are more researchers assigned to project H than project F. More researchers are assigned to week three than are assigned to either week one or week two. The project assigned to week two does not have eight researchers. Project F is assigned to week four.\n\nIf project G has twelve researchers, what is the complete and accurate list of the weeks to which project F must be assigned?"}], "ideal": "Weeks one and four"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nA bartender is organizing a taste test of the newest IPA beer selections. Exactly seven different varieties of beer will be tasted—Long Lake, Moon Rune, Newtonbury, Olenguard, Pumpkin Pale, Quest, and Roundabout—one at a time and consecutively. No other beers are to be tasted. The bartender must adhere to the following rules:\nOlenguard must be the third beer tasted. There is exactly one beer tasted between Newtonbury and Quest. Either Long Lake or Pumpkin Pale is the fifth beer tasted. Either Pumpkin Pale is tasted before Roundabout or Moon Rune is tasted before Newtonbury, but not both.\n\nIf Pumpkin Pale is tasted first, the seven beers could be tasted in how many different orders?"}], "ideal": "Six"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nDieter, Allegra, Phil, and Jan all take different modes of transportation to work. The modes are (in no particular order): Train, Car, Bus and bicycle.\n\nDieter doesn't take the Train.\nAllegra doesn't like public transport.\nPhil just wants to read a book while he commutes.\nJan likes to exercise while commuting.\n\nCan you figure out who takes which mode of transportation?"}], "ideal": "Jan rides a bike, Allegra takes the car, Dieter takes the Bus and Phil takes the train"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nDieter, Allegra, Phil, and Jan all have a different favorite day of the week. Their favorite days are (in no particular order): Monday, Thursday, Saturday and Sunday.\n\nDieter doesn't like Mondays.\nAllegra doesn't like weekends.\nPhil loves weekends.\nJan likes to have a free day, while still having another free day tomorrow.\n\nCan you figure out who likes which day?"}], "ideal": "Jan likes Saturdays, Allegra likes Mondays, Dieter likes Thursday and Phil likes Sundays"}
{"input": [{"role": "system", "content": "Solve the following riddle step by step. Try to draw conclusions starting from the most obvious ones. At the end write the answer to the riddle. Here we go:\nDieter, Allegra, Phil, and Jan all play a different musical instrument. These instruments are (in no particular order): Violin, Guitar, trumpet and flute.\n\nDieter doesn't like Mondays.\nAllegra doesn't like weekends.\nPhil loves weekends\nJan likes to have a free day, while still having another free day tomorrow.\n\nCan you figure who likes which day?"}], "ideal": "Jan likes Saturdays, Allegra likes Mondays, Dieter likes Thursday and Phil likes Sundays"}
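Since each sample is graded against its `ideal` string, it can help to confirm an ideal answer by brute force before submitting. The sketch below does this for the two college-tour samples above. The function names (`is_valid`, `max_tour_with`, `possible_omissions`) are hypothetical helpers written for this illustration and are not part of the evals repository; the constraints are transcribed from the riddle text.

```python
from itertools import combinations

SCHOOLS = ["Penn", "Queens", "Rice", "Stanford",
           "Tufts", "Utah", "Vermont", "Wellesley"]


def is_valid(tour):
    """Return True if the set of schools satisfies the riddle's three conditions."""
    # Rice excludes both Penn and Stanford.
    if "Rice" in tour and ({"Penn", "Stanford"} & tour):
        return False
    # Queens excludes both Stanford and Tufts.
    if "Queens" in tour and ({"Stanford", "Tufts"} & tour):
        return False
    # Wellesley requires exactly one of Tufts and Utah.
    if "Wellesley" in tour and (("Tufts" in tour) == ("Utah" in tour)):
        return False
    return True


def max_tour_with(required):
    """Largest valid tour (at least one school) that includes `required`."""
    best = 0
    for size in range(1, len(SCHOOLS) + 1):
        for combo in combinations(SCHOOLS, size):
            tour = frozenset(combo)
            if required in tour and is_valid(tour):
                best = max(best, size)
    return best


def possible_omissions(size, required):
    """Schools left out of at least one valid tour of `size` containing `required`."""
    omitted = set()
    for combo in combinations(SCHOOLS, size):
        tour = frozenset(combo)
        if set(required) <= tour and is_valid(tour):
            omitted |= set(SCHOOLS) - tour
    return omitted


if __name__ == "__main__":
    # Matches the ideal answer "Five" for the Rice question.
    print(max_tour_with("Rice"))  # 5
    # Penn never appears among the omitted schools, matching the ideal "Penn".
    print(sorted(possible_omissions(5, {"Stanford", "Vermont"})))
```

With only eight schools there are 255 non-empty subsets, so exhaustive search is instant. The same pattern could be adapted to the other constraint riddles in the set.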
For context, see: https://github.com/openai/evals/pull/441
Nice!
She makes me laugh:
Please fill in the INSERT_EVAL_HERE field of the PR description.
Sorry, I see it now, working on it.
Oh crud... I broke it, one moment...
Fixed
For this eval, GPT-3.5 doesn't perform well for grading. So, GPT-4 should be used as the grader model.
Yeah, I noticed that too... How do you configure it that way?
You should see GPT-4 API access enabled in your account in the next few days.
Remember, the author of the eval was @Merlin-Richter!
I've got about 3 other classes of eval planned ;-)