evals
evals copied to clipboard
Add 2 backgammon evals
Thank you for contributing an eval! ♥️
🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨
PLEASE READ THIS:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PR's with ~5-10 example that we can then run the evals on and share the results with you so you know how your eval does with GPT-4 before writing all 100 examples.
Eval details 📑
Eval name
- backgammon-can-hit
- backgammon-illegal-move
Eval description
This evaluation consists of two different backgammon evaluations: one is aiming for GPT to recognise if the opponent's checker can be hit given the (generated) board state and a dice roll (need to "interpret" the dice roll given the board setup). The other one checks if GPT is able to recognise if a given play (needs to understand the backgammon notation on top) is an illegal play - illegal being one of the suggested moves is "blocked" by multiple of the opponent's checkers on the point in question. Both tasks are setup sp that a True/False answer is expected. The code to generate the various board states and the messages generation can be found here: https://github.com/bakebrain/evals_backgammon. The amount is rather arbitrary: the "can hit case" currently consists of 600 examples (balanced) and the "is illegal play" case is 1000 examples - also balanced.
What makes this a useful eval?
I have seen discussions around the web where the inability to play games has been discussed, e.g. chess or other games. I was curious about the backgammon case and it turns out that GPTx is also struggling with this very much. I tried to generate rather "simple" positions (for more on this check https://github.com/bakebrain/evals_backgammon/blob/main/generate_openai_eval_samples.ipynb) but as I said before: it seems to only perform just a little better than random.
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for
Basic
evals or theFact
Model-graded eval, or an exhaustive rubric for evaluating answers for theCriteria
Model-graded eval. - [x] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100)
If there is anything else that makes your eval worth including, please document it below.
Unique eval value
I think this is interesting since this eval can be extended changed in so many ways - and also the fact that GPT seems to struggle so hard with this type of "reasoning" in combination with it's "backgammon knowledge".
Eval structure 🏗️
Your eval should
- [x] Check that your data is in
evals/registry/data/{name}
- [x] Check that your yaml is registered at
evals/registry/evals/{name}.yaml
- [x] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.
Submit eval
- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run
pip install pre-commit; pre-commit install
and have verified thatblack
,isort
, andautoflake
are running when I commit and push
Failure to fill out all required fields will result in the PR being closed.
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval - can we hit?
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is 0GfDAQPg84UBMA and the match id is cAgRAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 2 checkers on the 13 point, 1 checker on the 9 point, 5 checkers on the 8 point, 5 checkers on the 6 point. The opponent has 1 checker on the 20 point, 5 checkers on the 19 point, 2 checkers on the 17 point, 2 checkers on the 16 point, 3 checkers on the 12 point, 2 checkers on the 5 point. backgammonGPT is rolling a 2 and a 4. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is Lh7wATA8nsEBMA and the match id is cAgVAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 3 checkers on the 13 point, 2 checkers on the 8 point, 4 checkers on the 6 point, 4 checkers on the 3 point. The opponent has 3 checkers on the 23 point, 1 checker on the 22 point, 4 checkers on the 19 point, 5 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 2 and a 5. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is ovPBATCMz+ABMA and the match id is cAgRAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 4 checkers on the 13 point, 2 checkers on the 8 point, 5 checkers on the 6 point, 2 checkers on the 3 point. The opponent has 1 checker on the 23 point, 1 checker on the 20 point, 3 checkers on the 19 point, 5 checkers on the 17 point, 3 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 2 and a 4. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is HF/gATA25uABMA and the match id is cIgNAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 4 checkers on the 13 point, 3 checkers on the 8 point, 2 checkers on the 6 point, 2 checkers on the 3 point, 2 checkers on the 2 point. The opponent has 3 checkers on the 22 point, 5 checkers on the 19 point, 1 checker on the 18 point, 4 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 3 and a 3. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is jp3BATA4P+ABMA and the match id is cIgJAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 4 checkers on the 13 point, 6 checkers on the 6 point, 3 checkers on the 4 point. The opponent has 3 checkers on the 23 point, 2 checkers on the 20 point, 3 checkers on the 19 point, 2 checkers on the 17 point, 3 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 3 and a 2. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "False"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is D+PgATDg5+ANAA and the match id is cAgWAAAAAAAA. backgammonGPT has 2 checkers on the 14 point, 4 checkers on the 13 point, 3 checkers on the 8 point, 6 checkers on the 6 point. The opponent has 4 checkers on the 24 point, 2 checkers on the 20 point, 3 checkers on the 17 point, 4 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 4 and a 5. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "False"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is yOeDATDsnIMBMA and the match id is cIgVAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 2 checkers on the 13 point, 3 checkers on the 8 point, 3 checkers on the 6 point, 3 checkers on the 4 point, 2 checkers on the 3 point. The opponent has 1 checker on the 21 point, 5 checkers on the 19 point, 5 checkers on the 17 point, 2 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 3 and a 5. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is Jh/wATBw84MBMA and the match id is cAgFAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 2 checkers on the 13 point, 6 checkers on the 8 point, 2 checkers on the 6 point, 3 checkers on the 5 point. The opponent has 2 checkers on the 23 point, 1 checker on the 21 point, 5 checkers on the 19 point, 5 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 2 and a 1. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is mOfgATDYPuABMA and the match id is cIgQAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 4 checkers on the 13 point, 5 checkers on the 6 point, 2 checkers on the 5 point, 2 checkers on the 4 point. The opponent has 2 checkers on the 21 point, 4 checkers on the 19 point, 3 checkers on the 17 point, 4 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 1 and a 4. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "False"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is cC/wATDg18EBMA and the match id is cIgSAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 3 checkers on the 13 point, 3 checkers on the 8 point, 1 checker on the 7 point, 6 checkers on the 6 point. The opponent has 3 checkers on the 20 point, 4 checkers on the 19 point, 1 checker on the 18 point, 5 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 5 and a 4. Can backgammonGPT hit one of the opponent's checkers? Answer only with True or False and provide no other information."}], "ideal": "False"}
Eval - is it an illegal play?
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is HC/wATDg8+AxAA and the match id is cIgUAAAAAAAA. backgammonGPT has 2 checkers on the 16 point, 4 checkers on the 13 point, 4 checkers on the 8 point, 5 checkers on the 6 point. The opponent has 3 checkers on the 22 point, 4 checkers on the 19 point, 1 checker on the 18 point, 5 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 1 and a 5. Is 13/12 13/8 an illegal play? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is 4OeFATCw5+ABMA and the match id is cIgQAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 4 checkers on the 13 point, 3 checkers on the 8 point, 4 checkers on the 6 point, 2 checkers on the 5 point. The opponent has 6 checkers on the 19 point, 4 checkers on the 17 point, 1 checker on the 16 point, 2 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 1 and a 4. Is 13/12 6/2 an illegal play? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is Ds/gATDwZswBMA and the match id is cAgJAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 3 checkers on the 13 point, 2 checkers on the 11 point, 2 checkers on the 8 point, 2 checkers on the 6 point, 4 checkers on the 5 point. The opponent has 3 checkers on the 23 point, 4 checkers on the 19 point, 2 checkers on the 17 point, 4 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 2 and a 2. Is 24/22 24/22 8/6 6/4 an illegal play? Answer only with True or False and provide no other information."}], "ideal": "False"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is Ns7gAQbMHsYBMA and the match id is cIgRAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 3 checkers on the 13 point, 2 checkers on the 10 point, 4 checkers on the 6 point, 2 checkers on the 5 point, 2 checkers on the 3 point. The opponent has 2 checkers on the 23 point, 2 checkers on the 22 point, 3 checkers on the 19 point, 2 checkers on the 17 point, 4 checkers on the 12 point, 2 checkers on the 4 point. backgammonGPT is rolling a 3 and a 4. Is 13/10 10/6 an illegal play? Answer only with True or False and provide no other information."}], "ideal": "False"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is xOfgARhkZ/ABMA and the match id is cAgJAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 5 checkers on the 13 point, 2 checkers on the 8 point, 3 checkers on the 6 point, 2 checkers on the 5 point, 1 checker on the 3 point. The opponent has 1 checker on the 22 point, 5 checkers on the 19 point, 3 checkers on the 17 point, 4 checkers on the 12 point, 2 checkers on the 2 point. backgammonGPT is rolling a 2 and a 2. Is 13/11 6/4 5/3 5/3 an illegal play? Answer only with True or False and provide no other information."}], "ideal": "False"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is Nq6DATA4z8EBMA and the match id is cIgQAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 3 checkers on the 13 point, 3 checkers on the 8 point, 4 checkers on the 6 point, 3 checkers on the 4 point. The opponent has 2 checkers on the 23 point, 2 checkers on the 22 point, 3 checkers on the 19 point, 1 checker on the 18 point, 3 checkers on the 17 point, 2 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 1 and a 4. Is 24/20 13/12 an illegal play? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is mM/gMQDgz8ExAA and the match id is cIgVAAAAAAAA. backgammonGPT has 2 checkers on the 16 point, 3 checkers on the 13 point, 3 checkers on the 8 point, 7 checkers on the 6 point. The opponent has 2 checkers on the 21 point, 5 checkers on the 19 point, 2 checkers on the 17 point, 4 checkers on the 12 point, 2 checkers on the 9 point. backgammonGPT is rolling a 3 and a 5. Is 6/3 6/1 an illegal play? Answer only with True or False and provide no other information."}], "ideal": "False"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is jOfgATCwZ+IBMA and the match id is cAgNAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 4 checkers on the 13 point, 1 checker on the 10 point, 2 checkers on the 8 point, 4 checkers on the 6 point, 2 checkers on the 5 point. The opponent has 2 checkers on the 22 point, 4 checkers on the 19 point, 3 checkers on the 17 point, 4 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 2 and a 3. Is 24/22 13/10 an illegal play? Answer only with True or False and provide no other information."}], "ideal": "True"}
{"input": [{"role": "system", "content": "You are backgammonGPT - a backgammon engine. Answer the following True/False question with the provided backgammon board state and the provided dice rolls. It is backgammonGPT's turn."}, {"role": "user", "content": "The backgammon board's position id is nDuDATA8noMBMA and the match id is cIgNAAAAAAAA. backgammonGPT has 2 checkers on the 24 point, 2 checkers on the 13 point, 3 checkers on the 8 point, 4 checkers on the 6 point, 4 checkers on the 3 point. The opponent has 3 checkers on the 22 point, 3 checkers on the 20 point, 3 checkers on the 19 point, 2 checkers on the 17 point, 2 checkers on the 12 point, 2 checkers on the 1 point. backgammonGPT is rolling a 3 and a 3. Is 13/10 13/10 8/5 6/3 an illegal play? Answer only with True or False and provide no other information."}], "ideal": "False"}
@sushant-openai thx, added an updated yaml file!
Thanks for the comment @usama-openai. It was actually the initial idea to provide a list of valid moves for certain backgammon evals - where that is applicable. Eventually I decided against it. For instance, I wanted to avoid situations where the notation gets in the way - instead I generated samples and board states that are "straight forward" - from a backgammon player perspective.
Talking about notation: this is not as well defined as e.g. in chess (I believe - not a chess player). Depending on the software or resource there are slight differences. But I understand that this can be circumvented by giving examples in the prompt (this is already somewhat the case in second eval regarding the the question if a move is illegal or not)
Maybe a good example is the first eval, the "can we hit the opponent's checker" eval.
There are cases like this:
Position ID: 4PMnADDg28EBBg
Match ID : cIgNAAAAAAAA
+13-14-15-16-17-18------19-20-21-22-23-24-+
| X O O | | O X |
| X O | | O X |
| X O | | O |
| O | | O |
| 7 | | O |
v| |BAR| |
| | | X |
| | | X |
| X | | X |
| X X | | X O |
| X X | | X O |
+12-11-10--9--8--7-------6--5--4--3--2--1-+
dice: (3, 3)
Here, we can hit the opponent's checker given the board state and the dice roll - but the set of all moves that can hit is relatively large (since we rolled a double and the board state allows for it):
21/18 21/18 18/15* 7/4
21/18 21/18 18/15* 18/15
21/18 18/15* 8/5 6/3
21/18 18/15* 13/10 10/7
21/18 18/15* 15/12 12/9
21/18 21/18 18/15* 15/12
21/18 18/15* 15/12 13/10
21/18 18/15* 7/4 6/3
21/18 18/15* 15/12 7/4
21/18 18/15* 13/10 8/5
21/18 21/18 18/15* 6/3
21/18 21/18 18/15* 13/10
21/18 18/15* 6/3 6/3
21/18 18/15* 13/10 13/10
21/18 18/15* 13/10 6/3
21/18 18/15* 8/5 8/5
21/18 18/15* 8/5 7/4
21/18 21/18 18/15* 8/5
21/18 18/15* 8/5 5/2
21/18 18/15* 13/10 7/4
21/18 18/15* 15/12 8/5
21/18 18/15* 15/12 6/3
21/18 18/15* 7/4 7/4
Those are all unique plays - without any permutations, which will make this a really long "Includes
" list.
On the contrary, this list will be empty if there is no play available that can hit the opponent. After thinking about this I decided to go with the current setup.
But of course I am happy to change things around. For scenario 1 ("can we hit") - would you recommend to "include" all the possible valid plays which hit the opponent (like the example above) and include a section about the expected format in the prompt? Also - given the current setup for eval 1 - there will be cases where you cannot hit - those would then essentially be an empty list. Again - I am happy to make those changes. Another option is to just ask only for a single valid play.
Regarding the second eval ("is this an illegal move") - there things are a bit different - the move is already part of the prompt, e.g.
"... backgammonGPT is rolling a 2 and a 2. Is 24/22 13/11 6/4 6/4 an illegal play? Answer only with True or False and provide no other information."
What would you recommend in such a case? Thanks!
Those are all unique plays - without any permutations, which will make this a really long "
Includes
" list.
You don't need to add all the moves to the list. You just need to make the following changes:
-
Change the evaluation method from
Match
toIncludes
in theyaml
file. -
Instead of writing
Answer only with True or False and provide no other information.
, add the following instructions in the prompt:Provide your reasoning step by step, and at the end, write your final answer, True or False, enclosed in square brackets.
Changing prompt like this will give the model a proper chance to reason through all the possibilities and come to a final conclusion.
-
Instead of writing the ideal answer as
"ideal": "False"
, write it as"ideal": "[False]"
.
The model will output something like this after implementing the above suggestions:
To determine whether backgammonGPT can hit one of the opponent's checkers with a roll of 1 and 5, we need to examine the board position and identify any vulnerable opponent checkers within the range of the dice roll.
The opponent has the following checkers:
- 2 checkers on the 22 point
- 4 checkers on the 19 point
- 2 checkers on the 17 point
- 5 checkers on the 12 point
- 2 checkers on the 1 point
backgammonGPT has the following checkers:
- 2 checkers on the 24 point
- 4 checkers on the 9 point
- 4 checkers on the 8 point
- 5 checkers on the 6 point
With a dice roll of 1 and 5, we can make the following moves for backgammonGPT's checkers:
- Move a checker from the 24 point to the 23 point (with 1) and then to the 18 point (with 5).
- Move a checker from the 6 point to the 5 point (with 1) and then to the 0 point (off the board, with 5).
None of the potential moves can hit any of the opponent's checkers, as the checkers on the 22, 19, 17, and 1 points are too far away to hit with a roll of 1 and 5. The checkers on the 12 point are covered by more than one checker, which means they can't be hit either.
[False]
The Includes
method will check for the ideal answer in the completion, which is [False]
in this case.
Thanks for the comment - yes that makes total sense! I rebuild the data and adjusted the yaml file accordingly.
Thanks for implementing the requested changes. This PR is almost good. If you have no issues, can you place the generation script in the evals/registry/data/backgammon/
directory? That'll make it easy to regenerate the dataset in the future if anyone wants to.
Added the notebook to evals/registry/data/backgammon/
.