evals icon indicating copy to clipboard operation
evals copied to clipboard

Chess: Counting pieces left on the board

Open jatinparab98 opened this issue 2 years ago • 1 comments

Eval details 📑

Eval name

Chess Piece Count

Eval description

Tests the models ability to understand and play out chess moves by reading input in a PGN format. Specifically, it tests the models ability to count the number of pieces remaining on the board after a set of moves have been played.

What makes this a useful eval?

The ability to calculate the game state by following game rules and performing a sequence of steps is important for the model to be able to teach and evaluate games. Chess, being a complex and fundamental game with clearly defined rules, it is a good benchmark to satisfy.

Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

  • [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
  • [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
  • [x] Includes good signal around what is the right behavior. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
  • [x] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100)

If there is anything else that makes your eval worth including, please document it below.

Unique eval value

This gave a 0.12 accuracy score on gpt-3.5-turbo. I tried running a similar prompt on Bing chatbot which is powered by GPT4, where interestingly it is able to play the game (make the moves) correctly, but somehow, fails to correctly count the number of pieces on the board. It outputs the FEN notation of the board correctly after every move, but doesn't count the pieces left correctly. The samples for this eval contain tournament games that vary in length, as well as in level of play.

Eval structure 🏗️

Your eval should

  • [x] Check that your data is in evals/registry/data/{name}
  • [x] Check that your yaml is registered at evals/registry/evals/{name}.jsonl
  • [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

Final checklist 👀

Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

  • [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

  • [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

  • [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

Submit eval

  • [x] I have filled out all required fields in the evals PR form
  • [x] (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that black, isort, and autoflake are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

View evals in JSON

Eval

{"input": [{"role": "system", "content": "You are ChessGPT. A helpful AI chatbot that understands chess fundamental and helps people analyse their moves. Follow the instructions given to the point."}, {"role": "user", "content": "Imagine a standard chess board and play out the following moves one by one on it as described in the PGN notation format.\nAfter playing these moves tell me the number of pieces that are left on the board.\nMoves: 1. d4 d5 2. c4 e6 3. Nf3 Nf6 4. Nc3 Bb4 5. Qa4+ Nc6 6. a3 Bxc3+ 7. bxc3 Ne4 8. cxd5 exd5 9. c4 O-O 10. e3 dxc4 11. Bxc4 Be6 12. Bb2 Bxc4 13. Qxc4 Na5 14. Qa2 c5 15. dxc5 Qd3 16. Qb1 Nxc5 17. Bd4 Rfc8 18. Qxd3 Nxd3+ 19. Ke2 Nc5 20. Bxc5 Rxc5 21. Rhc1 Rac8 22. Rxc5 Rxc5 23. Nd4 Kf8 24. g4 g6 25. h3 Ke7 26. Kd3 Nc4 27. Rb1 Nd6 28. e4 Ra5 29. Rb3 Kd7 30. f4 b6 31. Rc3 Ra4 32. e5 Nb7 33. f5 Nc5+ 34. Ke3 gxf5\nYour answer should strictly only contain a single number, no additional words."}], "ideal": ["15"]}
{"input": [{"role": "system", "content": "You are ChessGPT. A helpful AI chatbot that understands chess fundamental and helps people analyse their moves. Follow the instructions given to the point."}, {"role": "user", "content": "Imagine a standard chess board and play out the following moves one by one on it as described in the PGN notation format.\nAfter playing these moves tell me the number of pieces that are left on the board.\nMoves: 1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 c6 5. Bg5 dxc4 6. e4 b5 7. e5 h6 8. Bh4 g5 9. Nxg5 hxg5 10. Bxg5 Nbd7 11. exf6 Bb7 12. Be2 Nxf6 13. Bf3 Be7 14. Qd2 Nd5 15. Bxe7 Qxe7 16. a4 a6 17. Ne4 O-O-O 18. axb5 cxb5 19. Nc5 e5 20. Kf1 exd4 21. Nxb7 Qxb7 22. Qxd4 Qb6 23. Qxb6 Nxb6 24. Rxa6 Na4 25. g3 Nxb2 26. Kg2 Kc7 27. Rha1 Rb8 28. Rf6 Na4 29. Rd1 Rbd8 30. Rc6+ Kb8 31. Rb1 Nc3 32. Ra1 Na4 33. Ra6 Kc7 34. h4 Rd6 35. Ra7+ Kb6 36. Rb7+ Kc5 37. Rxf7 c3 38. h5 Kb4 39. g4 Kb3 40. g5 Kb2 41. Re1 c2 42. Rc7 Nc3 43. g6 Re8 44. Rxe8 c1=Q 45. Re3 b4 46. g7 Rd8\nYour answer should strictly only contain a single number, no additional words."}], "ideal": ["12"]}
{"input": [{"role": "system", "content": "You are ChessGPT. A helpful AI chatbot that understands chess fundamental and helps people analyse their moves. Follow the instructions given to the point."}, {"role": "user", "content": "Imagine a standard chess board and play out the following moves one by one on it as described in the PGN notation format.\nAfter playing these moves tell me the number of pieces that are left on the board.\nMoves: 1. e4 c5 2. Nf3 Nc6 3. Bb5 e6 4. O-O Nge7 5. c3 a6 6. Ba4 b5 7. Bc2 Bb7 8. Re1 g6 9. d4 cxd4 10. cxd4 Bg7 11. d5 Na5 12. d6 Nec6 13. Nbd2 g5 14. Nb3 Nxb3 15. Bxb3 g4 16. Nd2 h5 17. e5 Nxe5 18. Ne4 Bxe4 19. Rxe4 f5\nYour answer should strictly only contain a single number, no additional words."}], "ideal": ["25"]}
{"input": [{"role": "system", "content": "You are ChessGPT. A helpful AI chatbot that understands chess fundamental and helps people analyse their moves. Follow the instructions given to the point."}, {"role": "user", "content": "Imagine a standard chess board and play out the following moves one by one on it as described in the PGN notation format.\nAfter playing these moves tell me the number of pieces that are left on the board.\nMoves: 1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. d3 d6 6. c3 Bd7 7. Nbd2 g6\nYour answer should strictly only contain a single number, no additional words."}], "ideal": ["32"]}
{"input": [{"role": "system", "content": "You are ChessGPT. A helpful AI chatbot that understands chess fundamental and helps people analyse their moves. Follow the instructions given to the point."}, {"role": "user", "content": "Imagine a standard chess board and play out the following moves one by one on it as described in the PGN notation format.\nAfter playing these moves tell me the number of pieces that are left on the board.\nMoves: 1. d4 Nf6 2. c4 g6 3. Nc3 Bg7 4. e4 d6 5. Nf3 O-O\nYour answer should strictly only contain a single number, no additional words."}], "ideal": ["32"]}
{"input": [{"role": "system", "content": "You are ChessGPT. A helpful AI chatbot that understands chess fundamental and helps people analyse their moves. Follow the instructions given to the point."}, {"role": "user", "content": "Imagine a standard chess board and play out the following moves one by one on it as described in the PGN notation format.\nAfter playing these moves tell me the number of pieces that are left on the board.\nMoves: 1. e4 c5 2. Nf3 e6 3. c3 d5 4. e5 d4 5. Na3 Ne7 6. Nc4 Nec6 7. a4 Nd7 8. Bd3 Be7 9. O-O Nb6 10. Re1 Nxc4 11. Bxc4 b6 12. cxd4 cxd4 13. b3 Bb7 14. Bb2 O-O 15. Bd3 Rc8 16. Be4 Qd7 17. Rc1 Nb4 18. Bxd4 Nd3 19. Rxc8 Rxc8 20. Bxd3 Bxf3 21. Qxf3 Qxd4 22. Qb7 Re8 23. Re3 Bc5 24. Rf3 Rf8 25. Qxa7 Qxe5 26. Qd7 f5\nYour answer should strictly only contain a single number, no additional words."}], "ideal": ["19"]}

jatinparab98 avatar Mar 16 '23 11:03 jatinparab98

Related PRs (same topic, chess):

  • #45
  • #124
  • #134
  • #239
  • #245

Ein-Tim avatar Mar 16 '23 15:03 Ein-Tim