evals
evals copied to clipboard
[evals] Add eval which tests models ability to respond in base64 strings
Thank you for contributing an eval! ♥️
🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨
PLEASE READ THIS:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PR's with ~5-10 example that we can then run the evals on and share the results with you so you know how your eval does with GPT-4 before writing all 100 examples.
Eval details 📑
Eval name
Base64 Questions
Eval description
This eval tests the models ability to respond to questions in base64 string. A question is asked in plain english and the model is asked to respond only in base64 encoded strings.
On GPT3.5, almost all evals have a similar failure mode, which is the model responding with a portion of the question string encoded in base64 rather than the answer.
I'm eager to see whether GPT4 has better handling of this format.
What makes this a useful eval?
This eval is useful because it highlights an inability of the model to correctly reason through the problem. A human would easily answer these by first determining the plain English answer and then encoding the answer in base64. As best as I can guess, GPT3.5 is failing because it is trying to do both at the same time and losing track of what its task is.
It's also interesting because the model clearly can handle encoding/decoding tasks to some extent -- most of the model outputs are valid base64, they aren't correct answers to the question. But they aren't gibberish and they are clearly contextual to the question. It's interesting that it's not just spitting out random base64 strings but actually is acting upon the input text.
There are a few other tasks in this area that I'd be happy to add another eval for:
- Simple encoding/decoding of base64 strings. GPT3.5 does relatively well at this but not flawlessly in my experience, particularly on larger or more complicated inputs.
- Receiving a question in base64 and answering it (either in base64 or english). In my experience the model has multiple difficulties here -- first with respecting the prompts instructions not to included the translated question in its response, and then with giving a coherent answer if it doesn't included the question in the response.
Please let me know if either of the above would be interesting additions.
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for
Basic
evals or theFact
Model-graded eval, or an exhaustive rubric for evaluating answers for theCriteria
Model-graded eval. - [x] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100)
If there is anything else that makes your eval worth including, please document it below.
Unique eval value
Insert what makes your eval high quality that was not mentioned above. (Not required)
Eval structure 🏗️
Your eval should
- [x] Check that your data is in
evals/registry/data/{name}
- [x] Check that your yaml is registered at
evals/registry/evals/{name}.yaml
- [x] Ensure you have the right to use the data you submit via this eval
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.
Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.
Submit eval
- [x] I have filled out all required fields in the evals PR form
- [x] (Ignore if not submitting code) I have run
pip install pre-commit; pre-commit install
and have verified thatblack
,isort
, andautoflake
are running when I commit and push
Failure to fill out all required fields will result in the PR being closed.
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of France? Respond in a base64 encoded string with no other text"}], "ideal": "UGFyaXM="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the tallest mountain in the world? Respond in a base64 encoded string with no other text"}], "ideal": "TW91bnQgRXZlcmVzdA=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the chemical symbol for gold? Respond in a base64 encoded string with no other text"}], "ideal": "QXU="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote the novel 'Pride and Prejudice'? Respond in a base64 encoded string with no other text"}], "ideal": "SmFuZSBBdXN0ZW4="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the smallest continent in the world? Respond in a base64 encoded string with no other text"}], "ideal": "QXVzdHJhbGlh"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who discovered electricity? Respond in a base64 encoded string with no other text"}], "ideal": "QmVuamFtaW4gRnJhbmtsaW4="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest country in South America? Respond in a base64 encoded string with no other text"}], "ideal": "QnJhemls"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the square root of 144? Respond in a base64 encoded string with no other text"}], "ideal": "MTI="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who painted the Mona Lisa? Respond in a base64 encoded string with no other text"}], "ideal": "TGVvbmFyZG8gZGEgVmluY2k="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the smallest planet in our solar system? Respond in a base64 encoded string with no other text"}], "ideal": "TWVyY3VyeQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who invented the telephone? Respond in a base64 encoded string with no other text"}], "ideal": "QWxleGFuZGVyIEdyYWhhbSBCZWxs"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the currency of Japan? Respond in a base64 encoded string with no other text"}], "ideal": "WWVu"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'To Kill a Mockingbird'? Respond in a base64 encoded string with no other text"}], "ideal": "SGFycGVyIExlZQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest ocean in the world? Respond in a base64 encoded string with no other text"}], "ideal": "UGFjaWZpYyBPY2Vhbg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest mammal in the world? Respond in a base64 encoded string with no other text"}], "ideal": "Qmx1ZSBXaGFsZQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who was the first man to walk on the moon? Respond in a base64 encoded string with no other text"}], "ideal": "TmVpbCBBcm1zdHJvbmc="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Japan? Respond in a base64 encoded string with no other text"}], "ideal": "VG9reW8="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Catcher in the Rye'? Respond in a base64 encoded string with no other text"}], "ideal": "Si5ELiBTYWxpbmdlcg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the symbol for sodium on the periodic table? Respond in a base64 encoded string with no other text"}], "ideal": "TmE="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who painted 'The Starry Night'? Respond in a base64 encoded string with no other text"}], "ideal": "VmluY2VudCB2YW4gR29naA=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest bird in the world? Respond in a base64 encoded string with no other text"}], "ideal": "T3N0cmljaA=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who was the first female prime minister of the UK? Respond in a base64 encoded string with no other text"}], "ideal": "TWFyZ2FyZXQgVGhhdGNoZXI="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Adventures of Huckleberry Finn'? Respond in a base64 encoded string with no other text"}], "ideal": "TWFyayBUd2Fpbg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the name of the largest river in Asia? Respond in a base64 encoded string with no other text"}], "ideal": "WWFuZ3R6ZSBSaXZlcg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Canada? Respond in a base64 encoded string with no other text"}], "ideal": "T3R0YXdh"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who discovered penicillin? Respond in a base64 encoded string with no other text"}], "ideal": "QWxleGFuZGVyIEZsZW1pbmc="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest island in the world? Respond in a base64 encoded string with no other text"}], "ideal": "R3JlZW5sYW5k"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote '1984'? Respond in a base64 encoded string with no other text"}], "ideal": "R2VvcmdlIE9yd2VsbA=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who painted the Sistine Chapel ceiling? Respond in a base64 encoded string with no other text"}], "ideal": "TWljaGVsYW5nZWxv"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest country in Africa? Respond in a base64 encoded string with no other text"}], "ideal": "QWxnZXJpYQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who was the first US president? Respond in a base64 encoded string with no other text"}], "ideal": "R2VvcmdlIFdhc2hpbmd0b24="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the chemical symbol for oxygen? Respond in a base64 encoded string with no other text"}], "ideal": "Tw=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Great Gatsby'? Respond in a base64 encoded string with no other text"}], "ideal": "Ri4gU2NvdHQgRml0emdlcmFsZA=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest lake in Africa? Respond in a base64 encoded string with no other text"}], "ideal": "TGFrZSBWaWN0b3JpYQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who discovered America? Respond in a base64 encoded string with no other text"}], "ideal": "Q2hyaXN0b3BoZXIgQ29sdW1idXM="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the smallest country in the world? Respond in a base64 encoded string with no other text"}], "ideal": "VmF0aWNhbiBDaXR5"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Lord of the Rings'? Respond in a base64 encoded string with no other text"}], "ideal": "Si5SLlIuIFRvbGtpZW4="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who painted 'The Last Supper'? Respond in a base64 encoded string with no other text"}], "ideal": "TGVvbmFyZG8gZGEgVmluY2k="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest planet in our solar system? Respond in a base64 encoded string with no other text"}], "ideal": "SnVwaXRlcg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who invented the light bulb? Respond in a base64 encoded string with no other text"}], "ideal": "VGhvbWFzIEVkaXNvbg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What do you call a baby cat? Respond in a base64 encoded string with no other text"}], "ideal": "S2l0dGVu"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the smallest ocean in the world? Respond in a base64 encoded string with no other text"}], "ideal": "QXJjdGljIE9jZWFu"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'One Hundred Years of Solitude'? Respond in a base64 encoded string with no other text"}], "ideal": "R2FicmllbCBHYXJjaWEgTWFycXVleg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest city in Australia? Respond in a base64 encoded string with no other text"}], "ideal": "U3lkbmV5"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who discovered gravity? Respond in a base64 encoded string with no other text"}], "ideal": "SXNhYWMgTmV3dG9u"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the currency of Russia? Respond in a base64 encoded string with no other text"}], "ideal": "UnVibGU="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the chemical symbol for helium? Respond in a base64 encoded string with no other text"}], "ideal": "SGU="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Odyssey'? Respond in a base64 encoded string with no other text"}], "ideal": "SG9tZXI="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Australia? Respond in a base64 encoded string with no other text"}], "ideal": "Q2FuYmVycmE="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest country in the world by land area? Respond in a base64 encoded string with no other text"}], "ideal": "UnVzc2lh"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who painted 'The Persistence of Memory'? Respond in a base64 encoded string with no other text"}], "ideal": "U2FsdmFkb3IgRGFsaQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the smallest state in the United States? Respond in a base64 encoded string with no other text"}], "ideal": "UmhvZGUgSXNsYW5k"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Iliad'? Respond in a base64 encoded string with no other text"}], "ideal": "SG9tZXI="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest waterfall in North America? Respond in a base64 encoded string with no other text"}], "ideal": "TmlhZ2FyYSBGYWxscw=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the smallest planet in our solar system by diameter? Respond in a base64 encoded string with no other text"}], "ideal": "TWVyY3VyeQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who invented the printing press? Respond in a base64 encoded string with no other text"}], "ideal": "Sm9oYW5uZXMgR3V0ZW5iZXJn"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest city in Brazil? Respond in a base64 encoded string with no other text"}], "ideal": "U8OjbyBQYXVsbw=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Picture of Dorian Gray'? Respond in a base64 encoded string with no other text"}], "ideal": "T3NjYXIgV2lsZGU="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Egypt? Respond in a base64 encoded string with no other text"}], "ideal": "Q2Fpcm8="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest island in the Mediterranean Sea? Respond in a base64 encoded string with no other text"}], "ideal": "U2ljaWx5"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who painted 'The Birth of Venus'? Respond in a base64 encoded string with no other text"}], "ideal": "U2FuZHJvIEJvdHRpY2VsbGk="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the smallest country in South America? Respond in a base64 encoded string with no other text"}], "ideal": "U3VyaW5hbWU="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who invented the steam engine? Respond in a base64 encoded string with no other text"}], "ideal": "SmFtZXMgV2F0dA=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What do you call a group of wolves? Respond in a base64 encoded string with no other text"}], "ideal": "UGFjaw=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Brothers Karamazov'? Respond in a base64 encoded string with no other text"}], "ideal": "RnlvZG9yIERvc3RvZXZza3k="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest lake in the United States? Respond in a base64 encoded string with no other text"}], "ideal": "TGFrZSBTdXBlcmlvcg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of South Korea? Respond in a base64 encoded string with no other text"}], "ideal": "U2VvdWw="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest island in the Caribbean? Respond in a base64 encoded string with no other text"}], "ideal": "Q3ViYQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Divine Comedy'? Respond in a base64 encoded string with no other text"}], "ideal": "RGFudGUgQWxpZ2hpZXJp"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest river in Europe? Respond in a base64 encoded string with no other text"}], "ideal": "Vm9sZ2EgUml2ZXI="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the highest-grossing film of all time? Respond in a base64 encoded string with no other text"}], "ideal": "QXZhdGFy"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who discovered radioactivity? Respond in a base64 encoded string with no other text"}], "ideal": "TWFyaWUgQ3VyaWU="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'Crime and Punishment'? Respond in a base64 encoded string with no other text"}], "ideal": "RnlvZG9yIERvc3RvZXZza3k="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest lake in South America? Respond in a base64 encoded string with no other text"}], "ideal": "TGFrZSBUaXRpY2FjYQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the highest mountain in Africa? Respond in a base64 encoded string with no other text"}], "ideal": "TW91bnQgS2lsaW1hbmphcm8="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the smallest ocean by volume? Respond in a base64 encoded string with no other text"}], "ideal": "QXJjdGljIE9jZWFu"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Old Man and the Sea'? Respond in a base64 encoded string with no other text"}], "ideal": "RXJuZXN0IEhlbWluZ3dheQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest city in China by population? Respond in a base64 encoded string with no other text"}], "ideal": "U2hhbmdoYWk="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Canterbury Tales'? Respond in a base64 encoded string with no other text"}], "ideal": "R2VvZmZyZXkgQ2hhdWNlcg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest lake in North America? Respond in a base64 encoded string with no other text"}], "ideal": "TGFrZSBTdXBlcmlvcg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who painted 'The Creation of Adam'? Respond in a base64 encoded string with no other text"}], "ideal": "TWljaGVsYW5nZWxv"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Turkey? Respond in a base64 encoded string with no other text"}], "ideal": "QW5rYXJh"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Kenya? Respond in a base64 encoded string with no other text"}], "ideal": "TmFpcm9iaQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest city in Canada by population? Respond in a base64 encoded string with no other text"}], "ideal": "VG9yb250bw=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Portugal? Respond in a base64 encoded string with no other text"}], "ideal": "TGlzYm9u"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Greece? Respond in a base64 encoded string with no other text"}], "ideal": "QXRoZW5z"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the tallest mammal? Respond in a base64 encoded string with no other text"}], "ideal": "R2lyYWZmZQ=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who painted 'The Scream'? Respond in a base64 encoded string with no other text"}], "ideal": "RWR2YXJkIE11bmNo"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Argentina? Respond in a base64 encoded string with no other text"}], "ideal": "QnVlbm9zIEFpcmVz"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Iran? Respond in a base64 encoded string with no other text"}], "ideal": "VGVocmFu"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who painted 'Guernica'? Respond in a base64 encoded string with no other text"}], "ideal": "UGFibG8gUGljYXNzbw=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the largest city in Africa by population? Respond in a base64 encoded string with no other text"}], "ideal": "TGFnb3M="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'War and Peace'? Respond in a base64 encoded string with no other text"}], "ideal": "TGVvIFRvbHN0b3k="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Ireland? Respond in a base64 encoded string with no other text"}], "ideal": "RHVibGlu"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'One Hundred Years of Solitude'? Respond in a base64 encoded string with no other text"}], "ideal": "R2FicmllbCBHYXJjaWEgTWFycXVleg=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Thailand? Respond in a base64 encoded string with no other text"}], "ideal": "QmFuZ2tvaw=="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the smallest continent by land area? Respond in a base64 encoded string with no other text"}], "ideal": "QXVzdHJhbGlh"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'Moby-Dick'? Respond in a base64 encoded string with no other text"}], "ideal": "SGVybWFuIE1lbHZpbGxl"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "Who wrote 'The Picture of Dorian Gray'? Respond in a base64 encoded string with no other text"}], "ideal": "T3NjYXIgV2lsZGU="}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the capital of Saudi Arabia? Respond in a base64 encoded string with no other text"}], "ideal": "Uml5YWRo"}
{"input": [{ "role": "system", "content": "You are a helpful coding assistant." }, { "role": "user", "content": "What is the capital of Norway? Respond in a base64 encoded string with no other text"}, { "role": "assistant", "content": "T3Nsbw=="}, { "role": "user", "content": "What is the name of the planet we live on? Respond in a base64 encoded string with no other text"}], "ideal": "RWFydGg="}