fix key and access errors in custom-eval.md (issue #348)
Thanks to @rarhs who brought up this issue in #348.
TL;DR: There are two issues in docs/custom-eval.md: (1) the training samples have no "problem" or "answer" keys, but eval_sample() assumes they do; and (2) eval_sample() tries to access a dict when the intention is to access a list.
Observe that the documentation states:
docs/custom-eval.md
echo -e '[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]\n[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]' > /tmp/train.jsonl
echo -e '[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]\n[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]' > /tmp/test.jsonl
which creates the following training and test data:
/tmp/train.jsonl
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
/tmp/test.jsonl
[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]
[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]
Now observe eval_sample():
def eval_sample(self, test_sample, rng: random.Random):
    """
    ...
    """
    stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
    prompt = [
        {"role": "system", "content": "Solve the following math problems"},
    ]
    for i, sample in enumerate(stuffing + [test_sample]):
        if i < len(stuffing):
            prompt += [
                {"role": "system", "content": sample["problem"], "name": "example_user"},
                {"role": "system", "content": sample["answer"], "name": "example_assistant"},
            ]
        else:
            prompt += [{"role": "user", "content": sample["problem"]}]
    evals.check_sampled_text(self.model_spec, prompt, expected=sample["answer"])
Specifically, inside the loop

for i, sample in enumerate(stuffing + [test_sample]):

the code reads sample["problem"] and sample["answer"], both when building the few-shot examples and when handling the test question. Here, sample comes straight from a line of the training or test data, and notice that neither a "problem" key nor an "answer" key is found in those samples:
/tmp/train.jsonl
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
/tmp/test.jsonl
[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]
[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]
Hence, one solution is to add those keys to the samples. Specifically, change "content" to either "problem" or "answer" in the intended places:
docs/custom-eval.md
echo -e '[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]\n[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]' > /tmp/train.jsonl
echo -e '[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]\n[{"role": "system", "problem": "5*20=", "name": "example_user"}, {"role": "system", "answer": "100", "name": "example_assistant"}]' > /tmp/test.jsonl
/tmp/train.jsonl
[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]
[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]
/tmp/test.jsonl
[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]
[{"role": "system", "problem": "5*20=", "name": "example_user"}, {"role": "system", "answer": "100", "name": "example_assistant"}]
To solve the next issue, recall the way the values are accessed: sample["problem"] and sample["answer"]. Here, the code is trying to access a dictionary; however:
/tmp/test.jsonl
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
each JSON line is a list; hence, assuming we keep the test file in this format, we must index into the list first.
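For example, taking one line of the corrected test data, indexing into the list first and then into the message dict yields the values eval_sample() needs (a quick sketch):

import json

line = '[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]'
sample = json.loads(line)

print(sample[0]["problem"])  # 48+2= -- the user message holds the problem
print(sample[1]["answer"])   # 50 -- the assistant message holds the answer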
Thus, eval_sample() can be updated to:
def eval_sample(self, test_sample, rng: random.Random):
    """
    Called by the `eval_all_samples` method to evaluate a single sample.
    ARGS
    ====
    `test_sample`: a line from the JSONL test file
    `rng`: should be used for any randomness that is needed during evaluation
    This method does the following:
    1. Generate a prompt that contains the task statement, a few examples, and the test question.
    2. Check if the model generates the correct answer.
    """
    stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
    prompt = [
        {"role": "system", "content": "Solve the following math problems"},
    ]
    # Each sample is a list: index 0 is the user message, index 1 is the assistant message.
    for i, sample in enumerate(stuffing + [test_sample]):
        if i < len(stuffing):
            prompt += [
                {"role": "system", "content": sample[0]["problem"], "name": "example_user"},
                {"role": "system", "content": sample[1]["answer"], "name": "example_assistant"},
            ]
        else:
            prompt += [{"role": "user", "content": sample[0]["problem"]}]
    evals.check_sampled_text(self.model_spec, prompt, expected=sample[1]["answer"])
Observe that I added the [0] and [1] indices to access the list first. This modification makes the code consistent with the samples, each of which is a list that must be indexed before the message data can be accessed.
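To sanity-check the new indexing without calling the API, here is a small standalone sketch that builds the prompt the same way the updated loop does (the samples are hard-coded here for illustration; in the real eval they come from the JSONL files):

import json

train_samples = [
    json.loads('[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]'),
]
test_sample = json.loads('[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]')

prompt = [{"role": "system", "content": "Solve the following math problems"}]
for i, sample in enumerate(train_samples + [test_sample]):
    if i < len(train_samples):
        prompt += [
            {"role": "system", "content": sample[0]["problem"], "name": "example_user"},
            {"role": "system", "content": sample[1]["answer"], "name": "example_assistant"},
        ]
    else:
        prompt += [{"role": "user", "content": sample[0]["problem"]}]

print(prompt[-1])                # {'role': 'user', 'content': '48+2='}
print(test_sample[1]["answer"])  # 50 -- what check_sampled_text would be given as `expected`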
I was able to use the debugger in VS Code by creating a folder called .vscode at the root of the project directory and putting a launch.json file in it with the following configuration:
.vscode/launch.json
{
"version": "0.2.0",
"configurations": [
{
"name": "Run custom arithmetic eval",
"type": "python",
"request": "launch",
"program": "${workspaceFolder}/evals/cli/oaieval.py",
"args": [
"gpt-3.5-turbo",
"arithmetic"
],
"console": "integratedTerminal"
}
]
}
After running it:
user@user:~/evals$ cd /home/user/evals ; /usr/bin/env /home/user/evals/env/bin/python /home/user/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 50149 -- /home/user/evals/evals/cli/oaieval.py gpt-3.5-turbo arithmetic
[2023-03-19 14:11:00,322] [registry.py:145] Loading registry from /home/user/evals/evals/registry/evals
[2023-03-19 14:11:00,440] [registry.py:145] Loading registry from /home/user/.evals/evals
[2023-03-19 14:11:00,987] [oaieval.py:178] Run started: 230319181100THJXSEOF
[2023-03-19 14:11:00,990] [eval.py:30] Evaluating 2 samples
[2023-03-19 14:11:01,018] [eval.py:136] Running in threaded mode with 10 threads!
[2023-03-19 14:11:01,911] [record.py:320] Final report: {'accuracy': 1.0}. Logged to /tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl
[2023-03-19 14:11:01,912] [oaieval.py:209] Final report:
[2023-03-19 14:11:01,912] [oaieval.py:211] accuracy: 1.0
[2023-03-19 14:11:02,649] [record.py:309] Logged 6 rows of events to /tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl: insert_time=1.595ms
I was able to get an accuracy metric of 1.0 with the following output (formatted as .json rather than .jsonl for readability):
/tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl
{
"spec": {
"model_name": "gpt-3.5-turbo",
"model_names": {
"completions": [
"gpt-3.5-turbo"
]
},
"eval_name": "arithmetic.dev.match-v1",
"base_eval": "arithmetic",
"split": "dev",
"run_config": {
"model_specs": {
"completions_": [
{
"name": "gpt-3.5-turbo",
"model": "gpt-3.5-turbo",
"is_chat": true,
"encoding": null,
"organization": null,
"api_key": null,
"extra_options": {},
"headers": {},
"strip_completion": true,
"n_ctx": 4096,
"format": null,
"key": null,
"group": null
}
],
"embedding_": null,
"ranking_": null
},
"eval_spec": {
"cls": "evals.elsuite.arithmetic:Arithmetic",
"args": {
"train_jsonl": "/tmp/train.jsonl",
"test_jsonl": "/tmp/test.jsonl"
},
"key": "arithmetic.dev.match-v1",
"group": "arithmetic"
},
"seed": 20220722,
"max_samples": null,
"command": "/home/user/evals/evals/cli/oaieval.py gpt-3.5-turbo arithmetic",
"initial_settings": {
"visible": true
}
},
"created_by": "",
"run_id": "230319181100THJXSEOF",
"created_at": "2023-03-19 18:11:00.985636"
}
}
{
"final_report": {
"accuracy": 1.0
}
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 0,
"sample_id": "arithmetic.dev.1",
"type": "raw_sample",
"data": [
{
"role": "system",
"problem": "4*4=",
"name": "example_user"
},
{
"role": "system",
"answer": "16",
"name": "example_assistant"
}
],
"created_by": "",
"created_at": "2023-03-19 18:11:01.020114+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 1,
"sample_id": "arithmetic.dev.0",
"type": "raw_sample",
"data": [
{
"role": "system",
"problem": "2+2=",
"name": "example_user"
},
{
"role": "system",
"answer": "4",
"name": "example_assistant"
}
],
"created_by": "",
"created_at": "2023-03-19 18:11:01.025320+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 2,
"sample_id": "arithmetic.dev.0",
"type": "sampling",
"data": {
"prompt": [
{
"role": "system",
"content": "Solve the following math problems"
},
{
"role": "system",
"content": "5*20=",
"name": "example_user"
},
{
"role": "system",
"content": "100",
"name": "example_assistant"
},
{
"role": "system",
"content": "48+2=",
"name": "example_user"
},
{
"role": "system",
"content": "50",
"name": "example_assistant"
},
{
"role": "user",
"content": "2+2="
}
],
"sampled": "4",
"options": [
"4"
],
"picked": "4",
"expected": [
"4"
],
"match": true,
"metadata": {
"completion_id": "chatcmpl-6vrmjgKmc1MtZhavJOpT6nwCwkaft",
"model": "gpt-3.5-turbo-0301"
}
},
"created_by": "",
"created_at": "2023-03-19 18:11:01.885894+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 3,
"sample_id": "arithmetic.dev.0",
"type": "match",
"data": {
"correct": true,
"expected": "4",
"picked": "4",
"sampled": "4"
},
"created_by": "",
"created_at": "2023-03-19 18:11:01.885968+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 4,
"sample_id": "arithmetic.dev.1",
"type": "sampling",
"data": {
"prompt": [
{
"role": "system",
"content": "Solve the following math problems"
},
{
"role": "system",
"content": "5*20=",
"name": "example_user"
},
{
"role": "system",
"content": "100",
"name": "example_assistant"
},
{
"role": "system",
"content": "48+2=",
"name": "example_user"
},
{
"role": "system",
"content": "50",
"name": "example_assistant"
},
{
"role": "user",
"content": "4*4="
}
],
"sampled": "16",
"options": [
"16"
],
"picked": "16",
"expected": [
"16"
],
"match": true,
"metadata": {
"completion_id": "chatcmpl-6vrmjWDVgCDKZNpKt6j9l7OrAjv56",
"model": "gpt-3.5-turbo-0301"
}
},
"created_by": "",
"created_at": "2023-03-19 18:11:01.903103+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 5,
"sample_id": "arithmetic.dev.1",
"type": "match",
"data": {
"correct": true,
"expected": "16",
"picked": "16",
"sampled": "16"
},
"created_by": "",
"created_at": "2023-03-19 18:11:01.903153+00:00"
}
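For reference, the .jsonl log can be reformatted this way by pretty-printing each line; a short sketch (using the log path from this run):

import json

log_path = "/tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl"
with open(log_path) as f:
    for line in f:
        print(json.dumps(json.loads(line), indent=2))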
@andrew-openai
@rlbayes
Fixed in https://github.com/openai/evals/pull/1113