fix key and access errors in custom-eval.md (issue #348)
Thanks to @rarhs who brought up this issue in #348.
TL;DR: There are two issues in docs/custom-eval.md: (1) the training samples have no "problem" or "answer" keys, but eval_sample() assumes they do; and (2) eval_sample() tries to access a dict when the intention is to access a list.
Observe that the documentation states:
docs/custom-eval.md
echo -e '[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]\n[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]' > /tmp/train.jsonl
echo -e '[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]\n[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]' > /tmp/test.jsonl
which creates the following training and test data:
/tmp/train.jsonl
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
/tmp/test.jsonl
[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]
[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]
Now observe eval_sample():
def eval_sample(self, test_sample, rng: random.Random):
    """
    ...
    """
    stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
    prompt = [
        {"role": "system", "content": "Solve the following math problems"},
    ]
    for i, sample in enumerate(stuffing + [test_sample]):
        if i < len(stuffing):
            prompt += [
                {"role": "system", "content": sample["problem"], "name": "example_user"},
                {"role": "system", "content": sample["answer"], "name": "example_assistant"},
            ]
        else:
            prompt += [{"role": "user", "content": sample["problem"]}]
    evals.check_sampled_text(self.model_spec, prompt, expected=sample["answer"])
Specifically, inside the loop

for i, sample in enumerate(stuffing + [test_sample]):

the code reads sample["problem"] and sample["answer"], both when building the few-shot examples and when handling the test question. Here, sample comes straight from a line of the training or test data, and notice that neither a "problem" key nor an "answer" key is found in those samples:
/tmp/train.jsonl
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
/tmp/test.jsonl
[{"role": "system", "content": "48+2=", "name": "example_user"}, {"role": "system", "content": "50", "name": "example_assistant"}]
[{"role": "system", "content": "5*20=", "name": "example_user"}, {"role": "system", "content": "100", "name": "example_assistant"}]
Hence, one solution is to add those keys to the samples. Specifically, change "content" to either "problem" or "answer" in the intended places:
docs/custom-eval.md
echo -e '[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]\n[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]' > /tmp/train.jsonl
echo -e '[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]\n[{"role": "system", "problem": "5*20=", "name": "example_user"}, {"role": "system", "answer": "100", "name": "example_assistant"}]' > /tmp/test.jsonl
/tmp/train.jsonl
[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]
[{"role": "system", "problem": "4*4=", "name": "example_user"}, {"role": "system", "answer": "16", "name": "example_assistant"}]
/tmp/test.jsonl
[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]
[{"role": "system", "problem": "5*20=", "name": "example_user"}, {"role": "system", "answer": "100", "name": "example_assistant"}]
To solve the next issue, recall the way the values are accessed: sample["problem"] and sample["answer"]. Here, the code is trying to access a dictionary; however:
/tmp/test.jsonl
[{"role": "system", "content": "2+2=", "name": "example_user"}, {"role": "system", "content": "4", "name": "example_assistant"}]
[{"role": "system", "content": "4*4=", "name": "example_user"}, {"role": "system", "content": "16", "name": "example_assistant"}]
each JSON line is a list; hence, assuming we keep the test file in this format, we must index into the list first.
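For example, taking one line of the corrected test data, indexing into the list first and then into the message dict yields the values eval_sample() needs (a quick sketch):

import json

line = '[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]'
sample = json.loads(line)

print(sample[0]["problem"])  # 48+2= -- the user message holds the problem
print(sample[1]["answer"])   # 50 -- the assistant message holds the answer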
Thus, eval_sample() can be updated to:
def eval_sample(self, test_sample, rng: random.Random):
    """
    Called by the `eval_all_samples` method to evaluate a single sample.
    ARGS
    ====
    `test_sample`: a line from the JSONL test file
    `rng`: should be used for any randomness that is needed during evaluation
    This method does the following:
    1. Generate a prompt that contains the task statement, a few examples, and the test question.
    2. Check if the model generates the correct answer.
    """
    stuffing = rng.sample(self.train_samples, self.train_samples_per_prompt)
    prompt = [
        {"role": "system", "content": "Solve the following math problems"},
    ]
    # Each sample is a list: index 0 is the user message, index 1 is the assistant message.
    for i, sample in enumerate(stuffing + [test_sample]):
        if i < len(stuffing):
            prompt += [
                {"role": "system", "content": sample[0]["problem"], "name": "example_user"},
                {"role": "system", "content": sample[1]["answer"], "name": "example_assistant"},
            ]
        else:
            prompt += [{"role": "user", "content": sample[0]["problem"]}]
    evals.check_sampled_text(self.model_spec, prompt, expected=sample[1]["answer"])
Observe that I added the [0] and [1] indices to access the list first. This modification makes the code consistent with the samples, each of which is a list that must be indexed before the message data can be accessed.
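To sanity-check the new indexing without calling the API, here is a small standalone sketch that builds the prompt the same way the updated loop does (the samples are hard-coded here for illustration; in the real eval they come from the JSONL files):

import json

train_samples = [
    json.loads('[{"role": "system", "problem": "2+2=", "name": "example_user"}, {"role": "system", "answer": "4", "name": "example_assistant"}]'),
]
test_sample = json.loads('[{"role": "system", "problem": "48+2=", "name": "example_user"}, {"role": "system", "answer": "50", "name": "example_assistant"}]')

prompt = [{"role": "system", "content": "Solve the following math problems"}]
for i, sample in enumerate(train_samples + [test_sample]):
    if i < len(train_samples):
        prompt += [
            {"role": "system", "content": sample[0]["problem"], "name": "example_user"},
            {"role": "system", "content": sample[1]["answer"], "name": "example_assistant"},
        ]
    else:
        prompt += [{"role": "user", "content": sample[0]["problem"]}]

print(prompt[-1])                # {'role': 'user', 'content': '48+2='}
print(test_sample[1]["answer"])  # 50 -- what check_sampled_text would be given as `expected`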
I was able to use the debugger in VS Code by creating a folder called .vscode at the root of the project directory and putting a launch.json file in it with the following configuration:
.vscode/launch.json
{
"version": "0.2.0",
"configurations": [
{
"name": "Run custom arithmetic eval",
"type": "python",
"request": "launch",
"program": "${workspaceFolder}/evals/cli/oaieval.py",
"args": [
"gpt-3.5-turbo",
"arithmetic"
],
"console": "integratedTerminal"
}
]
}
After running it:
user@user:~/evals$ cd /home/user/evals ; /usr/bin/env /home/user/evals/env/bin/python /home/user/.vscode-server/extensions/ms-python.python-2023.4.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 50149 -- /home/user/evals/evals/cli/oaieval.py gpt-3.5-turbo arithmetic
[2023-03-19 14:11:00,322] [registry.py:145] Loading registry from /home/user/evals/evals/registry/evals
[2023-03-19 14:11:00,440] [registry.py:145] Loading registry from /home/user/.evals/evals
[2023-03-19 14:11:00,987] [oaieval.py:178] Run started: 230319181100THJXSEOF
[2023-03-19 14:11:00,990] [eval.py:30] Evaluating 2 samples
[2023-03-19 14:11:01,018] [eval.py:136] Running in threaded mode with 10 threads!
[2023-03-19 14:11:01,911] [record.py:320] Final report: {'accuracy': 1.0}. Logged to /tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl
[2023-03-19 14:11:01,912] [oaieval.py:209] Final report:
[2023-03-19 14:11:01,912] [oaieval.py:211] accuracy: 1.0
[2023-03-19 14:11:02,649] [record.py:309] Logged 6 rows of events to /tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl: insert_time=1.595ms
I was able to get an accuracy metric of 1.0 with the following output (formatted as .json rather than .jsonl for readability):
/tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl
{
"spec": {
"model_name": "gpt-3.5-turbo",
"model_names": {
"completions": [
"gpt-3.5-turbo"
]
},
"eval_name": "arithmetic.dev.match-v1",
"base_eval": "arithmetic",
"split": "dev",
"run_config": {
"model_specs": {
"completions_": [
{
"name": "gpt-3.5-turbo",
"model": "gpt-3.5-turbo",
"is_chat": true,
"encoding": null,
"organization": null,
"api_key": null,
"extra_options": {},
"headers": {},
"strip_completion": true,
"n_ctx": 4096,
"format": null,
"key": null,
"group": null
}
],
"embedding_": null,
"ranking_": null
},
"eval_spec": {
"cls": "evals.elsuite.arithmetic:Arithmetic",
"args": {
"train_jsonl": "/tmp/train.jsonl",
"test_jsonl": "/tmp/test.jsonl"
},
"key": "arithmetic.dev.match-v1",
"group": "arithmetic"
},
"seed": 20220722,
"max_samples": null,
"command": "/home/user/evals/evals/cli/oaieval.py gpt-3.5-turbo arithmetic",
"initial_settings": {
"visible": true
}
},
"created_by": "",
"run_id": "230319181100THJXSEOF",
"created_at": "2023-03-19 18:11:00.985636"
}
}
{
"final_report": {
"accuracy": 1.0
}
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 0,
"sample_id": "arithmetic.dev.1",
"type": "raw_sample",
"data": [
{
"role": "system",
"problem": "4*4=",
"name": "example_user"
},
{
"role": "system",
"answer": "16",
"name": "example_assistant"
}
],
"created_by": "",
"created_at": "2023-03-19 18:11:01.020114+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 1,
"sample_id": "arithmetic.dev.0",
"type": "raw_sample",
"data": [
{
"role": "system",
"problem": "2+2=",
"name": "example_user"
},
{
"role": "system",
"answer": "4",
"name": "example_assistant"
}
],
"created_by": "",
"created_at": "2023-03-19 18:11:01.025320+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 2,
"sample_id": "arithmetic.dev.0",
"type": "sampling",
"data": {
"prompt": [
{
"role": "system",
"content": "Solve the following math problems"
},
{
"role": "system",
"content": "5*20=",
"name": "example_user"
},
{
"role": "system",
"content": "100",
"name": "example_assistant"
},
{
"role": "system",
"content": "48+2=",
"name": "example_user"
},
{
"role": "system",
"content": "50",
"name": "example_assistant"
},
{
"role": "user",
"content": "2+2="
}
],
"sampled": "4",
"options": [
"4"
],
"picked": "4",
"expected": [
"4"
],
"match": true,
"metadata": {
"completion_id": "chatcmpl-6vrmjgKmc1MtZhavJOpT6nwCwkaft",
"model": "gpt-3.5-turbo-0301"
}
},
"created_by": "",
"created_at": "2023-03-19 18:11:01.885894+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 3,
"sample_id": "arithmetic.dev.0",
"type": "match",
"data": {
"correct": true,
"expected": "4",
"picked": "4",
"sampled": "4"
},
"created_by": "",
"created_at": "2023-03-19 18:11:01.885968+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 4,
"sample_id": "arithmetic.dev.1",
"type": "sampling",
"data": {
"prompt": [
{
"role": "system",
"content": "Solve the following math problems"
},
{
"role": "system",
"content": "5*20=",
"name": "example_user"
},
{
"role": "system",
"content": "100",
"name": "example_assistant"
},
{
"role": "system",
"content": "48+2=",
"name": "example_user"
},
{
"role": "system",
"content": "50",
"name": "example_assistant"
},
{
"role": "user",
"content": "4*4="
}
],
"sampled": "16",
"options": [
"16"
],
"picked": "16",
"expected": [
"16"
],
"match": true,
"metadata": {
"completion_id": "chatcmpl-6vrmjWDVgCDKZNpKt6j9l7OrAjv56",
"model": "gpt-3.5-turbo-0301"
}
},
"created_by": "",
"created_at": "2023-03-19 18:11:01.903103+00:00"
}
{
"run_id": "230319181100THJXSEOF",
"event_id": 5,
"sample_id": "arithmetic.dev.1",
"type": "match",
"data": {
"correct": true,
"expected": "16",
"picked": "16",
"sampled": "16"
},
"created_by": "",
"created_at": "2023-03-19 18:11:01.903153+00:00"
}
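For reference, the .jsonl log can be reformatted this way by pretty-printing each line; a short sketch (using the log path from this run):

import json

log_path = "/tmp/evallogs/230319181100THJXSEOF_gpt-3.5-turbo_arithmetic.jsonl"
with open(log_path) as f:
    for line in f:
        print(json.dumps(json.loads(line), indent=2))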
@andrew-openai
@rlbayes
Fixed in https://github.com/openai/evals/pull/1113