evals
Using different models for grading a model-graded eval and for generating the completion
Describe the feature or improvement you're requesting
build_eval.md says:
In general, the evaluation model and the model being evaluated don't have to be the same, though we will assume that they are here for ease of explanation.
However, I can't find anywhere how to do this. Is this currently implemented?
Additional context
No response
I recently struggled to get this to work too so I can share what I found.
This is currently implemented in the GitHub version of this repo, but not in the release on PyPI that you get by installing the library through a package manager: those published versions are many months out of date and hard-code gpt-3.5-turbo as the grader.
Lines 29-32 in evals/elsuite/modelgraded/classify.py show you how this feature is implemented: the last completion_fn given is treated as the evaluation function.
Completion functions in turn can be specified in a comma-separated string. The logic for this is at evals/cli/oaieval.py lines 142-145.
Concretely, a string like "gpt-4,gpt-3.5-turbo" seems to work for me to get gpt-4 to be the completer and gpt-3.5-turbo the one grading the responses.
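To make the mechanics concrete, here is a small self-contained sketch (not the library's actual code, and the function name is illustrative) that mirrors the two pieces of logic cited above: the comma-splitting in oaieval.py and the last-entry-as-grader rule in classify.py:

```python
# Sketch of the behavior described above: the CLI splits the completion-fn
# argument on commas, and the model-graded eval then treats the last
# completion function as the grader.
def split_completion_fns(arg: str):
    fns = arg.split(",")
    eval_fn = fns[-1]      # last entry grades the responses
    if len(fns) > 1:
        fns = fns[:-1]     # the rest generate the completions
    return fns, eval_fn

# "gpt-4,gpt-3.5-turbo": gpt-4 completes, gpt-3.5-turbo grades
print(split_completion_fns("gpt-4,gpt-3.5-turbo"))  # → (['gpt-4'], 'gpt-3.5-turbo')

# With a single entry, the same model both completes and grades
print(split_completion_fns("gpt-4"))  # → (['gpt-4'], 'gpt-4')
```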
However, be warned that there seems to be a bug where model-graded eval execution can hang for a long time in a way that other evals don't, and this appears unrelated to rate limits.
I had opened a PR last week (#1418) where I address this issue but forgot to mention it here.
Regarding #1418: A new PR is not necessary for setting the evaluating model (though the feature really should be documented), since the full relevant lines are:
# treat last completion_fn as eval_completion_fn
self.eval_completion_fn = self.completion_fns[-1]
if len(self.completion_fns) > 1:
self.completion_fns = self.completion_fns[:-1]
If you pass in multiple completion functions (as a comma-separated list) into completion_fns, then the last one will be treated as the evaluating model.
But wouldn't the task then be run on all of the passed completion functions?
If you want to run the eval with modelA and run the grading with modelB, then you can pass the string "modelA,modelB" as the name of the completer.
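As a command line, that looks like the following (the eval name is a placeholder; substitute a model-graded eval registered in your install). The runnable part below only demonstrates how the comma-separated string splits:

```shell
# Illustrative invocation only: modelA generates, modelB grades.
#   oaieval modelA,modelB <your-modelgraded-eval>
#
# How the comma-separated string splits (runnable demo):
fns="modelA,modelB"
echo "completer: ${fns%,*}"   # text before the last comma → modelA
echo "grader: ${fns##*,}"     # text after the last comma → modelB
```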