evals
Using different models for grading a model-graded eval and for generating the completion
Describe the feature or improvement you're requesting
build_eval.md says:
In general, the evaluation model and the model being evaluated don't have to be the same, though we will assume that they are here for ease of explanation.
However, I can't find anywhere how to do this. Is this currently implemented?
Additional context
No response
I recently struggled to get this to work too so I can share what I found.
This is currently implemented in the GitHub version of this repo, but not in the release on PyPI that you get by installing the library through a package manager: those published versions are many months out of date and hard-code gpt-3.5-turbo as the grader.
Lines 29-32 in evals/elsuite/modelgraded/classify.py show you how this feature is implemented: the last completion_fn given is treated as the evaluation function.
Completion functions in turn can be specified in a comma-separated string. The logic for this is at evals/cli/oaieval.py lines 142-145.
Concretely, a string like "gpt-4,gpt-3.5-turbo" seems to work for me to get gpt-4 to be the completer and gpt-3.5-turbo the one grading the responses.
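To make the mechanics concrete, here is a small self-contained sketch (not the library's actual code, and the function name is illustrative) that mirrors the two pieces of logic cited above: the comma-splitting in oaieval.py and the last-entry-as-grader rule in classify.py:

```python
# Sketch of the behavior described above: the CLI splits the completion-fn
# argument on commas, and the model-graded eval then treats the last
# completion function as the grader.
def split_completion_fns(arg: str):
    fns = arg.split(",")
    eval_fn = fns[-1]      # last entry grades the responses
    if len(fns) > 1:
        fns = fns[:-1]     # the rest generate the completions
    return fns, eval_fn

# "gpt-4,gpt-3.5-turbo": gpt-4 completes, gpt-3.5-turbo grades
print(split_completion_fns("gpt-4,gpt-3.5-turbo"))  # → (['gpt-4'], 'gpt-3.5-turbo')

# With a single entry, the same model both completes and grades
print(split_completion_fns("gpt-4"))  # → (['gpt-4'], 'gpt-4')
```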
However, be warned that there seems to be a bug where model-graded eval execution can hang for a long time in a way that other evals don't, and this appears unrelated to rate limits.
I had opened a PR last week (#1418) where I address this issue but forgot to mention it here.
Regarding #1418: A new PR is not necessary for setting the evaluating model (though the feature really should be documented), since the full relevant lines are:
# treat last completion_fn as eval_completion_fn
self.eval_completion_fn = self.completion_fns[-1]
if len(self.completion_fns) > 1:
self.completion_fns = self.completion_fns[:-1]
If you pass in multiple completion functions (as a comma-separated list) into completion_fns, then the last one will be treated as the evaluating model.
But wouldn't the task then be run on all of the passed completion functions?
If you want to run the eval with modelA and run the grading with modelB, then you can pass the string "modelA,modelB" as the name of the completer.
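As a command line, that looks like the following (the eval name is a placeholder; substitute a model-graded eval registered in your install). The runnable part below only demonstrates how the comma-separated string splits:

```shell
# Illustrative invocation only: modelA generates, modelB grades.
#   oaieval modelA,modelB <your-modelgraded-eval>
#
# How the comma-separated string splits (runnable demo):
fns="modelA,modelB"
echo "completer: ${fns%,*}"   # text before the last comma → modelA
echo "grader: ${fns##*,}"     # text after the last comma → modelB
```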