lm-evaluation-harness
Add Logits to OpenAI ChatCompletions model
OpenAI added logits back to their ChatCompletions API.
This means we can re-add support for all tasks to this LM type! See OpenAICompletionsLM for an example of how to do this.
Contributions on this would be very welcome if people have the bandwidth, as I may not get to this super soon, but it's high priority.
I've been informed that returning logprobs for the prompt / input, which we would need in order to support loglikelihood requests, is not included.
However, supporting logprobs for ChatCompletions-mirroring APIs that do return logprobs, including with echo=True, would still be very desirable.
Hi Hailey, I'm interested in contributing to this GitHub issue, but I would appreciate some additional context to better understand the task. Could you please provide more details? Specifically, it would be helpful if you could point me to which file(s) are affected and what specific changes are needed. Actually, overall, I need more clarification. Thank you very much for your attention to this matter, looking forward to hearing from you! Cheers!
Language models return output in the form of "logits" / "logprobs" (log probabilities) over their vocabulary of possible next tokens. You can use these to sample text probabilistically or deterministically from, or, these allow you to get the estimated (log) probability of a string by the language model.
However, many closed model providers such as OpenAI stopped providing this info to users, and just gave generated text out. They've re-added that feature for their chat models now. This will allow us to run the models on loglikelihood-based multiple choice QA tasks like hellaswag, which are currently implemented by taking an LM, running it on each (input + answer) and asking it to return the log probs on not just the answer, but also the input (echo=True allows this in OpenAI's Completions API), then taking only the logprobs from the answer portion. We then compare the logprobability of each multiple choice answer and say the model chose the one it thinks is most likely. Currently, because logprobs weren't available in OpenAI's ChatCompletions, we couldn't evaluate on the tasks we defined this way, and would need to write a new task that scores via just asking the model to generate text and checking if it matches the right answer.
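To make that concrete, here is a minimal sketch of the loglikelihood-based scoring just described (the helper names here are illustrative, not the harness's actual API):

# Illustrative sketch (not the harness's actual API) of loglikelihood-based
# multiple choice scoring: score each candidate answer by the summed logprob
# of its tokens conditioned on the question, then pick the argmax.

def get_loglikelihood(context: str, continuation: str) -> float:
    """Hypothetical helper: log P(continuation | context) under the model,
    e.g. obtained via echo=True on a Completions-style API."""
    raise NotImplementedError  # backend-specific

def score_multiple_choice(question: str, choices: list[str]) -> int:
    scores = [get_loglikelihood(question, choice) for choice in choices]
    # The model "chooses" the answer it assigns the highest loglikelihood.
    return max(range(len(choices)), key=lambda i: scores[i])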
The steps to complete this feature would be:
- Check out a provider that mirrors OpenAI's ChatCompletions API (e.g. Together, who also provide free credits upon signup): https://docs.together.ai/docs/openai-api-compatibility . Does this (and does OpenAI) allow for echo=True as a parameter for their chat models, allowing us to get log probabilities on a target string by feeding in (input + target) and subsetting the logprobabilities to only the target part?
- If yes (you can get echo=True from Together or other providers usable with OpenAI's Python SDK), then port over https://github.com/EleutherAI/lm-evaluation-harness/blob/b69ca72ec3a0294638382e0f90cf32f90d761b44/lm_eval/models/openai_completions.py#L174C2-L226C1 from OpenAICompletionsLM into OpenAIChatCompletionsLM, making changes as appropriate.
- If yes, test this on an open-source model and compare to our Huggingface implementation.
- If no, then investigate whether we can feed in just (input) to these APIs and still measure the logprobability of a target string from them.
Hi all, happy to help with this since I'm more familiar with the OpenaiChatCompletionsLM class after working on it a bit. Let me know if you'd like support or scaffolding for this!
I started by checking to see if OpenAI exposes the echo parameter for their chat completion models:
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
    echo=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
and received
File "pyenv/versions/3.10.0/lib/python3.10/site-packages/openai/_utils/_utils.py", line 272, in wrapper
    return func(*args, **kwargs)
TypeError: Completions.create() got an unexpected keyword argument 'echo'
which makes sense, since CLIChatCompletionCreateArgs doesn't allow for it.
Something I'm missing in understanding: if in the task we only use the logprobs of the response for the evaluation, what does computing the logprobs of the input string give us? Do we also use this as part of the calculation?
taking an LM, running it on each (input + answer) and asking it to return the log probs on not just the answer, but also the input (echo=True allows this in OpenAI's Completions API), then taking only the logprobs from the answer portion.
Something I'm missing in understanding: if in the task we only use the logprobs of the response for the evaluation, what does computing the logprobs of the input string give us? Do we also use this as part of the calculation?
For multi-token continuations/targets, to get the loglikelihood of the whole target, we need to get the logprob of token 0 conditioned on the input/context, the logprob of token 1 of the target conditioned on (context + token 0 of target), and so on.
With echo=True, we feed in (context + continuation) and get out all the logprobs which we can subset to the target string’s loglikelihood.
If we don't have echo=True, then we can only feed in the inputs. Then, if the model does not output token 0, the logprob of token 1 at target position 1 does not correspond to the right quantity, and we can't accurately compute the loglikelihood in a (single) API call for multi-token continuations.
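For concreteness, this is roughly what the echo=True approach looks like against the legacy Completions API (just a sketch: the model name is a placeholder, echo + logprobs support varies by model and provider, and real code handles the context/continuation token boundary more carefully):

from openai import OpenAI

client = OpenAI()

context = "Question: What is the capital of France?\nAnswer:"
target = " Paris"

# echo=True returns the prompt's own tokens and logprobs; max_tokens=0 means
# we don't generate anything new (this is what the legacy Completions API allows).
resp = client.completions.create(
    model="davinci-002",  # placeholder: pick a model/provider that supports echo + logprobs
    prompt=context + target,
    max_tokens=0,
    echo=True,
    logprobs=1,
)
lp = resp.choices[0].logprobs
# Keep only the logprobs of tokens that start inside the target span.
# (The very first prompt token comes back with a logprob of None.)
continuation_logprobs = [
    tok_lp
    for offset, tok_lp in zip(lp.text_offset, lp.token_logprobs)
    if offset >= len(context) and tok_lp is not None
]
loglikelihood = sum(continuation_logprobs)
print(loglikelihood)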
Thanks, this is super helpful. I also just tested on Together and am getting the same issue, because it's OpenAI Completions-compatible:
from openai import OpenAI

system_content = "You are a travel agent. Be descriptive and helpful."
user_content = "Tell me about San Francisco"

client = OpenAI(
    api_key="TOGETHER_API_KEY",
    base_url="https://api.together.xyz/v1",
)
chat_completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
    temperature=0.7,
    max_tokens=1024,
    echo=True,
)
response = chat_completion.choices[0].message.content
print("Together response:\n", response)
TypeError: Completions.create() got an unexpected keyword argument 'echo'
It sounds like, based on this, the next step is: investigate whether we can feed in just (input) to these APIs and still measure the logprobability of a target string from them.
Where would we want to feed this output into?
It looks like this would be a good place to start to see how to get them in the completions call, and then we'd want to see how to pull them in here, the same way we do here?
Darn, that makes sense! Thanks for checking on this.
It sounds like, based on this, the next step is: investigate whether we can feed in just (input) to these APIs and still measure the logprobability of a target string from them. Where would we want to feed this output into?
We could conceivably achieve measurement of target string loglikelihood in O(target len) API calls but this is quite expensive.
Further, since I think OpenAI only returns up to 5 top token logprobs, we are limited by this: if the logprobs for our next desired target token do not appear in the top 5, we are out of luck.
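To make the cost concrete, the per-token workaround over the chat API would look roughly like the sketch below; note that the chat template wrapping means this isn't raw-text conditioning, and it breaks whenever the next target token falls outside the returned top logprobs:

from openai import OpenAI

client = OpenAI()

def chat_loglikelihood(context: str, target_tokens: list[str]) -> float:
    """One API call per target token: feed (context + target-so-far) and read the
    next target token's logprob out of the returned top logprobs, if present."""
    total = 0.0
    prefix = context
    for tok in target_tokens:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prefix}],  # wrapped in the chat template
            max_tokens=1,
            logprobs=True,
            top_logprobs=5,  # capped by the provider
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        match = next((t for t in top if t.token == tok), None)
        if match is None:
            # The desired target token didn't appear in the top logprobs: stuck.
            raise ValueError(f"token {tok!r} not in top logprobs")
        total += match.logprob
        prefix += tok
    return total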
Given this, one option may be to add chat templating support for local OpenAI Completions once local Completions support is enabled, and get echo=True that way. Would you be willing to take on local-completions as a next step?
Orthogonal to this, we should think about how we want to support greedy-gen exact match vs. multichoice loglikelihood for multiple choice tasks to have both variants in some unified way.
Further, since I think OpenAI only returns up to 5 top token logprobs, we are limited by this: if the logprobs for our next desired target token do not appear in the top 5, we are out of luck.
Ah, this is good to know, thanks!
add chat templating support for local OpenAI Completions once local Completions support is enabled, and get echo=True that way. Would you be willing to take on local-completions as a next step?
Yep, local completions sounds good. Want to make sure I'm thinking about this the right way:
- Add case statements in here for branching: if it's a local model, set echo=True and test.
For this piece:
we should think about how we want to support greedy-gen exact match vs. multichoice loglikelihood for multiple choice tasks to have both variants in some unified way.
Sounds like possibly a separate issue to file and link to this one?
Yep, local completions sounds good. Want to make sure I'm thinking about this the right way:
Add case statements in here for branching: if it's a local model, set echo=True and test
We would always need to use echo=True for both remote and local models! The additions needed for local-completions would be roughly the same as those for local-chat-completions (tokenizer backend selectable between HF and OpenAI, the exact tokenizer specifiable by the user, plus a configurable base_url).
Sounds like possibly a separate issue to file and link to this one?
Definitely, will open it for tracking! Just noting aloud that our best solution here may (unfortunately) be to forgo loglikelihood entirely.
Ah ok, so these changes would be to enable local-completions, which is not the same as local-chat-completions but does have the ability to use echo=True (for legacy reasons), to also run locally (just clarifying for myself).
That makes sense, can start on those changes!
That's correct!
Cool. Just started working on this to echo OpenaiChatCompletionsLM and noticed we removed the tokenizer from that class:
- https://github.com/EleutherAI/lm-evaluation-harness/pull/1186/files
- https://github.com/EleutherAI/lm-evaluation-harness/pull/1191
Are we calling it a different way or excluding it entirely for OpenaiChatCompletionsLM?
The OpenAI ChatCompletions API actually fully abstracts away its tokenizer, so we don't need a tokenizer for openai-chat-completions! The same isn't true for Completions; there we still need a tokenizer.
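For example (a sketch using tiktoken, with the encoding choice depending on the model), the tokenizer is what lets us work out how many of the echoed logprobs belong to the continuation rather than the context:

import tiktoken

# For Completions we tokenize locally so we know how many of the echoed
# tokens/logprobs belong to the continuation rather than the context.
enc = tiktoken.get_encoding("cl100k_base")  # encoding choice depends on the model

context = "Question: What is the capital of France?\nAnswer:"
continuation = " Paris"

n_context = len(enc.encode(context))
n_continuation = len(enc.encode(context + continuation)) - n_context
# With echo=True, the last `n_continuation` entries of token_logprobs are the
# ones we sum for the continuation's loglikelihood.
print(n_context, n_continuation)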
Ok, so here is a fun dilemma:
In starting to implement the code (https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1), I found that, due to this issue, I needed to update the default base model to gpt-3.5-turbo-instruct based on their suggested change, but that model doesn't support both logprobs and echo=True! 🤦‍♀️
A couple of ideas:
- I can focus on testing for the local case and set the OpenAI model to None
- set echo but not logprobs
- something else that I'm not aware of that might be helpful here
Let me know what you think.
sigh
I think in this case we should support local Completions models with echo=True (assuming that these will continue to support echo=True), and otherwise, for non-local / OpenAI-native cases, raise an error saying logits of the form we need are not supported when someone tries to run the loglikelihood() or loglikelihood_rolling() methods.
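Something along these lines, presumably (a sketch only; these would be methods on OpenaiChatCompletionsLM, and the exact message and placement are up for discussion):

# Sketch: methods on OpenaiChatCompletionsLM (method names follow the harness's
# LM interface; wording is illustrative).

def loglikelihood(self, requests):
    raise NotImplementedError(
        "Loglikelihood (and therefore multiple-choice) requests are not supported "
        "by chat APIs that cannot return prompt logprobs (echo=True); consider a "
        "Completions-style endpoint such as local-completions instead."
    )

def loglikelihood_rolling(self, requests):
    raise NotImplementedError("Rolling loglikelihood is not supported for this API type.")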
Hi Hailey and Vicki, I'm still interested in working on this issue with you. Do you think there is any room left for another developer to jump on it with Vicki?
Hey @gmottajr, absolutely! Does the conversation and code here so far make sense?
Feel free to take the branch I pushed and linked to in my comment above and develop from it. Would that work?
Let me know if you need additional pointers or help (and for sure Hailey can also assist better than me in cases of "do we want to do X this way?").
This has been a very interesting conversation, thank you @veekaybee @haileyschoelkopf .
On a separate but related note, @haileyschoelkopf do you know how exactly OpenAI runs these benchmarks? Surely they must need the same logprobs that we do. One possibility is that they are running against some internal API where they do have access to the logprobs, but it's just not opened up.
OpenAI certainly has access to model logprobs if they desire them (and definitely uses them for things like evaluating perplexity).
As mentioned earlier in this thread, multiple choice benchmarks can be measured using generation and exact match, and we'll probably want to support this. OpenAI's evals framework likely shows to some extent how they do this: https://github.com/openai/evals
Awesome @veekaybee! 🥳
Hey @gmottajr, absolutely! Does the conversation and code here so far make sense?
Feel free to take the branch I pushed and linked to in my comment above and develop from it. Would that work?
Let me know if you need additional pointers or help (and for sure Hailey can also assist better than me in cases of "do we want to do X this way?").
Thank you very much for your positive response, @veekaybee. 😄 🚀
That sounds like music to my ears! 🎼 🎵 🤩
I would like to fork the branch you mentioned, but could not really find it.
Yeah, for sure, I do need some additional pointers and help. How does it sound if we talk directly through Slack, Discord, MS Teams, or any other collaboration software that's more familiar to you?
Looking forward to hearing back from you (and jumping right into this code change). 👀
Best regards,
Gerson Jr.
The branch is called logprob-completions and you can find a reference to it here: https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1
Feel free to find us on the EleutherAI Discord in the #lm-thunderdome channel, same username :)
Hi @veekaybee, I'm not sure why, but all of your hyperlinks are pointing to a PR instead of where you intend to point. This link, for example, does not take me to any branch: https://github.com/EleutherAI/lm-evaluation-harness/compare/logprob-completions?expand=1. It takes me here:
Same thing is happening with your Discord link:
That is why it really looks like there is something weird happening when you send links, and I wonder whether you might actually be trying to point to another URL.
Anyway, I just guessed you are probably talking about this branch here, but I'm not sure: https://github.com/EleutherAI/lm-evaluation-harness/tree/logprob-completions.
Please let me know if I got it right. I'm going to fork from there. Could you send me the Discord link again, please? Thank you very much.
Yes, https://github.com/EleutherAI/lm-evaluation-harness/tree/logprob-completions is the correct link! You can visit the #lm-thunderdome channel on discord.gg/eleutherai to ask questions or discuss.
@gmottajr were you able to get started on this? Let me know if you need any help, or I'm happy to put together a sample PR for us to review.
Have you tried using the logit_bias param for openai.chat.completions, e.g. with gpt-4? This param allows for generating only those tokens that have been previously biased (with quite a large value): the model runs a forward pass and gets logits for the current token, then some value is added to the particular logits (for example, to the logits of the tokens for the letters A, B, C, D for MMLU), and then log-probs are computed. gpt-4 does not return the probs for input tokens. But if you pass only the ctx and bias the possible continuation logits with a large enough value, the model will exclusively generate tokens with these biased logits, and gpt-4 can return log-probs for generated tokens.
So it may be a way to make the OpenAI chat model generate only the tokens that we want and get their log-probs, which will still be comparable (bias all tokens by the same value).
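A rough sketch of that idea (token ids via tiktoken; the prompt, bias value, and letter tokens are illustrative, and per the update further down the thread the returned logprobs no longer reflect logit biases anyway):

import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4")

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"

# Bias the answer-letter tokens hard enough that the model can only emit one of
# them, then read that token's logprob back. (Whether "A" vs " A" is the right
# token depends on the prompt format.)
letters = ["A", "B", "C", "D"]
letter_ids = [enc.encode(letter)[0] for letter in letters]

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=4,
    logit_bias={str(tok_id): 100 for tok_id in letter_ids},
)
top = resp.choices[0].logprobs.content[0].top_logprobs
answer_logprobs = {t.token: t.logprob for t in top if t.token in letters}
print(answer_logprobs)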
Had posted this on discord a while back, but cross-posting an update:
Unfortunately, logits (with echo=True) are still not supported by the vast majority of API models, and now logit biases cannot be used to work around this: following https://arxiv.org/abs/2403.06634 and https://arxiv.org/abs/2403.09539, logits returned on outputs no longer reflect logit biases.
There are probably other hacks that could be applied to "simulate" ranking-based multiple choice or to try to artificially, say, constrain an OAI chat model to only output A, B, C, or D by using logit biases, but these all seem worse than just using free-form generation to evaluate these "chatty" models.
What we'd like to move toward long-term, therefore, is to have tasks support multiple "presets" so that one can eval MMLU generatively or using loglikelihoods. The caveat here is that there is a tension between giving users more options to play with and keeping the advantages that a single standardized task implementation gives. We haven't yet decided what the right balance to strike here is: we certainly need to allow at least a bit more configurability to make the currently loglikelihood-only tasks usable for API or local server models, but we also don't want to move too far in the other direction in doing so.
If people do have feedback on this front, it is appreciated. Hope this provides context on the logits-for-chat-APIs front, though.
Almost all of the Polish tasks we have created come in two versions, multiple choice and generative, e.g. https://github.com/speakleash/lm-evaluation-harness/tree/polish2/lm_eval/tasks/polish_ppc
Results are published on https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard with task name suffixes _g or _mc. There is also a gpt-3 model tested, but only on _g tasks.
Could we set logprobs to a large number for vLLM and the OpenAI completion API so that we can do multiple choice tasks using one-token generation?
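Against a vLLM OpenAI-compatible server, something like the sketch below could work, since vLLM typically isn't bound by OpenAI's cap on returned logprobs (the server URL, model name, and prompt format are placeholders):

from openai import OpenAI

# Point the OpenAI client at a vLLM OpenAI-compatible server (dummy API key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"

resp = client.completions.create(
    model="<served-model-name>",
    prompt=prompt,
    max_tokens=1,
    logprobs=20,  # vLLM typically allows more than OpenAI's cap here
)
# Legacy Completions format: top_logprobs[0] is a {token: logprob} dict for the
# single generated position; compare the answer letters' logprobs directly.
top = resp.choices[0].logprobs.top_logprobs[0]
answer_logprobs = {tok: lp for tok, lp in top.items() if tok.strip() in {"A", "B", "C", "D"}}
print(answer_logprobs)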
I am wondering if there is a solution for this. I am using an API provider other than OpenAI, with the OpenAI schema, but I'm getting the same error, 'No support for logits'. What's the solution?