Token usage calculation is not working for ChatOpenAI
Token usage calculation is not working for ChatOpenAI.
How to reproduce
from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage,
)

chat = ChatOpenAI(model_name="gpt-3.5-turbo")

with get_openai_callback() as cb:
    result = chat([HumanMessage(content="Tell me a joke")])
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")
Output:
Total Tokens: 0
Prompt Tokens: 0
Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0
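For context, the numbers the callback prints come from the `usage` field OpenAI returns with each non-streaming response; the handler simply accumulates it per request. A minimal sketch of that bookkeeping, with illustrative prices (assumed for this example, not current OpenAI rates):

```python
# Shape of the "usage" dict OpenAI returns alongside a chat completion.
usage = {"prompt_tokens": 11, "completion_tokens": 27, "total_tokens": 38}

# Illustrative per-1K-token prices; real rates vary by model and over time.
PRICE_PER_1K = {"prompt": 0.0015, "completion": 0.002}

# The callback's cost figure is just this arithmetic, summed over requests.
total_cost = (usage["prompt_tokens"] / 1000 * PRICE_PER_1K["prompt"]
              + usage["completion_tokens"] / 1000 * PRICE_PER_1K["completion"])
print(f"Total Tokens: {usage['total_tokens']}")
print(f"Total Cost (USD): ${total_cost}")
```

When the callback prints all zeros, as above, it means `on_llm_end` was never invoked with this usage data.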
Possible fix
The following patch fixes the issues, but breaks the linter.
From f60afc48c9082fc6b09d69b8c8375353acc9fc0b Mon Sep 17 00:00:00 2001
From: Fabio Perez <[email protected]>
Date: Mon, 3 Apr 2023 19:06:34 -0300
Subject: [PATCH] Fix token usage in ChatOpenAI
---
langchain/chat_models/openai.py | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/langchain/chat_models/openai.py b/langchain/chat_models/openai.py
index c7ee4bd..a8d5fbd 100644
--- a/langchain/chat_models/openai.py
+++ b/langchain/chat_models/openai.py
@@ -274,7 +274,9 @@ class ChatOpenAI(BaseChatModel, BaseModel):
gen = ChatGeneration(message=message)
generations.append(gen)
llm_output = {"token_usage": response["usage"], "model_name": self.model_name}
- return ChatResult(generations=generations, llm_output=llm_output)
+ result = ChatResult(generations=generations, llm_output=llm_output)
+ self.callback_manager.on_llm_end(result, verbose=self.verbose)
+ return result
async def _agenerate(
self, messages: List[BaseMessage], stop: Optional[List[str]] = None
--
2.39.2 (Apple Git-143)
I tried to change the signature of on_llm_end (langchain/callbacks/base.py) to:

async def on_llm_end(
    self, response: Union[LLMResult, ChatResult], **kwargs: Any
) -> None:
but this will break many places, so I'm not sure if that's the best way to fix this issue.
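One way to sidestep widening the signature is an adapter that converts a ChatResult into an LLMResult-shaped object before the callback fires, so on_llm_end keeps a single parameter type. A sketch with stand-in dataclasses (these are not LangChain's actual classes):

```python
from dataclasses import dataclass, field

# Stand-ins for LangChain's result types, just enough for the adapter idea.
@dataclass
class LLMResult:
    generations: list          # list of lists of generations
    llm_output: dict = field(default_factory=dict)

@dataclass
class ChatResult:
    generations: list          # flat list of chat generations
    llm_output: dict = field(default_factory=dict)

def as_llm_result(result):
    # Adapter: callbacks keep their single-type signature; chat results
    # are wrapped into the nested-list shape LLMResult expects.
    if isinstance(result, ChatResult):
        return LLMResult(generations=[result.generations],
                         llm_output=result.llm_output)
    return result

chat = ChatResult(generations=["hi"],
                  llm_output={"token_usage": {"total_tokens": 12}})
print(as_llm_result(chat).llm_output["token_usage"]["total_tokens"])  # 12
```

With this shape, the token_usage dict survives the conversion and the callback can accumulate it unchanged.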
having the same issue... following thread
Is this feature meant to report token usage before the actual execution, or after?
For me, after, as in the OP example
That's correct, @kinged007.
@hwchase17 Could you guide me to a possible solution so I can create a PR?
Sorry, I am deviating from the problem, but should we have something to calculate the tokens beforehand as well?
also for embeddings
Related to https://github.com/hwchase17/langchain/pull/1924; please take a look at the discussion there.
Is there any active working solution for this? @stephenleo I tried yours, but it's not working, unfortunately.
Still an issue today for me. Code to reproduce.
model_name = 'gpt-4'

with get_openai_callback() as cb:
    chat4 = ChatOpenAI(
        temperature=0.1,
        model=model_name,
    )

response = chat4(chat_prompt)
print(cb)
Results:
Tokens Used: 0
Prompt Tokens: 0
Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0
I'm on the JS repo/branch, however, the issue comes from here, I believe.
I'm using
const chatmodel = new ChatOpenAI({
  modelName: "gpt-3.5-turbo",
  temperature: 0.2,
  maxTokens: 200,
  streaming: true,
  callbacks: [
    {
      handleLLMEnd: (output) => {
        console.log(output); // tokenUsage is empty
      },
    },
  ],
});
And I face the same problem. Token usage is an empty object.
I just saw that this also breaks ConversationSummaryBufferMemory:
const context_chain = ConversationalRetrievalQAChain.fromLLM(chatmodel, vectorStoreRetriever, {
  memory: new ConversationSummaryBufferMemory({
    returnMessages: true,
    memoryKey: "chat_history",
    humanPrefix: "Customer",
    maxTokenLimit: 1024,
  }),
  verbose: true,
});
// Cannot read properties of undefined (reading 'getNumTokens')
I just tested it with gpt-3.5 and 4. Both have this issue. streaming: false didn't help either. Maybe the API has changed?
@captivus You have to call the model within the context manager for it to work. Since you call it outside the context, the token counting callback is already removed.
Basically indent the call.
Change

model_name = 'gpt-4'
with get_openai_callback() as cb:
    chat4 = ChatOpenAI(
        temperature=0.1,
        model=model_name,
    )
response = chat4.predict("foo")  # called after the context manager has exited
print(cb)

To

model_name = 'gpt-4'
with get_openai_callback() as cb:
    chat4 = ChatOpenAI(
        temperature=0.1,
        model=model_name,
    )
    response = chat4.predict("foo")  # called inside the context manager
print(cb)
@hinthornw this doesn't work for streaming responses though. Is there any way to make OpenAICallbackHandler work with ChatOpenAI(streaming=True)? The issue is that on_llm_end is entered before the response is complete, which leads to usage being 0.
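Until the callback supports streaming, one workaround is to tally the streamed chunks yourself: each on_llm_new_token invocation roughly corresponds to one completion token. A sketch of such a handler, driven by a stand-in loop instead of a real model (the method names follow LangChain's BaseCallbackHandler interface, but this class is not wired into LangChain here):

```python
class StreamingTokenTally:
    """Accumulates streamed chunks so usage can be derived at the end."""

    def __init__(self):
        self.chunks = []
        self.completion_tokens = 0

    def on_llm_new_token(self, token, **kwargs):
        # Called once per streamed chunk; for OpenAI streaming each chunk
        # is roughly one model token.
        self.chunks.append(token)

    def on_llm_end(self, response=None, **kwargs):
        # With streaming=True the response carries no usage data, so we
        # approximate completion tokens from the chunks we collected.
        self.completion_tokens = len(self.chunks)

# Stand-in for a streamed response; a real run would invoke the handler
# via the model's callback machinery.
handler = StreamingTokenTally()
for tok in ["Why", " did", " the", " chicken"]:
    handler.on_llm_new_token(tok)
handler.on_llm_end()
print(handler.completion_tokens)  # 4
```

Prompt tokens still have to be counted separately (e.g. with a tokenizer over the input messages), since no chunks are streamed for the prompt.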
This is how I managed to count tokens for streaming: true with callbacks:
const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo", streaming: true });
const chain = new LLMChain({ llm: model, prompt });
const { text: assistantResponse } = await chain.call(
  { query: query },
  {
    callbacks: [
      {
        handleChatModelStart: async (llm, messages) => {
          // The prompt is available here: messages[0][0].content
          const tokenCount = tokenCounter(messages[0][0].content);
        },
        handleChainEnd: async (outputs) => {
          const { text: outputText } = outputs;
          // outputText is the response from the chat call
          const tokenCount = tokenCounter(outputText);
        },
      },
    ],
  }
);
@liowalex I guess we really want the count that OpenAI is returning. Note that langchain will retry failed calls, which will also count towards the token rate limit. So input and output tokens are not the complete picture.
> This is how I managed to count tokens for streaming: true with callbacks: (code quoted above)
Just curious, what is this tokenCounter?
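tokenCounter is not a LangChain export; it is presumably a user-supplied helper wrapping a tokenizer such as tiktoken (Python) or js-tiktoken (JS). A rough stand-in that only shows the shape, using an approximate four-characters-per-token heuristic instead of a real tokenizer (the heuristic and the function name are illustrative, not from the thread):

```python
import math

def token_counter(text: str) -> int:
    """Very rough token estimate for English text.

    A real implementation would use tiktoken, e.g.
    tiktoken.encoding_for_model("gpt-3.5-turbo"), and count the encoded
    token IDs; ~4 characters per token is only a ballpark heuristic.
    """
    return math.ceil(len(text) / 4)

print(token_counter("Tell me a joke"))  # 14 chars -> 4
```

For billing-accurate numbers the real tokenizer matters, since the heuristic can drift substantially on code, non-English text, or long words.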