
Token usage calculation is not working for ChatOpenAI

Open fabioperez opened this issue 1 year ago • 8 comments

Token usage calculation is not working for ChatOpenAI.

How to reproduce

from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)
chat = ChatOpenAI(model_name="gpt-3.5-turbo")
with get_openai_callback() as cb:
    result = chat([HumanMessage(content="Tell me a joke")])
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Successful Requests: {cb.successful_requests}")
    print(f"Total Cost (USD): ${cb.total_cost}")

Output:

Total Tokens: 0
Prompt Tokens: 0
Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0

Possible fix

The following patch fixes the issue, but it breaks the linter.

From f60afc48c9082fc6b09d69b8c8375353acc9fc0b Mon Sep 17 00:00:00 2001
From: Fabio Perez <[email protected]>
Date: Mon, 3 Apr 2023 19:06:34 -0300
Subject: [PATCH] Fix token usage in ChatOpenAI

---
 langchain/chat_models/openai.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/langchain/chat_models/openai.py b/langchain/chat_models/openai.py
index c7ee4bd..a8d5fbd 100644
--- a/langchain/chat_models/openai.py
+++ b/langchain/chat_models/openai.py
@@ -274,7 +274,9 @@ class ChatOpenAI(BaseChatModel, BaseModel):
             gen = ChatGeneration(message=message)
             generations.append(gen)
         llm_output = {"token_usage": response["usage"], "model_name": self.model_name}
-        return ChatResult(generations=generations, llm_output=llm_output)
+        result = ChatResult(generations=generations, llm_output=llm_output)
+        self.callback_manager.on_llm_end(result, verbose=self.verbose)
+        return result
 
     async def _agenerate(
         self, messages: List[BaseMessage], stop: Optional[List[str]] = None
-- 
2.39.2 (Apple Git-143)

I tried to change the signature of on_llm_end (langchain/callbacks/base.py) to:

    async def on_llm_end(
        self, response: Union[LLMResult, ChatResult], **kwargs: Any
    ) -> None:

but this will break many places, so I'm not sure if that's the best way to fix this issue.
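
One alternative that might keep the types intact (a sketch only, untested): leave on_llm_end's signature alone and wrap the chat generations in an LLMResult before firing the callback, since ChatGeneration subclasses Generation. The tail of _generate would then look roughly like:

from langchain.schema import LLMResult

# Sketch: hand the callback an LLMResult instead of changing its signature.
# ChatGeneration is a Generation subclass, so this should satisfy the linter.
llm_result = LLMResult(generations=[generations], llm_output=llm_output)
self.callback_manager.on_llm_end(llm_result, verbose=self.verbose)
return ChatResult(generations=generations, llm_output=llm_output)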

fabioperez avatar Apr 03 '23 22:04 fabioperez

having the same issue... following thread

kinged007 avatar Apr 04 '23 22:04 kinged007

Is this feature meant to report the token counts before the actual execution, or after execution?

AshuBik avatar Apr 05 '23 02:04 AshuBik

For me, after, as in the OP example

kinged007 avatar Apr 05 '23 08:04 kinged007

That's correct, @kinged007.

@hwchase17 Could you guide me to a possible solution so I can create a PR?

fabioperez avatar Apr 05 '23 11:04 fabioperez

Sorry, I am deviating from the problem, but should we have something to calculate the tokens beforehand as well?

AshuBik avatar Apr 05 '23 11:04 AshuBik
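
For estimating tokens before the call is made, a rough sketch with tiktoken (assuming the tiktoken package is installed; this ignores the few extra tokens the chat message format adds per message):

import tiktoken

# Count the tokens of a prompt locally, before sending it to the API.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "Tell me a joke"
print(len(encoding.encode(prompt)))

ChatOpenAI also exposes get_num_tokens and get_num_tokens_from_messages helpers, which should give similar numbers.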

also for embeddings

Nemunas avatar Apr 06 '23 21:04 Nemunas

Related to https://github.com/hwchase17/langchain/pull/1924, please take a look at the discussion there.

stephenleo avatar Apr 10 '23 02:04 stephenleo

Is there any working solution for this? @stephenleo I tried yours, but it's not working, unfortunately.

rl3250 avatar Apr 14 '23 03:04 rl3250

Still an issue today for me. Code to reproduce.

model_name = 'gpt-4'

with get_openai_callback() as cb:
    chat4 = ChatOpenAI(
        temperature=0.1,
        model=model_name,
    )

response = chat4(chat_prompt)

print(cb)

Results:

Tokens Used: 0
	Prompt Tokens: 0
	Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0

captivus avatar Jun 12 '23 16:06 captivus

I'm on the JS repo/branch; however, I believe the issue comes from here.

I'm using

const chatmodel = new ChatOpenAI({
    modelName: "gpt-3.5-turbo",
    temperature: 0.2,
    maxTokens: 200,
    streaming: true,
    callbacks: [
        {
            handleLLMEnd: (output) => {
                console.log(output); // tokenUsage is empty
            },
        },
    ],
});

And I face the same problem. Token usage is an empty object.

I just saw that this also breaks ConversationSummaryBufferMemory:

const context_chain = ConversationalRetrievalQAChain.fromLLM(chatmodel, vectorStoreRetriever, {
    memory: new ConversationSummaryBufferMemory({ returnMessages: true, memoryKey: "chat_history", humanPrefix: "Customer", maxTokenLimit: 1024 }),
    verbose: true
});
// Cannot read properties of undefined (reading 'getNumTokens')

I just tested it with gpt-3.5 and 4. Both have this issue. streaming: false didn't help either. Maybe the API has changed?

Fusseldieb avatar Aug 10 '23 21:08 Fusseldieb

@captivus You have to call the model within the context manager for it to work. Since you call it outside the context, the token counting callback is already removed.

Basically indent the call.

Change

model_name = 'gpt-4'

with get_openai_callback() as cb:
    chat4 = ChatOpenAI(
        temperature=0.1,
        model=model_name,
    )

response = chat4.predict("foo")

print(cb)

To

model_name = 'gpt-4'

with get_openai_callback() as cb:
    chat4 = ChatOpenAI(
        temperature=0.1,
        model=model_name,
    )

    response = chat4.predict("foo")

print(cb)

hinthornw avatar Aug 10 '23 22:08 hinthornw

@hinthornw this doesn't work for streaming responses, though. Is there any way to make OpenAICallbackHandler work with ChatOpenAI(streaming=True)? The issue is that on_llm_end is entered before the response is complete, which leads to the usage being 0.

gustavz avatar Aug 24 '23 06:08 gustavz
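
One client-side workaround (a sketch, assuming tiktoken is installed and a langchain version where models accept callbacks=[...]): count the streamed tokens yourself in a custom handler. This only approximates completion tokens; it is not the usage OpenAI bills.

import tiktoken
from langchain.callbacks.base import BaseCallbackHandler
from langchain.chat_models import ChatOpenAI

class StreamedCompletionTokenCounter(BaseCallbackHandler):
    """Counts completion tokens client-side, since streamed responses
    carry no usage block for the OpenAI callback to read."""

    def __init__(self, model_name: str = "gpt-3.5-turbo"):
        self.encoding = tiktoken.encoding_for_model(model_name)
        self.completion_tokens = 0

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        # Each streamed chunk is usually one token; re-encoding is only an
        # approximation when chunks don't align with token boundaries.
        self.completion_tokens += len(self.encoding.encode(token))

counter = StreamedCompletionTokenCounter()
chat = ChatOpenAI(streaming=True, callbacks=[counter])
chat.predict("Tell me a joke")
print(counter.completion_tokens)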

This is how I managed to count tokens for streaming: true with callbacks:

const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo", streaming: true });
const chain = new LLMChain({ llm: model, prompt })
const { text: assistantResponse } = await chain.call({
    query: query,
  }, {
    callbacks: [
      {
        handleChatModelStart: async (llm, messages) => {
          const tokenCount = tokenCounter(messages[0][0].content);
          // The prompt is available here: messages[0][0].content
        },
        handleChainEnd: async (outputs) => {
          const { text: outputText } = outputs;
          // outputText is the response from the chat call
          const tokenCount = tokenCounter(outputText);
        }
      }
    ]
  }
);

liowalex avatar Sep 27 '23 19:09 liowalex

@liowalex I guess we really want the count that OpenAI is returning. Note that langchain will retry failed calls, which also count towards the token rate limit, so locally counted input and output tokens are not the complete picture.
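
For non-streaming calls, the provider-reported usage does come back on llm_output if you call generate() directly (a rough sketch; the exact shape of llm_output can vary by version):

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# Read the usage OpenAI itself reports, rather than counting locally.
chat = ChatOpenAI(model_name="gpt-3.5-turbo")
result = chat.generate([[HumanMessage(content="Tell me a joke")]])
print(result.llm_output["token_usage"])
# {'prompt_tokens': ..., 'completion_tokens': ..., 'total_tokens': ...}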

14skywalker avatar Oct 17 '23 06:10 14skywalker

> This is how I managed to count tokens for streaming: true with callbacks: […]

Just curious, what is this tokenCounter?

RadoslavRadoev avatar Nov 16 '23 07:11 RadoslavRadoev