get_openai_callback doesn't work with streaming=True
System Info
langchain 0.0.166
Who can help?
@agola11
Information
- [ ] The official example notebooks/scripts
- [X] My own modified scripts
Related Components
- [X] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [ ] Document Loaders
- [ ] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [X] Callbacks/Tracing
- [X] Async
Reproduction
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.callbacks import get_openai_callback

question = "What is the answer of the meaning of life?"

prompt = PromptTemplate(
    input_variables=["input"],
    template="{input}",
)

llm = ChatOpenAI(temperature=0.7, max_tokens=2000, streaming=True)
chain = LLMChain(llm=llm, prompt=prompt)

with get_openai_callback() as cb:
    print(chain.run(question))
    print("\n\n")
    print(cb)
result
As an AI language model, I do not have a personal belief system or opinion, and therefore, I do not have an answer to this question. The meaning of life is a philosophical and subjective topic that varies from person to person. It is up to individuals to find their own purpose and meaning in life.
Tokens Used: 0
Prompt Tokens: 0
Completion Tokens: 0
Successful Requests: 1
Total Cost (USD): $0.0
When streaming=False is set, it works.
Expected behavior
It should return token usage info whether streaming=True or streaming=False.
related issue https://github.com/hwchase17/langchain/issues/3114
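For context, this is roughly why the counters stay at zero: get_openai_callback only reads token usage out of response.llm_output in on_llm_end, and at the time the streaming OpenAI response did not include a token_usage block, so there is nothing to add up. A minimal sketch of that logic (simplified, not the exact library source; the class name here is only illustrative):

import tiktoken  # not needed for the sketch itself, just for later workarounds
from langchain.schema import LLMResult


class SketchOpenAICallbackHandler:
    """Simplified stand-in for OpenAICallbackHandler to show where the zeros come from."""

    total_tokens = 0
    prompt_tokens = 0
    completion_tokens = 0
    successful_requests = 0

    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
        self.successful_requests += 1
        if response.llm_output is None:
            return
        # With streaming=True the provider reports no usage, so this dict is empty
        # and every counter stays at 0 while successful_requests still increments.
        usage = response.llm_output.get("token_usage", {})
        self.total_tokens += usage.get("total_tokens", 0)
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)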
Hi,
To work around this bug I created my own async cost-calculator handler (requires the tiktoken dependency):
from langchain.callbacks.base import AsyncCallbackHandler
from langchain.schema import LLMResult
from typing import Any, Dict, List
import tiktoken
MODEL_COST_PER_1K_TOKENS = {
    "gpt-4": 0.03,
    "gpt-4-0314": 0.03,
    "gpt-4-completion": 0.06,
    "gpt-4-0314-completion": 0.06,
    "gpt-4-32k": 0.06,
    "gpt-4-32k-0314": 0.06,
    "gpt-4-32k-completion": 0.12,
    "gpt-4-32k-0314-completion": 0.12,
    "gpt-3.5-turbo": 0.002,
    "gpt-3.5-turbo-0301": 0.002,
    "text-ada-001": 0.0004,
    "ada": 0.0004,
    "text-babbage-001": 0.0005,
    "babbage": 0.0005,
    "text-curie-001": 0.002,
    "curie": 0.002,
    "text-davinci-003": 0.02,
    "text-davinci-002": 0.02,
    "code-davinci-002": 0.02,
}
class TokenCostProcess:
    total_tokens: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    successful_requests: int = 0

    def sum_prompt_tokens(self, tokens: int):
        self.prompt_tokens = self.prompt_tokens + tokens
        self.total_tokens = self.total_tokens + tokens

    def sum_completion_tokens(self, tokens: int):
        self.completion_tokens = self.completion_tokens + tokens
        self.total_tokens = self.total_tokens + tokens

    def sum_successful_requests(self, requests: int):
        self.successful_requests = self.successful_requests + requests

    def get_openai_total_cost_for_model(self, model: str) -> float:
        return MODEL_COST_PER_1K_TOKENS[model] * self.total_tokens / 1000

    def get_cost_summary(self, model: str) -> str:
        cost = self.get_openai_total_cost_for_model(model)
        return (
            f"Tokens Used: {self.total_tokens}\n"
            f"\tPrompt Tokens: {self.prompt_tokens}\n"
            f"\tCompletion Tokens: {self.completion_tokens}\n"
            f"Successful Requests: {self.successful_requests}\n"
            f"Total Cost (USD): {cost}"
        )
class CostCalcAsyncHandler(AsyncCallbackHandler):
    model: str = ""
    socketprint = None
    websocketaction: str = "appendtext"
    token_cost_process: TokenCostProcess

    def __init__(self, model, token_cost_process):
        self.model = model
        self.token_cost_process = token_cost_process

    def on_llm_start(self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any) -> None:
        if self.token_cost_process is None:
            return
        # Count prompt tokens locally with tiktoken, since the streaming
        # response does not report usage.
        encoding = tiktoken.encoding_for_model(self.model)
        for prompt in prompts:
            self.token_cost_process.sum_prompt_tokens(len(encoding.encode(prompt)))

    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        print(token)
        # Each streamed token counts as one completion token.
        self.token_cost_process.sum_completion_tokens(1)

    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        self.token_cost_process.sum_successful_requests(1)
I use it this way:
token_cost_process = TokenCostProcess()

chat = ChatOpenAI(
    streaming=True,
    callbacks=[CostCalcAsyncHandler("gpt-3.5-turbo", token_cost_process)],
    temperature=0,
    model_name="gpt-3.5-turbo",
)

...

print(token_cost_process.get_cost_summary("gpt-3.5-turbo"))
I hope this helps someone.
Thanks,
Do you know if there's any progress on that?
I have no idea, Michal! I don't know anyone from the project; my contribution was only as a user.
@hwchase17 @dev2049 The solution by @acoronadoc for cost tracking here seems pretty solid for streaming chat. Do you agree? We could make a PR for it.
Wondering about the same thing (and thanks for the contribution!)
LangSmith records streaming tokens. Any progress?
A variation inspired by @acoronadoc
from typing import Any, Dict, List
import tiktoken
from langchain.callbacks.base import AsyncCallbackHandler
from langchain.callbacks.openai_info import (
MODEL_COST_PER_1K_TOKENS,
get_openai_token_cost_for_model,
)
from langchain.schema import LLMResult
class Cost:
def __init__(
self, total_cost=0, total_tokens=0, prompt_tokens=0, completion_tokens=0
):
self.total_cost = total_cost
self.total_tokens = total_tokens
self.prompt_tokens = prompt_tokens
self.completion_tokens = completion_tokens
def __str__(self):
return (
f"Tokens Used: {self.total_tokens}\n"
f"\tPrompt Tokens: {self.prompt_tokens}\n"
f"\tCompletion Tokens: {self.completion_tokens}\n"
f"Total Cost (USD): ${self.total_cost}"
)
def add_prompt_tokens(self, prompt_tokens):
self.prompt_tokens += prompt_tokens
self.total_tokens += prompt_tokens
def add_completion_tokens(self, completion_tokens):
self.completion_tokens += completion_tokens
self.total_tokens += completion_tokens
class CostCalcCallbackHandler(AsyncCallbackHandler):
def __init__(self, model_name, cost, *args, **kwargs):
self.model_name = model_name
self.encoding = tiktoken.encoding_for_model(model_name)
self.cost = cost
super().__init__(*args, **kwargs)
async def on_llm_start(
self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
) -> None:
for prompt in prompts:
self.cost.add_prompt_tokens(len(self.encoding.encode(prompt)))
async def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
for generation_list in response.generations:
for generation in generation_list:
self.cost.add_completion_tokens(
len(self.encoding.encode(generation.text))
)
if self.model_name in MODEL_COST_PER_1K_TOKENS:
prompt_cost = get_openai_token_cost_for_model(
self.model_name, self.cost.prompt_tokens
)
completion_cost = get_openai_token_cost_for_model(
self.model_name, self.cost.completion_tokens, is_completion=True
)
self.cost.total_cost = prompt_cost + completion_cost
Example usage:
import asyncio
from langchain.chat_models import ChatOpenAI
model_name = "gpt-3.5-turbo"
cost = Cost()
chat = ChatOpenAI(
    streaming=True,
    callbacks=[CostCalcCallbackHandler(model_name, cost)],
    temperature=0,
    model_name=model_name,
)
asyncio.run(chat.apredict(text="What's 2+2?"))
print(cost)
# Tokens Used: 17
# Prompt Tokens: 9
# Completion Tokens: 8
# Total Cost (USD): $2.95e-05
Hi, @shadowlinyf,
I'm helping the LangChain team manage their backlog and am marking this issue as stale. The issue you raised regarding the get_openai_callback function not working with streaming=True has been confirmed and discussed by several users, including acoronadoc, pors, MichalBortkiewicz, and nick-solly. Potential solutions and variations have been proposed, but the issue remains unresolved.
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your understanding and contribution to LangChain!
I am still having this issue with streaming
@nick-solly @acoronadoc Does this still work with the new ChatOpenAI (from langchain_openai import ChatOpenAI)?
I'm using the above callback code/setup on an OpenAI Functions Agent and I am seeing a token count, but when I compare it to the token count on the LangSmith trace, I see a different number of tokens. Do you know why?
In LangSmith i see: 1671 tokens
And in my Callback print i see: 1094 tokens
Does anyone know where the discrepancy comes from, and how to fix it?
@hwchase17 Any thoughts on this? How does LangSmith go about calculating token counts?
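One plausible source of the gap (an assumption, not confirmed for this trace): the tiktoken-based handlers above encode only the raw prompt strings and generation text, while the chat API also bills per-message overhead tokens, function/tool schemas, and function-call arguments, which show up in the real usage LangSmith records. A hedged sketch of the more complete per-message counting, following the pattern from the OpenAI cookbook (the overhead constants are assumptions and vary slightly by model; function schemas are still not counted here):

from typing import Dict, List

import tiktoken


def num_tokens_from_messages(messages: List[Dict[str, str]], model: str = "gpt-3.5-turbo") -> int:
    """Approximate the billed prompt tokens for a chat completion request.

    Assumes roughly 3 overhead tokens per message plus 3 tokens to prime the
    assistant reply, as in the OpenAI cookbook example for gpt-3.5/gpt-4 models.
    """
    encoding = tiktoken.encoding_for_model(model)
    tokens_per_message = 3  # assumed overhead per message
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for value in message.values():
            num_tokens += len(encoding.encode(str(value)))
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens


# Example: this yields a few more tokens than encoding the strings alone.
print(num_tokens_from_messages([{"role": "user", "content": "What's 2+2?"}]))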
I have a solution: set stream_usage=True when you initialize the ChatOpenAI instance, then wrap the call in get_openai_callback():
from langchain_openai import ChatOpenAI
from langchain_community.callbacks import get_openai_callback

llm = ChatOpenAI(temperature=0, stream_usage=True)

with get_openai_callback() as cb:
    for chunk in llm.stream("What is Map Reduce"):
        print(chunk)
    print(cb.prompt_tokens)
    print(cb.completion_tokens)
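For what it's worth, with stream_usage=True the provider-reported usage should also be attached to the streamed message itself, so you can read it off the aggregated chunks instead of (or in addition to) the callback. A small sketch, assuming a recent langchain-openai/langchain-core where AIMessageChunk exposes usage_metadata and chunks can be merged with +:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, stream_usage=True)

# Merge the streamed chunks; the combined message carries usage_metadata
# when the provider returns usage along with the stream.
aggregate = None
for chunk in llm.stream("What is Map Reduce"):
    aggregate = chunk if aggregate is None else aggregate + chunk

if aggregate is not None and aggregate.usage_metadata:
    print(aggregate.usage_metadata)
    # e.g. {'input_tokens': ..., 'output_tokens': ..., 'total_tokens': ...}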