
Include usage key in create_completion when streaming

Open zhudotexe opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe.
Since create_completion may yield text chunks composed of multiple tokens per yield (e.g., in the case of multi-byte Unicode characters), the number of yields may not equal the number of tokens the model actually generated. To get accurate usage statistics for a streamed completion, one currently has to run the final text through the tokenizer again, even though create_completion already tracks the number of tokens generated by the model.
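A small illustration of the mismatch (a sketch only; the exact token split is model-dependent, and the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder path

# A multi-byte character such as an emoji is typically split across
# several tokens; the detokenizer buffers them and yields the decoded
# text as a single chunk, so counting yields undercounts the tokens.
tokens = llm.tokenize("🤔".encode("utf-8"), add_bos=False)
print(len(tokens))  # often > 1 for what arrives as one streamed chunk
```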

Describe the solution you'd like
When stream=True in create_completion, the final chunk yielded should include the usage statistics under the 'usage' key.
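A sketch of how this might look to a caller, assuming the final streamed chunk carries a 'usage' dict shaped like the one in non-streaming responses (this shape is an assumption about the proposed feature, not current behavior):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder path

usage = None
for chunk in llm.create_completion("Q: Name the planets. A:", stream=True):
    print(chunk["choices"][0]["text"], end="")
    # Proposed: the final chunk includes usage statistics directly,
    # so no second tokenizer pass is needed.
    usage = chunk.get("usage", usage)

print(usage)  # e.g. {"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...}
```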

Describe alternatives you've considered

  • Saving the full generated text and running it through the tokenizer again (wasteful; see the sketch after this list)
  • Counting the number of yields and hoping no multi-byte characters appear (hacky and fragile)
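For reference, the first workaround looks roughly like this, re-encoding and re-tokenizing text whose token count the generation loop already knew (model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder path

text = ""
for chunk in llm.create_completion("Q: Name the planets. A:", stream=True):
    text += chunk["choices"][0]["text"]

# Second pass over the full output just to recover the token count.
completion_tokens = len(llm.tokenize(text.encode("utf-8"), add_bos=False))
```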

Additional context
The OpenAI API recently added similar support to its streaming API via the stream_options key: https://platform.openai.com/docs/api-reference/chat/create#chat-create-stream_options
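With the OpenAI Python SDK that looks like the following; llama-cpp-python could mirror the same opt-in shape:

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o",  # any chat model
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.usage is not None:
        # Final chunk: choices is empty and usage is populated.
        print(chunk.usage.completion_tokens)
```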

zhudotexe · May 30 '24 18:05