
Include usage key in create_completion when streaming

Open zhudotexe opened this issue 1 year ago • 2 comments

Is your feature request related to a problem? Please describe.
Since create_completion may yield text chunks composed of multiple tokens per yield (e.g., in the case of multi-byte Unicode characters), the number of yields may not equal the number of tokens the model actually generated. To get accurate usage statistics for a streamed completion, one currently has to run the final text through the tokenizer again, even though create_completion already tracks the number of tokens generated by the model.
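A small illustration of the mismatch (a sketch only; the exact token split is model-dependent, and the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder path

# A multi-byte character such as an emoji is typically split across
# several tokens; the detokenizer buffers them and yields the decoded
# text as a single chunk, so counting yields undercounts the tokens.
tokens = llm.tokenize("🤔".encode("utf-8"), add_bos=False)
print(len(tokens))  # often > 1 for what arrives as one streamed chunk
```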

Describe the solution you'd like
When stream=True in create_completion, the final chunk yielded should include the usage statistics under the 'usage' key.
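A sketch of how this might look to a caller, assuming the final streamed chunk carries a 'usage' dict shaped like the one in non-streaming responses (this shape is an assumption about the proposed feature, not current behavior):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder path

usage = None
for chunk in llm.create_completion("Q: Name the planets. A:", stream=True):
    print(chunk["choices"][0]["text"], end="")
    # Proposed: the final chunk includes usage statistics directly,
    # so no second tokenizer pass is needed.
    usage = chunk.get("usage", usage)

print(usage)  # e.g. {"prompt_tokens": ..., "completion_tokens": ..., "total_tokens": ...}
```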

Describe alternatives you've considered

  • Saving the full generated text and running it through the tokenizer again (wasteful; see the sketch after this list)
  • Counting the number of yields and hoping no multi-byte characters appear (hacky and fragile)
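For reference, the first workaround looks roughly like this, re-encoding and re-tokenizing text whose token count the generation loop already knew (model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder path

text = ""
for chunk in llm.create_completion("Q: Name the planets. A:", stream=True):
    text += chunk["choices"][0]["text"]

# Second pass over the full output just to recover the token count.
completion_tokens = len(llm.tokenize(text.encode("utf-8"), add_bos=False))
```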

Additional context
The OpenAI API recently added similar support to its streaming API via the stream_options key: https://platform.openai.com/docs/api-reference/chat/create#chat-create-stream_options
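With the OpenAI Python SDK that looks like the following; llama-cpp-python could mirror the same opt-in shape:

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o",  # any chat model
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.usage is not None:
        # Final chunk: choices is empty and usage is populated.
        print(chunk.usage.completion_tokens)
```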

zhudotexe · May 30 '24 18:05