LocalAI
[SSE] The chat completion chunk object from the /chat/completions API endpoint contains a single character instead of a complete token
LocalAI version: https://github.com/mudler/LocalAI/releases/tag/v2.12.4
Environment, CPU architecture, OS, and Version:
CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
Describe the bug
The chat completion chunk objects returned by the /chat/completions API endpoint do not contain one token each. Instead, each token is split into multiple characters, and only a single character is returned per chunk object.
To Reproduce
Make a POST request against the /chat/completions API endpoint with the stream option enabled.
curl https://[...]/v1/chat/completions \
  --header "Content-Type: application/json" \
  --data '{
    "model": "openchat-3.5-0106.Q5_K_M.gguf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Write me a poem about love, death and robots."
      }
    ],
    "stream": true
  }'
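For anyone who wants to inspect the stream programmatically rather than with curl, here is a minimal Python sketch (it assumes the requests package and a placeholder base URL; adjust the URL and model name to your setup) that prints each delta of the SSE stream. With the bug present, every printed delta is a single character instead of a whole token.

import json
import requests

BASE_URL = "https://localhost:8080"  # placeholder, point this at your LocalAI instance

payload = {
    "model": "openchat-3.5-0106.Q5_K_M.gguf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a poem about love, death and robots."},
    ],
    "stream": True,
}

with requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, stream=True) as response:
    for line in response.iter_lines():
        # SSE lines look like "data: {...}"; skip empty keep-alive lines.
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(repr(delta))  # with the bug: 'I', 'n', ' ', 'a', ... instead of ' In', ' a', ...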
Expected behavior
The /chat/completions API endpoint should behave exactly like the OpenAI REST API and emit one token per data object:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" In"}}],"created":1713786276,"id":"chatcmpl-Qg7RMR4tBkKdFR2FxoHNduEWWuFSbqIg","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" a"}}],"created":1713786276,"id":"chatcmpl-Qg7RMR4tBkKdFR2FxoHNduEWWuFSbqIg","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" World"}}],"created":1713786276,"id":"chatcmpl-Qg7RMR4tBkKdFR2FxoHNduEWWuFSbqIg","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
[...]
data: [DONE]
Logs
Streaming output of LocalAI:
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"I"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"n"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":" "}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"a"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":" "}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"w"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"o"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"r"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"l"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"d"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[...]
data: [DONE]
Additional context
OpenAI Documentation: https://platform.openai.com/docs/api-reference/chat/streaming
Thank you for reporting this issue, countzero. From the information provided, it seems the issue lies in the LocalAI implementation of the /chat/completions API endpoint, as opposed to the OpenAI API itself. The chunk objects returned in the streaming response should contain one token per object, as you expected, but currently each token is split into multiple characters and only a single character is returned per object.
I will make sure to bring this issue to the attention of the developers on the LocalAI team. They will investigate this bug and work on a fix to ensure the API endpoint behaves as expected. In the meantime, if this is a critical issue for your use case, you may consider using the OpenAI API directly or another alternative until the issue is resolved in LocalAI.
We appreciate your help in keeping our projects robust and bug-free. If you have any further questions or concerns, please don't hesitate to reach out.
Best regards,
The Github Bot of @mudler (Experiment)
@mudler The primary use case is to get usage stats on the client side with the streaming API. I see two combinable changes that would greatly improve the streaming behaviour of the /chat/completions API endpoint:
Solution A - stream tokens instead of characters
Make the endpoint 100% compatible with the current OpenAI implementation by streaming tokens. This reduces the data overhead and at least enables counting generated tokens on the consumer side (see the sketch below).
The downside is that we still do not get a prompt token count: https://community.openai.com/t/openai-api-get-usage-tokens-in-response-when-set-stream-true/141866
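To illustrate what Solution A would enable, here is a rough Python sketch of client-side completion-token counting. It assumes Solution A is in place (exactly one token per streamed chunk); the function name is made up for illustration, and it takes any iterable of already-parsed "data:" JSON objects.

from typing import Iterable


def count_completion_tokens(chunks: Iterable[dict]) -> int:
    # Hypothetical helper: counts generated tokens by counting content-carrying
    # chunks. Only valid if each chunk holds exactly one token (Solution A).
    count = 0
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            count += 1
    return count

Prompt tokens still cannot be counted this way, which is why Solution B below is the more complete fix.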
Solution B - add usage object to last chunk
I looked into how llama.cpp implemented usage statistics in their OpenAI-compatible streaming API and quite liked their solution: they simply add a regular usage object to the last data chunk:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1713786803,"id":"chatcmpl-pmnNZxAQ2B1JYX9AZfJXCe0uwII0Aejx","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<|end_of_turn|>"}}],"created":1713786803,"id":"chatcmpl-pmnNZxAQ2B1JYX9AZfJXCe0uwII0Aejx","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1713786803,"id":"chatcmpl-pmnNZxAQ2B1JYX9AZfJXCe0uwII0Aejx","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk","usage":{"completion_tokens":18,"prompt_tokens":28,"total_tokens":46}}
I would love to see both changes in an upcoming version of LocalAI!