LocalAI
[SSE] The chat completion chunk object from the /chat/completions API endpoint contains a single character instead of a complete token
LocalAI version: https://github.com/mudler/LocalAI/releases/tag/v2.12.4
Environment, CPU architecture, OS, and Version:
CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
Describe the bug
The chat completion chunk objects returned by the /chat/completions API endpoint do not contain one token each. Instead, each token is split into multiple characters, and only a single character is returned per chunk object.
To Reproduce
Make a POST request against the /chat/completions API endpoint with the stream option enabled.
curl https://[...]/v1/chat/completions \
  --header "Content-Type: application/json" \
  --data '{
    "model": "openchat-3.5-0106.Q5_K_M.gguf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Write me a poem about love, death and robots."
      }
    ],
    "stream": true
  }'
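For anyone who wants to inspect the stream programmatically rather than with curl, here is a minimal Python sketch (it assumes the requests package and a placeholder base URL; adjust the URL and model name to your setup) that prints each delta of the SSE stream. With the bug present, every printed delta is a single character instead of a whole token.

import json
import requests

BASE_URL = "https://localhost:8080"  # placeholder, point this at your LocalAI instance

payload = {
    "model": "openchat-3.5-0106.Q5_K_M.gguf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a poem about love, death and robots."},
    ],
    "stream": True,
}

with requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, stream=True) as response:
    for line in response.iter_lines():
        # SSE lines look like "data: {...}"; skip empty keep-alive lines.
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(repr(delta))  # with the bug: 'I', 'n', ' ', 'a', ... instead of ' In', ' a', ...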
Expected behavior
The /chat/completions API endpoint should behave exactly like the OpenAI REST API and emit one token per data object:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" In"}}],"created":1713786276,"id":"chatcmpl-Qg7RMR4tBkKdFR2FxoHNduEWWuFSbqIg","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" a"}}],"created":1713786276,"id":"chatcmpl-Qg7RMR4tBkKdFR2FxoHNduEWWuFSbqIg","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" World"}}],"created":1713786276,"id":"chatcmpl-Qg7RMR4tBkKdFR2FxoHNduEWWuFSbqIg","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
[...]
data: [DONE]
Logs
Streaming output of LocalAI:
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"I"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"n"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":" "}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"a"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":" "}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"w"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"o"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"r"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"l"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
data: {"created":1713774071,"object":"chat.completion.chunk","id":"4dea3a0f-710d-45c0-912e-f20c60967145","model":"openchat-3.5-0106.Q5_K_M.gguf","choices":[{"index":0,"finish_reason":"","delta":{"content":"d"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[...]
data: [DONE]
Additional context
OpenAI Documentation: https://platform.openai.com/docs/api-reference/chat/streaming
Thank you for reporting this issue, countzero. From the information provided, it seems the issue lies in the LocalAI implementation of the /chat/completions API endpoint, as opposed to the OpenAI API itself. The chunk objects returned in the streaming response should contain one token per object, as you expected, but currently each token is split into multiple characters and only a single character is returned per object.
I will make sure to bring this issue to the attention of the developers on the LocalAI team. They will investigate this bug and work on a fix to ensure the API endpoint behaves as expected. In the meantime, if this is a critical issue for your use case, you may consider using the OpenAI API directly or another alternative until the issue is resolved in LocalAI.
We appreciate your help in keeping our projects robust and bug-free. If you have any further questions or concerns, please don't hesitate to reach out.
Best regards,
The Github Bot of @mudler (Experiment)
@mudler The primary use case is to get usage stats on the client side with the streaming API. I see two combinable changes that would greatly improve the streaming behaviour of the /chat/completions API endpoint:
Solution A - stream tokens instead of characters
Make the endpoint 100% compatible with the current OpenAI implementation by streaming tokens. This reduces the data overhead and at least enables counting generated tokens on the consumer side (see the sketch below).
The downside is that we still do not get a prompt token count: https://community.openai.com/t/openai-api-get-usage-tokens-in-response-when-set-stream-true/141866
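To illustrate what Solution A would enable, here is a rough Python sketch of client-side completion-token counting. It assumes Solution A is in place (exactly one token per streamed chunk); the function name is made up for illustration, and it takes any iterable of already-parsed "data:" JSON objects.

from typing import Iterable


def count_completion_tokens(chunks: Iterable[dict]) -> int:
    # Hypothetical helper: counts generated tokens by counting content-carrying
    # chunks. Only valid if each chunk holds exactly one token (Solution A).
    count = 0
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            count += 1
    return count

Prompt tokens still cannot be counted this way, which is why Solution B below is the more complete fix.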
Solution B - add usage object to last chunk
I looked into how llama.cpp implemented usage statistics in their OpenAI-compatible streaming API and quite liked their solution: they simply add a regular usage object to the last data chunk:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1713786803,"id":"chatcmpl-pmnNZxAQ2B1JYX9AZfJXCe0uwII0Aejx","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"<|end_of_turn|>"}}],"created":1713786803,"id":"chatcmpl-pmnNZxAQ2B1JYX9AZfJXCe0uwII0Aejx","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk"}
data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1713786803,"id":"chatcmpl-pmnNZxAQ2B1JYX9AZfJXCe0uwII0Aejx","model":"openchat-3.5-0106.Q5_K_M.gguf","object":"chat.completion.chunk","usage":{"completion_tokens":18,"prompt_tokens":28,"total_tokens":46}}
I would love to see both changes in an upcoming version of LocalAI!