text-generation-inference
HF web service streaming response differs from OpenAI, breaking clients
System Info
Attempting to reuse an existing OpenAI client to stream responses from an HF endpoint doesn't work because of a couple of differences. In my case the differences break the .NET client in the Azure AI SDK, though I suspect they affect other clients too.
Differences found:
- When streaming response tokens, OpenAI terminates the stream with a final `[DONE]` string, while HF simply stops sending tokens. Clients expecting `[DONE]` get stuck waiting either for another token or for the termination string (see the sketch after this list).
- OpenAI supports `0.0 <= top_p <= 1.0`, while HF supports only `0.0 < top_p < 1.0`.
- When sending `top_p = 0` to the HF endpoint, the service replies `200 OK` with the error `` {"error":"Input validation error: `top_p` must be > 0.0 and < 1.0","error_type":"validation"} `` and no final `[DONE]`. Given the success status code and the lack of a termination string, the error is parsed as data and causes the client to hang while it waits for the next token.
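For illustration, here is a minimal Python sketch of the consumption pattern most OpenAI-compatible clients use (this is my own sketch, not the Azure AI SDK code; `stream_chat` and the use of `requests` are illustrative assumptions): read SSE `data:` lines until the `[DONE]` sentinel. Against the HF endpoint the sentinel never arrives, so a client that insists on it keeps waiting on the socket instead of returning.

```python
import json
import requests

def stream_chat(url: str, api_key: str, payload: dict):
    """Yield parsed chunks from an OpenAI-style SSE stream."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    with requests.post(url, headers=headers, json=payload, stream=True) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data:"):
                continue  # skip SSE keep-alive blank lines and non-data fields
            data = line[len("data:"):].strip()
            if data == "[DONE]":  # OpenAI's explicit terminator
                return            # clean end of stream
            yield json.loads(data)
    # If the server never sends "[DONE]" (the HF behavior reported here),
    # clients expecting the sentinel have no way to tell a finished stream
    # from a stalled one.
```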
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
Example 1: error with `top_p = 0`
Request:
curl -v -X POST https://api-inference.huggingface.co/v1/chat/completions \
-H "Authorization: Bearer ${HF_KEY}" \
-H "Content-Type: application/json" \
-d '{"messages":[{"content":"how much is 1+1","role":"system"}],
"max_tokens":50,
"temperature":0,
"top_p":0.0,
"presence_penalty":0,
"frequency_penalty":0,
"stream":true,
"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"}'
Response:
< HTTP/2 200
< date: Tue, 14 May 2024 22:33:46 GMT
< content-type: text/event-stream
< x-compute-type: 2-a100
< x-request-id: ...
< cache-control: no-cache
< access-control-allow-credentials: true
< vary: origin, Origin, Access-Control-Request-Method, Access-Control-Request-Headers
< x-accel-buffering: no
< access-control-allow-origin: *
< x-compute-characters: 67
< x-sha: ...
<
data:{"error":"Input validation error: `top_p` must be > 0.0 and < 1.0","error_type":"validation"}
OpenAI, by contrast, accepts `top_p = 0` and returns a normal streamed response (see Example 2).
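Because the status code is 200, a client cannot rely on HTTP status to detect this failure. One defensive option, sketched below (the chunk shape is taken from the response above; `parse_chunk` is a hypothetical helper, not part of any official API), is to check each decoded chunk for an `error` field before treating it as a completion delta:

```python
import json

def parse_chunk(data: str) -> dict:
    """Decode one SSE `data:` payload; raise on HF's in-band validation errors."""
    chunk = json.loads(data)
    if "error" in chunk:
        # The HTTP status was 200 OK, so the error can only be detected here.
        raise RuntimeError(f"{chunk.get('error_type', 'error')}: {chunk['error']}")
    return chunk
```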
Example 2: OpenAI response includes `[DONE]`
Request:
curl -X POST https://api.openai.com/v1/chat/completions \
-H "Authorization: Bearer ${OPENAI_KEY}" \
-H "Content-Type: application/json" \
-d '{"messages":[{"content":"how much is 1+1","role":"system"}],
"max_tokens":5,
"temperature":0,
"top_p":0,
"presence_penalty":0,
"frequency_penalty":0,
"stream":true,
"model":"gpt-3.5-turbo"}'
Response:
data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"1"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" +"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"1"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" equals"},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"length"}]}
data: [DONE]
Example 3: HF response is missing `[DONE]`
Request:
curl -v -X POST https://api-inference.huggingface.co/v1/chat/completions \
-H "Authorization: Bearer ${HF_KEY}" \
-H "Content-Type: application/json" \
-d '{"messages":[{"content":"how much is 1+1","role":"system"}],
"max_tokens":5,
"temperature":0,
"top_p":0.01,
"presence_penalty":0,
"frequency_penalty":0,
"stream":true,
"model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"}'
Response:
data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"logprobs":null,"finish_reason":null}]}
data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" result"},"logprobs":null,"finish_reason":null}]}
data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" of"},"logprobs":null,"finish_reason":null}]}
data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" the"},"logprobs":null,"finish_reason":null}]}
data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" mathematical"},"logprobs":null,"finish_reason":"length"}]}
Expected behavior
It would be great if OpenAI clients (and apps built on them) could be reused simply by pointing them at https://api-inference.huggingface.co.
While it's possible to work around the different `top_p` range by changing the code (if the app allows for it, as sketched below), the lack of a termination string makes these clients impossible to use.
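For the range mismatch specifically, a small shim like the following keeps requests inside HF's open interval (the epsilon value is an arbitrary choice of mine, not a documented limit):

```python
EPS = 1e-3  # arbitrary margin; HF only requires 0.0 < top_p < 1.0

def clamp_top_p(top_p: float) -> float:
    """Map OpenAI's closed [0, 1] top_p range into HF's open (0, 1) range."""
    return min(max(top_p, EPS), 1.0 - EPS)

# e.g. clamp_top_p(0.0) == 0.001 and clamp_top_p(1.0) == 0.999
```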