
HF web service streaming response differs from OpenAI, breaking clients


System Info

Attempting to reuse an existing OpenAI client to stream responses from an HF endpoint doesn't work, due to a couple of differences between the two APIs. In my case the differences break the .NET client in the Azure AI SDK, though I suspect they affect other clients too.

Differences found:

  1. When streaming response tokens, OpenAI terminates the stream with a final [DONE] string, while HF simply stops sending tokens. Clients expecting [DONE] get stuck waiting either for another token or for the termination string.
  2. OpenAI accepts `0.0 <= top_p <= 1.0`, while HF accepts only `0.0 < top_p < 1.0`.
  3. When sending top_p = 0 to the HF endpoint, the service replies 200 OK with an error payload {"error":"Input validation error: `top_p` must be > 0.0 and < 1.0","error_type":"validation"} and no final [DONE]. Given the 200 status code and the missing termination string, the error is parsed as data and the client hangs, waiting for the next token.
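
Taken together, a client has to tolerate all three differences at once to work against both backends. A minimal Python sketch of such a tolerant SSE reader (using requests; the helper name and the in-band error handling are illustrative assumptions, not part of either SDK):

import json
import requests

def stream_chat_events(url, headers, payload):
    """Yield chat-completion chunks from an OpenAI- or TGI-style SSE stream.

    Hypothetical helper for illustration: it accepts both OpenAI's explicit
    `data: [DONE]` terminator and TGI's bare end-of-stream, and surfaces
    TGI validation errors that arrive as a 200 OK SSE event.
    """
    with requests.post(url, headers=headers, json=payload, stream=True) as resp:
        for raw in resp.iter_lines():
            if not raw:
                continue  # SSE events are separated by blank lines
            line = raw.decode("utf-8")
            if not line.startswith("data:"):
                continue
            # TGI emits "data:{...}" with no space, OpenAI "data: {...}"
            data = line[len("data:"):].strip()
            if data == "[DONE]":
                return  # difference 1: only OpenAI sends this terminator
            event = json.loads(data)
            if "error" in event:
                # difference 3: TGI reports validation errors in-band
                raise RuntimeError(event["error"])
            yield event
        # difference 1 again: a TGI stream simply ends here without [DONE]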

Information

  • [ ] Docker
  • [ ] The CLI directly

Tasks

  • [ ] An officially supported command
  • [ ] My own modifications

Reproduction

Example 1: error with top_p = 0

Request:

curl -v -X POST https://api-inference.huggingface.co/v1/chat/completions \
    -H "Authorization: Bearer ${HF_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":50,
      "temperature":0,
      "top_p":0.0,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"}'

Response:

< HTTP/2 200
< date: Tue, 14 May 2024 22:33:46 GMT
< content-type: text/event-stream
< x-compute-type: 2-a100
< x-request-id: ...
< cache-control: no-cache
< access-control-allow-credentials: true
< vary: origin, Origin, Access-Control-Request-Method, Access-Control-Request-Headers
< x-accel-buffering: no
< access-control-allow-origin: *
< x-compute-characters: 67
< x-sha: ...
<
data:{"error":"Input validation error: `top_p` must be > 0.0 and < 1.0","error_type":"validation"}

OpenAI returns a normal response instead (see Example 2).
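
For completeness, the same failure can be reproduced from Python by pointing the official openai client at the HF endpoint (a hypothetical equivalent of the .NET repro; the exact failure mode depends on the client version):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1",
    api_key=os.environ["HF_KEY"],
)

stream = client.chat.completions.create(
    model="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    messages=[{"role": "system", "content": "how much is 1+1"}],
    max_tokens=50,
    temperature=0,
    top_p=0.0,
    stream=True,
)

# The server answers 200 OK with the error event shown above instead of
# completion chunks, so the iteration below never sees a [DONE]-style
# termination; the client hangs or fails to parse the error as a chunk.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")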

Example 2: OpenAI response includes `[DONE]`

Request:

curl -X POST https://api.openai.com/v1/chat/completions \
    -H "Authorization: Bearer ${OPENAI_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":5,
      "temperature":0,
      "top_p":0,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"gpt-3.5-turbo"}'

Response:

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"1"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" +"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" "},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":"1"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{"content":" equals"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-9Ov2YsBUxDJmdENVKFxdZfLMIEJCt","object":"chat.completion.chunk","created":1715726162,"model":"gpt-3.5-turbo-0125","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"length"}]}

data: [DONE]

Example 3: HF response is missing `[DONE]`

Request:

curl -v -X POST https://api-inference.huggingface.co/v1/chat/completions \
    -H "Authorization: Bearer ${HF_KEY}" \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"content":"how much is 1+1","role":"system"}],
      "max_tokens":5,
      "temperature":0,
      "top_p":0.01,
      "presence_penalty":0,
      "frequency_penalty":0,
      "stream":true,
      "model":"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO"}'

Response:

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" result"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" of"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" the"},"logprobs":null,"finish_reason":null}]}

data:{"id":"","object":"text_completion","created":1715726245,"model":"text-generation-inference/Nous-Hermes-2-Mixtral-8x7B-DPO-medusa","system_fingerprint":"2.0.2-sha-dccab72","choices":[{"index":0,"delta":{"role":"assistant","content":" mathematical"},"logprobs":null,"finish_reason":"length"}]}

Expected behavior

It would be great if it were possible to reuse OpenAI clients (and apps built on those clients) simply by pointing them at https://api-inference.huggingface.co.

While it's possible to work around the different top_p range by changing application code (where apps allow for it), the missing termination string makes these clients unusable as-is.
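
For the top_p range, a small client-side shim is conceivable. A sketch (the helper name and eps margin are my own, not part of any SDK):

def clamp_top_p(top_p: float, eps: float = 1e-3) -> float:
    """Clamp OpenAI's inclusive [0.0, 1.0] top_p into TGI's exclusive (0.0, 1.0).

    Hypothetical shim: eps is an arbitrary margin, so the clamped value is
    close to, but not exactly, the originally requested sampling behavior.
    """
    return min(max(top_p, eps), 1.0 - eps)

There is no equivalent request-side tweak for the missing [DONE] terminator; that would require either patching each client's streaming code or a server-side change.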

dluc · May 14 '24 22:05