
[enh] add '\n' for non-streaming case to keep connection alive

Open · Kevin-XiongC opened this pull request 10 months ago · 5 comments

Motivation

According to DeepSeek's official documentation, for non-streaming requests the API continuously returns empty lines to keep the connection alive, improving the user experience with reasoning models.

Modifications

This pull request changes the output path so that non-streaming requests are served over a streaming response, aligning with the behavior of the official DeepSeek API. It uses asyncio.wait_for to monitor the tokenizer task and yields a newline character (`\n`) every 30 seconds until the engine finishes.

Checklist

  • [x] Format your code according to the Code Formatting with Pre-Commit.
  • [ ] Add unit tests as outlined in the Running Unit Tests.
  • [x] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [x] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [x] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

Kevin-XiongC commented Feb 19 '25 08:02

Official DeepSeek API: [image]
SGLang after this PR: [image]
Both return extra `\n`s to keep the connection alive.


from openai import OpenAI

# Point the OpenAI client at the local SGLang server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:30000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id

messages = [{"role": "user", "content": "Please prove Euler's theorem in graph theory as a professional mathematics teacher."}]
response = client.chat.completions.create(model=model, messages=messages, stream=False)
print("content: ", response.choices[0].message.content)

As the figure below shows, using the OpenAI SDK still works despite the extra `\n`s. [image]
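One reason the extra newlines are harmless to clients: leading newlines are valid JSON whitespace, so parsers skip them before reading the response body. A quick illustration with an invented payload string (not an actual server response):

```python
import json

# Heartbeat newlines prepended to the real JSON body, as the server might send.
payload = "\n\n\n" + '{"choices": [{"message": {"content": "Euler theorem proof..."}}]}'

# json.loads ignores leading whitespace, including '\n', per the JSON spec.
data = json.loads(payload)
print(data["choices"][0]["message"]["content"])
```

This is why an SDK that ultimately hands the accumulated body to a JSON parser keeps working unchanged.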

Kevin-XiongC commented Feb 19 '25 10:02

I am not sure I catch your point: how will an additional `\n` enhance behavior? My concern is that since the adapter is for all LLMs and this change is specific to DeepSeek-R1, it may cause other LLMs to behave abnormally.

FrankLeeeee commented Feb 21 '25 02:02

I am not sure I catch your point: how will an additional `\n` enhance behavior? My concern is that since the adapter is for all LLMs and this change is specific to DeepSeek-R1, it may cause other LLMs to behave abnormally.

The improved user experience comes from keeping connections alive (keep-alive) in non-streaming scenarios, which benefits reasoning models and other large models handling heavy workloads. Without this mechanism, users frequently hit connection failures, as we have observed in our actual serving scenarios.

Kevin-XiongC commented Feb 21 '25 03:02

I see. Let's say we are serving a relatively much smaller model such as Qwen-7B: would this kind of mechanism still be suitable? It would be better if we could expose this feature as an argument in ServerArgs, for example --keep-non-streaming-connection-alive.

FrankLeeeee commented Feb 21 '25 03:02

@zhyncs do you have any suggestion?

FrankLeeeee commented Feb 21 '25 03:02

I see. Let's say we are serving a relatively much smaller model such as Qwen-7B: would this kind of mechanism still be suitable? It would be better if we could expose this feature as an argument in ServerArgs, for example --keep-non-streaming-connection-alive.

Small models should finish generating within the timeout period, so the user experience for those running smaller models would not be adversely affected.
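The opt-in flag FrankLeeeee suggested could look something like the sketch below; the field names are hypothetical and not part of sglang's actual ServerArgs:

```python
from dataclasses import dataclass

@dataclass
class ServerArgs:
    # Hypothetical opt-in for emitting '\n' heartbeats on non-streaming
    # requests; not a real sglang option.
    keep_non_streaming_connection_alive: bool = False
    # Hypothetical companion setting: seconds between heartbeats.
    keepalive_interval: float = 30.0

# Operator explicitly enables keep-alive for heavy reasoning workloads.
args = ServerArgs(keep_non_streaming_connection_alive=True)
```

With a flag like this, the heartbeat loop would run only when explicitly enabled, leaving small-model deployments unaffected by default.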

Kevin-XiongC commented Feb 24 '25 06:02

This pull request has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] commented May 30 '25 08:05