[enh] add '\n' for non-streaming case to keep connection alive
## Motivation

According to DeepSeek's official documentation, for non-streaming requests the API continuously returns empty lines to keep the connection alive, which improves the user experience with reasoning models.
## Modifications

This pull request streams the response even for non-streaming requests, aligning with the behavior of the official DeepSeek API. It uses `asyncio.wait_for` to monitor the tokenizer task and yields a newline character (`\n`) every 30 seconds until the engine finishes. A sketch of the mechanism is shown below.
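A minimal sketch of the mechanism (the function and constant names here are illustrative, not the exact identifiers in the adapter code):

```python
import asyncio
import json
from typing import AsyncIterator

HEARTBEAT_INTERVAL_S = 30  # per the PR description; the real constant may differ


async def non_streaming_with_keepalive(engine_coro) -> AsyncIterator[bytes]:
    """Yield newline heartbeats while the engine is still generating,
    then yield the complete (non-streamed) response as a single chunk."""
    task = asyncio.ensure_future(engine_coro)
    while True:
        try:
            # shield() makes the timeout cancel only the wait, not the task.
            result = await asyncio.wait_for(
                asyncio.shield(task), timeout=HEARTBEAT_INTERVAL_S
            )
            break
        except asyncio.TimeoutError:
            # Engine still busy: emit a newline so the HTTP connection
            # (and any proxies in between) does not time out.
            yield b"\n"
    yield json.dumps(result).encode()  # final payload after the heartbeats
```

This is transparent on the client side: JSON parsers skip leading whitespace, so the extra newlines do not break response parsing.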
## Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [ ] Add unit tests as outlined in the Running Unit Tests.
- [x] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [ ] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [x] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [x] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
Screenshots: the official DeepSeek API and SGLang after this PR both return extra `\n`s to keep the connection alive.
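To observe the raw heartbeats directly, a quick probe with `httpx` works (the model name and prompt here are placeholders):

```python
import httpx

payload = {
    "model": "default",  # placeholder; use the model id served by SGLang
    "messages": [
        {"role": "user", "content": "Prove Euler's theorem in graph theory."}
    ],
    "stream": False,
}
# Read the raw response body: heartbeats arrive as b'\n' chunks
# before the final JSON payload.
with httpx.stream(
    "POST", "http://localhost:30000/v1/chat/completions", json=payload, timeout=None
) as r:
    for chunk in r.iter_raw():
        print(repr(chunk))
```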
```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:30000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

messages = [
    {
        "role": "user",
        "content": "Please prove Euler's theorem in graph theory as a professional mathematics teacher.",
    }
]
# Non-streaming request: the server emits '\n' heartbeats while generating.
response = client.chat.completions.create(model=model, messages=messages, stream=False)
print("content: ", response.choices[0].message.content)
```
As the screenshot below shows, calling the server through the OpenAI SDK still works despite the extra `\n`s.
I am not sure I catch your point: how will the additional `\n`s enhance behavior? My concern is that since `adapter` is shared by all LLMs and this change is specific to DeepSeek-R1, it may cause other LLMs to behave abnormally.
The enhanced user experience comes from maintaining persistent (keep-alive) connections in non-streaming scenarios, which benefits reasoning models and other large models handling heavy workloads. Without this mechanism, users frequently encounter connection failures, as evidenced by our actual serving scenarios.
I see. Let's say we are serving a much smaller model such as Qwen-7B; will this kind of mechanism still be suitable? It would be better if we could make this feature an argument in `ServerArgs`, for example `--keep-non-streaming-connection-alive`, along the lines of the sketch below.
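To make that concrete, a rough sketch (assuming SGLang's dataclass-style `ServerArgs`; the field name and default are hypothetical):

```python
import argparse
from dataclasses import dataclass


@dataclass
class ServerArgs:
    # ... existing fields ...
    # Hypothetical flag: off by default so smaller models are unaffected.
    keep_non_streaming_connection_alive: bool = False

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser):
        parser.add_argument(
            "--keep-non-streaming-connection-alive",
            action="store_true",
            help="Emit '\\n' heartbeats on non-streaming requests to keep "
            "the connection alive (useful for long-running reasoning models).",
        )
```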
@zhyncs do you have any suggestion?
Small models should finish generating within the timeout period, so this mechanism does not adversely affect the user experience for those serving smaller models.
This pull request has been automatically closed due to inactivity. Please feel free to reopen it if needed.