"WARNING: Invalid HTTP request received" and latency SGLANG vs VLLM
Hi team,
I am using sglang with a locally fine-tuned model (base model: cognitivecomputations/dolphin-2.2.1-mistral-7b) and running inference in a for loop.
- GPU: RTX 4090
- batch size: 1
- input tokens: ~2000
- output tokens: ~200
runtime = load_model(model_id)
for p in tqdm(prompts):
    resp = inference_sglang(p)
runtime.shutdown()
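For reference, here is a minimal sketch of the same loop written directly against sglang's frontend API, following the README usage example. The prompt template, max_tokens value, and names are assumptions, not the author's actual load_model / inference_sglang helpers.

import sglang as sgl
from tqdm import tqdm

@sgl.function
def answer(s, instruction):
    # single-instruction Mistral-style prompt (assumed template)
    s += "[INST]" + instruction + "[/INST]"
    s += sgl.gen("response", max_tokens=200)

runtime = sgl.Runtime(model_path="cognitivecomputations/dolphin-2.2.1-mistral-7b")
sgl.set_default_backend(runtime)

prompts = ["..."]  # placeholder for the author's prompt list
for p in tqdm(prompts):
    state = answer.run(instruction=p)
    resp = state["response"]

runtime.shutdown()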
Once the model is loaded I keep getting:
WARNING: Invalid HTTP request received.
The warning repeats until the code reaches runtime.shutdown().
- Why am I getting this warning?
- Does it affect inference time?
I ran the same prompts with vllm and the inference times are very similar for sglang and vllm. My prompts are single instructions (no multi-shot prompting as in your code examples):
<s><[INST]{my_instruction}[/INST]
Is this a case where sglang should show better performance than vllm?
Thank you
- Can you share your code for sending the HTTP request? Can you correctly run this example? https://github.com/sgl-project/sglang?tab=readme-ov-file#usage. The warning is unexpected. (A sketch of a well-formed request is shown right after this list.)
- sglang outperforms vllm because it automatically reuses the KV cache of shared prefixes. In your case, the shared prefix in your prompts appears to be very short, so the sglang runtime and vllm will show very similar performance. If you later add long system prompts or multi-shot examples, sglang can outperform vllm in first-token latency and throughput (see the shared-prefix sketch below).
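For comparison, a well-formed request to the sglang server's /generate endpoint looks roughly like this. This is a sketch based on the README usage example; the port, prompt, and sampling parameters are assumptions.

import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "[INST]Summarize the text below.[/INST]",
        "sampling_params": {"max_new_tokens": 200, "temperature": 0},
    },
)
print(resp.json()["text"])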
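As an illustration of the shared-prefix case (a sketch, not from this thread; the system prompt and questions are placeholders): every call below shares the same long system prompt, so its KV cache can be computed once and reused across requests.

import sglang as sgl

LONG_SYSTEM_PROMPT = "You are a meticulous assistant. " * 200  # shared prefix

@sgl.function
def answer_with_context(s, question):
    s += LONG_SYSTEM_PROMPT    # identical across requests -> reusable KV cache
    s += question              # only this part differs per request
    s += sgl.gen("answer", max_tokens=200)

# run_batch submits the requests together, letting the runtime reuse the cached prefix
states = answer_with_context.run_batch([{"question": q} for q in ["Q1?", "Q2?", "Q3?"]])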
I tried to compare vllm and sglang, but my inference times are also similar, even with a long (~1000 token) system prompt. Could you give an example that clearly shows sglang outperforming vllm? @merrymercy
@cyLi-Tiger You can try the MMLU benchmark https://github.com/sgl-project/sglang/tree/main/benchmark/mmlu or the multi-turn chat benchmark https://github.com/sgl-project/sglang/tree/main/benchmark/multi_turn_chat
Thank you! Those benchmarks helped a lot. But I found an error when running multi_turn_chat after changing args.turns to 6 in bench_sglang.py. Here is the log:
Traceback (most recent call last):
  File "/root/anaconda3/envs/sglang/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/sglang/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/workspace/sglang-main/python/sglang/srt/managers/detokenizer_manager.py", line 86, in start_detokenizer_process
    loop.run_until_complete(manager.handle_loop())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/mnt/workspace/sglang-main/python/sglang/srt/managers/detokenizer_manager.py", line 41, in handle_loop
    output_strs = self.tokenizer.batch_decode(
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3716, in batch_decode
    return [
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3717, in <listcomp>
    self.decode(
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3756, in decode
    return self._decode(
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat/tokenization_qwen.py", line 245, in _decode
    token_ids = [i for i in token_ids if i < self.eod_id]
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat/tokenization_qwen.py", line 245, in <listcomp>
    token_ids = [i for i in token_ids if i < self.eod_id]
TypeError: '<' not supported between instances of 'NoneType' and 'int'
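For context, the TypeError comes from the filtering comparison in the Qwen tokenizer's _decode: a None ends up in token_ids, and None < int is not allowed in Python 3. A minimal illustration with hypothetical ids (not sglang code):

token_ids = [151643, None]   # a None id slipping into the batch
eod_id = 151645              # hypothetical Qwen eod id
filtered = [i for i in token_ids if i < eod_id]
# TypeError: '<' not supported between instances of 'NoneType' and 'int'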
These scripts are only tested with Llama and Mixtral. For Qwen, we probably need some small fixes. @cyLi-Tiger Could you share your commands? @hnyls2002 Could you take a look at the errors?
My command is python3 bench_sglang.py --tokenizer my_local_qwen_model_path, and I just added args.turns = 6 before the main function.
@cyLi-Tiger It works on my setup. Please try the latest main branch.
Commands
Launch server
python -m sglang.launch_server --model-path Qwen/Qwen-14B-Chat --port 30000 --trust --tp 2
Run benchmark
python3 bench_sglang.py --tokenizer Qwen/Qwen-7B-Chat --trust --turn 6
For new issues, please open a new issue.
Results
On my setup, Qwen-14B-Chat + 2 x A10G (24GB), sglang is 2x faster than vllm.