
"WARNING: Invalid HTTP request received" and latency SGLANG vs VLLM

Open jmlb opened this issue 1 year ago • 6 comments

Hi team, I am using sglang with a locally fine-tuned model (base_model_id = cognitivecomputations/dolphin-2.2.1-mistral-7b) and running inference in a for loop. GPU: 4090, batch_sz = 1, tokens_in ~ 2000, tokens_out ~ 200.

from tqdm import tqdm

runtime = load_model(model_id)       # my helper that starts the sglang runtime
for p in tqdm(prompts):
    resp = inference_sglang(p)       # my helper that sends one prompt to the runtime

runtime.shutdown()

Once the model is loaded I keep getting WARNING: Invalid HTTP request received, which repeats until the code reaches the line runtime.shutdown().

  1. Why am I getting this warning?
  2. Does it affect inference time? I ran the same prompts with vllm and the inference times are very similar for sglang and vllm. My prompts are single-instruction (no multi-shot prompting as in your code examples): <s><[INST]{my_instruction}[/INST]. Is that a case where sglang should show better performance than vllm?

Thank you

jmlb avatar Jan 20 '24 18:01 jmlb

  1. Can you share your code for sending the HTTP request? Can you correctly run this example: https://github.com/sgl-project/sglang?tab=readme-ov-file#usage? The warning is unexpected.
  2. sglang outperforms vllm because it can automatically reuse the KV cache of shared prefixes. In your case, the shared prefix across your prompts seems very short, so the sglang runtime and vllm will show very similar performance. If you later add a long system prompt or multi-shot examples, sglang can outperform vllm in first-token latency and throughput (see the sketch below the list).
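As a concrete illustration of what "shared prefix" means here, the following is a minimal sketch using the sglang frontend, not code from this thread: the system prompt, questions, port, and max_tokens are placeholder assumptions, and it assumes a local sglang server is already running.

import sglang as sgl

# Assumption: a local server was started with something like
#   python -m sglang.launch_server --model-path <model> --port 30000
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Placeholder long system prompt, identical for every request.
LONG_SYSTEM_PROMPT = "You are a careful assistant. " * 200

@sgl.function
def answer(s, question):
    s += sgl.system(LONG_SYSTEM_PROMPT)                    # identical prefix across calls
    s += sgl.user(question)                                # only this part varies
    s += sgl.assistant(sgl.gen("answer", max_tokens=200))  # generation step

questions = ["What is a KV cache?", "Why does prefix sharing reduce latency?"]
states = answer.run_batch([{"question": q} for q in questions])
for st in states:
    print(st["answer"])

Because the system prompt is identical across requests, the runtime's prefix cache can skip re-prefilling it; with a unique single-instruction prompt like yours, there is nothing to reuse, so the two engines end up doing roughly the same work.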

merrymercy avatar Jan 21 '24 10:01 merrymercy


I tried to compare vllm and sglang, but my inference times are also similar, even with a long (~1000) system prompt. Could you give an example that easily shows sglang outperforming vllm? @merrymercy

cyLi-Tiger avatar Jan 25 '24 08:01 cyLi-Tiger

@cyLi-Tiger You can try this MMLU https://github.com/sgl-project/sglang/tree/main/benchmark/mmlu, or multi-turn chat https://github.com/sgl-project/sglang/tree/main/benchmark/multi_turn_chat
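For intuition on why the multi-turn chat workload is a good showcase, here is a rough sketch of the access pattern it exercises; this is my own illustration with made-up turns and an assumed local endpoint, not the benchmark script itself.

import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))  # assumed endpoint

@sgl.function
def chat(s, turns):
    # Each gen call extends the same conversation, so turn N only needs to
    # prefill the new user message plus the previous reply; all earlier turns
    # form a shared prefix that can be served from the KV/radix cache.
    for i, user_msg in enumerate(turns):
        s += sgl.user(user_msg)
        s += sgl.assistant(sgl.gen(f"reply_{i}", max_tokens=128))

state = chat.run(turns=["Hi!", "Explain KV caches.", "Why does prefix reuse cut latency?"])
print(state["reply_2"])

To the extent the baseline re-prefills the whole conversation on every turn while the cached engine only prefills the new tokens, the gap grows with the number of turns, which is why raising args.turns makes the difference more visible.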

merrymercy avatar Jan 25 '24 10:01 merrymercy

Thank you! Those benchmarks helped a lot. But I hit an error when I ran multi_turn_chat after changing args.turns to 6 in bench_sglang.py. Here is the log:

Traceback (most recent call last):
  File "/root/anaconda3/envs/sglang/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/sglang/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/workspace/sglang-main/python/sglang/srt/managers/detokenizer_manager.py", line 86, in start_detokenizer_process
    loop.run_until_complete(manager.handle_loop())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/mnt/workspace/sglang-main/python/sglang/srt/managers/detokenizer_manager.py", line 41, in handle_loop
    output_strs = self.tokenizer.batch_decode(
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3716, in batch_decode
    return [
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3717, in <listcomp>
    self.decode(
  File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3756, in decode
    return self._decode(
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat/tokenization_qwen.py", line 245, in _decode
    token_ids = [i for i in token_ids if i < self.eod_id]
  File "/root/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat/tokenization_qwen.py", line 245, in <listcomp>
    token_ids = [i for i in token_ids if i < self.eod_id]
TypeError: '<' not supported between instances of 'NoneType' and 'int'
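Reading the last two frames: the comparison that fails is i < self.eod_id inside Qwen's custom tokenizer, and the operand order in the message ('NoneType' and 'int') suggests a None token id reached decode. A minimal standalone reproduction of that failure mode (illustrative only, not sglang or Qwen code; the ids are made up):

token_ids = [151643, None, 108386]   # a None has slipped into the ids passed to decode
eod_id = 151643                      # Qwen's end-of-document id
filtered = [i for i in token_ids if i < eod_id]
# TypeError: '<' not supported between instances of 'NoneType' and 'int'

So the fix would likely be on whatever builds the token id list fed to the detokenizer, rather than in the comparison itself.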

cyLi-Tiger avatar Jan 26 '24 03:01 cyLi-Tiger

These scripts are only tested with llama and mixtral. For Qwen, we probably need to make some small fixes. @cyLi-Tiger Could you share your commands? @hnyls2002 Could you take a look at the errors?

merrymercy avatar Jan 26 '24 06:01 merrymercy

My command is python3 bench_sglang.py --tokenizer my_local_qwen_model_path, and I just added args.turns = 6 before the main function.

cyLi-Tiger avatar Jan 29 '24 03:01 cyLi-Tiger

@cyLi-Tiger It works on my setup. Please try the latest main branch.

Commands

Launch server

python -m sglang.launch_server --model-path Qwen/Qwen-14B-Chat --port 30000 --trust --tp 2

Run benchmark

python3 bench_sglang.py --tokenizer Qwen/Qwen-7B-Chat --trust --turn 6

If you run into new issues, please open a new issue.

Results

On my setup, Qwen-14B-Chat + 2 x A10G (24GB), sglang is 2x faster than vllm. [benchmark results screenshot]

merrymercy avatar Jan 30 '24 15:01 merrymercy