[Bug] Unable to make model output deterministic
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
The performance of sglang is very good. I am comparing the output accuracy of vllm, Hugging Face, and sglang. Using Qwen's model, I set do_sample to False or temperature to 0 to make the output deterministic. In this comparison, the outputs of vllm and the Hugging Face transformers library are consistent, but sglang does not produce consistent outputs. sglang sets _SAMPLING_EPS = 1e-6; even with temperature=1e-5 I still cannot obtain consistent outputs, and sglang produces different outputs on each run. What configuration should be set to make sglang's output deterministic?
Reproduction
python -m sglang.launch_server --model-path /xx/Qwen1.5-1.8B-Chat --port 30000 --tp 2 --enable-p2p-check --mem-fraction-static 0.7 --chunked-prefill-size 4096
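A minimal sketch of the comparison being described (not from the original report), assuming the server launched above is reachable on port 30000 and queried through SGLang's native /generate endpoint; the prompt, output length, and model path are placeholders:

```python
# Sketch: compare greedy outputs from Hugging Face transformers and a running
# SGLang server. Assumes the server was launched as in "Reproduction" above
# and that /xx/Qwen1.5-1.8B-Chat is the same local checkpoint.
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/xx/Qwen1.5-1.8B-Chat"  # placeholder path from the report
PROMPT = "The capital of France is"
MAX_NEW_TOKENS = 512

# Hugging Face greedy decoding (do_sample=False).
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16).cuda()
inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
hf_ids = model.generate(**inputs, do_sample=False, max_new_tokens=MAX_NEW_TOKENS)
hf_text = tokenizer.decode(hf_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# SGLang server, temperature=0 (greedy). Run several times to check determinism.
def sglang_generate():
    resp = requests.post(
        "http://127.0.0.1:30000/generate",
        json={
            "text": PROMPT,
            "sampling_params": {"temperature": 0, "max_new_tokens": MAX_NEW_TOKENS},
        },
    )
    return resp.json()["text"]

srt_texts = [sglang_generate() for _ in range(3)]
print("HF :", hf_text)
print("SRT:", srt_texts)
print("SGLang deterministic across runs:", len(set(srt_texts)) == 1)
print("SGLang matches HF:", srt_texts[0] == hf_text)
```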
Environment
Python: 3.9.19 (main, May 6 2024, 19:43:03) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 3090
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.91
CUDA Driver Version: 535.54.03
PyTorch: 2.4.0+cu121
sglang: 0.2.15
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.2
fastapi: 0.112.0
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.5
uvloop: 0.19.0
zmq: 26.1.0
vllm: 0.5.5
multipart: 0.0.9
openai: 1.40.2
anthropic: 0.33.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     PIX   NODE  NODE  SYS   SYS   SYS   SYS   0-19,40-59    0              N/A
GPU1  PIX   X     NODE  NODE  SYS   SYS   SYS   SYS   0-19,40-59    0              N/A
GPU2  NODE  NODE  X     PIX   SYS   SYS   SYS   SYS   0-19,40-59    0              N/A
GPU3  NODE  NODE  PIX   X     SYS   SYS   SYS   SYS   0-19,40-59    0              N/A
GPU4  SYS   SYS   SYS   SYS   X     PIX   NODE  NODE  20-39,60-79   1              N/A
GPU5  SYS   SYS   SYS   SYS   PIX   X     NODE  NODE  20-39,60-79   1              N/A
GPU6  SYS   SYS   SYS   SYS   NODE  NODE  X     PIX   20-39,60-79   1              N/A
GPU7  SYS   SYS   SYS   SYS   NODE  NODE  PIX   X     20-39,60-79   1              N/A
You might be using the method incorrectly. We ensure consistency with transformers through unit tests and CI https://github.com/sgl-project/sglang/blob/c500f96bb16c686ee8ba5d5f1fc716a0bd8e5fff/test/srt/models/test_generation_models.py#L64-L130
max_diff tensor(0.0251)
max_diff tensor(0.0225)
max_diff tensor(0.0333)
hf_outputs.output_strs=[' ________.(\u3000\u3000)\nA. London\nB. Paris\nC. Tokyo\nD. Beijing\n\n答案:A\n考查英文常识.根据', " to go out for a walk. I'm wearing my favorite pair of jeans, a white t-shirt, and a black jacket. My shoes are white sneakers,", ' developing intelligent machines that can perform tasks that typically require human intelligence, such as perception, reasoning, learning, and decision-making. The goal of AI is to create']
srt_outputs.output_strs=[' ________.(\u3000\u3000)\nA. London\nB. Paris\nC. Tokyo\nD. Beijing\n答案:A\n考查英文常识.A', " to go out for a walk. I'm wearing a pair of comfortable sneakers and a light jacket. I'm carrying a small backpack with some snacks and water.", ' developing intelligent machines that can perform tasks that typically require human intelligence, such as perception, reasoning, learning, and decision-making. The goal of AI is to create']
rouge_l_scores=[0.9705882352941176, 0.6222222222222222, 1.0]
F
FAIL: test_prefill_logits_and_output_strs (main.TestGenerationModels)
Traceback (most recent call last):
  File "/data/xx/sglang-main/test/srt/models/test_generation_models.py", line 123, in test_prefill_logits_and_output_strs
    self.assert_close_prefill_logits_and_output_strs(
  File "/data/xx/sglang-main/test/srt/models/test_generation_models.py", line 78, in assert_close_prefill_logits_and_output_strs
    if model_path == "/data/publish-data/pretrain_models/Qwen1.5-1.8B-Chat":
AssertionError: Not all ROUGE-L scores are greater than rouge_threshold=1
The test case has failed. I observed that the first few tokens remain consistent, but as the output gets longer, the probability of inconsistent outputs becomes very high. When deploying the same model with both vllm and sglang and sending a curl request with temperature set to 0, vllm's output at a length of 512 is deterministic and matches Hugging Face's, whereas sglang's output cannot even be made deterministic. I'm not sure if this is an issue with sglang or with my setup.
Also, in the unit test, max_new_tokens=32 is too small; a larger value should be used.
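For context on the failing assertion, here is a minimal sketch of an LCS-based ROUGE-L score (a generic implementation, not the exact one used by the SGLang test suite), which shows why rouge_threshold=1 only passes when the two outputs agree essentially token for token:

```python
# Minimal sketch of a ROUGE-L score (LCS-based F-measure) between two outputs.
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference: str, candidate: str) -> float:
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

hf_out = "A. London B. Paris C. Tokyo D. Beijing"
srt_out = "A. London B. Paris C. Tokyo D. Beijing answer A"
print(rouge_l(hf_out, srt_out))  # drops below 1.0 as soon as the outputs diverge
```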
For the accuracy evaluation of SGLang, you can verify it using https://github.com/fw-ai/llm_eval_meta to match the data in the official Llama 3.1 tech report. Regarding the issue you mentioned about the output not being deterministic when the temperature is set to 0, I will check on it when I have time. Currently, it does not seem to be a high priority.
The stability and accuracy of a framework should be prioritized over its features and performance. If it cannot ensure consistent output when do_sample is set to false, it would be very difficult for users or businesses to migrate their inference framework from vllm to sglang.
@cherishhh It seems you didn't understand my previous reply. Currently, the eval scores of SGLang and the official scores of Llama 3.1 are consistent, there is no issue with accuracy, and this will not affect its use in a production environment.
I understand what you mean. Overall, the evaluation is fine, such as the accuracy tests on various datasets. However, as a deployment engineer, if inconsistent results occur when do_sample is set to false, I might not be able to convince our algorithm colleagues. So I'm not sure whether this issue is due to my improper operation or whether others have also experienced it; there are still relatively few issues filed against sglang, and I couldn't find a similar report after searching the issue list. Additionally, every unit test must pass for CI to proceed, so the unit tests must have passed on a specific model and machine. Is max_new_tokens=32 too short? Changing it to 512 would make the test more stringent. Also, I noticed a comment on line 30 of test/runners.py: "The output of gemma-2-2b from SRT is unstable on the commented prompt." Did your colleagues encounter a similar phenomenon?
The output of gemma-2-2b from SRT is unstable on the commented prompt.
Google's Gemma-2 model uses interleaved window attention to reduce computational complexity for long contexts, alternating between local sliding window attention (4K context length) and global attention (8K context length) in every other layer. We enhanced SGLang v0.3 to fully support the 8K context length by leveraging the optimized window attention kernel from FlashInfer kernels (which skips computation instead of masking) and refining our KV cache manager. Other libraries that lack this feature can only run with a 4K context length.
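As a conceptual sketch of the interleaving described above (illustrative only, not SGLang's or FlashInfer's actual implementation), alternating layers restrict attention to a 4K sliding window while the others attend over the full 8K context:

```python
# Conceptual sketch of interleaved window attention: alternate layers use a
# 4K local sliding window, the others use global attention over the full 8K
# context. Layer count and layout are illustrative, not SGLang internals.
NUM_LAYERS = 26          # illustrative decoder layer count
SLIDING_WINDOW = 4096    # local attention span
CONTEXT_LENGTH = 8192    # global attention span

def attention_span(layer_idx: int, query_pos: int) -> range:
    """Return the key positions a query at `query_pos` may attend to."""
    if layer_idx % 2 == 0:
        # Local sliding-window layer: only the last SLIDING_WINDOW positions.
        start = max(0, query_pos - SLIDING_WINDOW + 1)
    else:
        # Global layer: the entire (causal) context.
        start = 0
    return range(start, query_pos + 1)

# A query at position 6000 sees 4096 keys on local layers, 6001 on global ones.
print(len(attention_span(0, 6000)), len(attention_span(1, 6000)))
```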
Thank you for your patient response. My colleague attended sglang's presentation last time, which introduced the flashinfer library. If I have time, I will also look into flashinfer.
I also noticed that for the qwen model, when the output length exceeds 32, using HFRunner and SRTRunner results in different prediction outcomes.
Not just Qwen; testing Llama3 8B also showed this phenomenon.
Update: Gemma 2 9B does this too, but Gemma 2 27B works fine.
I'm having this issue with 70B BF16, 405B BF16, and FP8. Is there a root cause analysis for this?
@cherishhh @Abdulhanan535 @tanmaylaud You can take a look at the implementation of https://github.com/fw-ai/llm_eval_meta/blob/main/analyze_answers.py. When evaluating, to determine whether a model answer is correct, you can refer to https://github.com/fw-ai/llm_eval_meta/blob/b1166abf1395eafd3a994aefed5f6a420e697289/analyze_answers.py#L107-L119. To be clear, the responses generated by SGLang currently meet the requirements of evaluation tasks. The point raised in this issue is that when do_sample is False or temperature is set to 0, the results may not completely match transformers'. These two are not equivalent. For example, if you try TensorRT LLM with temperature set to 0, the results will also differ from transformers'. I reiterate my opinion that SGLang currently does not affect online usage or deployment in formal production environments. That said, the cause of the issue mentioned here is worth looking into as well.
Even when the temperature is set to 0, sglang produces inconsistent outputs on each run. That is the key point. Could this variability come from the inherent randomness of the operators involved?
--disable-flashinfer-sampling
@CSEEduanyu Can the --disable-flashinfer-sampling flag eliminate this randomness?
#1589 This will improve determinism.
It is the flashinfer sampling that is causing the determinism issue. Do we have any fixes for that? Torch sampling with argmax is deterministic but it is slower.
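To illustrate the distinction, a small sketch contrasting argmax selection, which is deterministic for fixed logits, with categorical sampling, which depends on RNG state; this is illustrative only and does not reproduce the FlashInfer sampling kernel:

```python
# Sketch contrasting greedy (argmax) selection with categorical sampling.
# With identical logits, argmax always returns the same token, while
# torch.multinomial depends on the RNG state unless it is explicitly seeded.
import torch

logits = torch.tensor([2.0, 1.9, 0.5, -1.0])
probs = torch.softmax(logits, dim=-1)

greedy = [int(torch.argmax(logits)) for _ in range(5)]
sampled = [int(torch.multinomial(probs, num_samples=1)) for _ in range(5)]

print("argmax picks :", greedy)   # always the same index
print("sampled picks:", sampled)  # may vary from run to run
```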
Adding --disable-flashinfer-sampling solved the random-output problem, but when measuring the agreement rate, I found a large gap compared with vllm, and the quality was also poor.
I am using the latest version, commit hash c996e8ccd415f6e1077ace5bc645d19a8dd40203, with --disable-flashinfer-sampling added. With the same input, the output of the Qwen2-7B model is still random. Could you provide a ServerArgs configuration for reference?
@coderchem On which benchmark was the quality poor?
@kangqiyue Could you provide a reproducible script? SGLang performs almost identically to transformers on MMLU, GSM8K, and Human Eval. We did not observe the performance issues you mentioned.
@zhyncs I found that it was a mistake in my evaluation, not a problem with sglang or vllm. Sorry for the mistake.
This has been one of the biggest issues we've known about for a while. In short, I believe that dynamic batching introduces these variances because different batch sizes dispatch different kernels. We checked the engine implementation and did not find any noticeable bugs (e.g., incorrect caching). We will continue investigating and may introduce a "deterministic mode" as a short-term solution. This mode will use additional padding to increase determinism, although it will run more slowly.
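A small sketch of the effect described above, under the assumption that different batch shapes can take different kernels or reduction orders; it does not reproduce SGLang's scheduler, but shows how a tiny numeric difference at one decoding step could flip an argmax and change all subsequent tokens (requires a CUDA GPU; the diff may be zero on some setups):

```python
# Sketch of how batch shape can perturb floating-point results: the same row
# pushed through a matmul alone vs. inside a larger batch may differ slightly,
# because different shapes can dispatch different kernels / reduction orders.
# In autoregressive decoding such tiny differences can flip an argmax and
# change every subsequent token. Illustrative only.
import torch

torch.manual_seed(0)
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

alone = x @ w                             # "batch size 1"
batched = torch.cat([x] * 8, dim=0) @ w   # same row inside a batch of 8
print("max abs diff:", (alone[0] - batched[0]).abs().max().item())
```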
#1589 should make greedy decoding a little bit more deterministic for bs = 1.
Let us move the discussion to #1792
Was this ever resolved?