
Results 12 comments of linqingxu

TTFT latency for long context (10k), sglang 0.3.5, GPU: Radeon RX 7900 XTX.

batch_size=8, mean TTFT: 3496.12 ms ![image](https://github.com/user-attachments/assets/73911ba4-6e82-4ebc-b902-6c4185f988b1)

batch_size=16, mean TTFT: 12495.6 ms ![image](https://github.com/user-attachments/assets/c63e0a6a-d011-4535-809c-6c35769b3bc0)

The same thing happens with the latest version. During load testing, requests start failing at a concurrency of 7, and at a concurrency of 16 all of them fail.

```
2024-09-25 02:58:02,170 xinference.api.restful_api 1 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=38154) during chat.
2024-09-25 02:58:02,181 xinference.api.restful_api 1 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=38166) during chat.
2024-09-25 02:58:02,189...
```

```
Error for prompt with length 5520:
Traceback (most recent call last):
  File "/opt/inference/benchmark/benchmark_runner.py", line 151, in send_request
    data = json.loads(chunk)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py",...
```
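The traceback points at `json.loads(chunk)` being called on a raw network chunk. Streamed responses can split or merge server-sent events arbitrarily across chunks, so parsing each chunk directly will fail exactly when responses get long, which matches the symptom. A minimal sketch of the buffered parsing that avoids this, assuming an SSE-style `data:` line format (the function name and stream shape here are illustrative, not the actual benchmark_runner.py code):

```python
import json


def iter_sse_json(chunks):
    """Yield parsed JSON payloads from a stream of SSE text chunks.

    Network chunks may contain a partial event or several events at once,
    so we buffer until a newline and only parse complete "data:" lines.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if not line.startswith("data:"):
                continue  # skip keep-alives, comments, blank lines
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # OpenAI-style stream terminator
                return
            yield json.loads(payload)


# A JSON object split across two chunks parses fine once buffered:
events = list(iter_sse_json(['data: {"a"', ': 1}\ndata: {"b": 2}\n',
                             'data: [DONE]\n']))
```

This is only a sketch of the failure mode; whether benchmark_runner.py needs exactly this fix depends on how its HTTP client delivers chunks.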

> Can you reproduce it with our benchmark?

Yes, that is what I was using: benchmark/benchmark_serving.py as provided by xinference, version 0.15.2.

Is there a fix for this issue? Whenever the output exceeds 1k tokens, the concurrent benchmark runs into this problem.

It happens with both sglang and vllm: when outputs are long (a single request producing 800 tokens), the load test fails even at a concurrency of 2.