
Results 12 comments of linqingxu

TTFT latency for long context (10k), sglang 0.3.5, GPU: Radeon RX 7900 XTX.

batch_size=8, mean TTFT: 3496.12 ms ![image](https://github.com/user-attachments/assets/73911ba4-6e82-4ebc-b902-6c4185f988b1)

batch_size=16, mean TTFT: 12495.6 ms ![image](https://github.com/user-attachments/assets/c63e0a6a-d011-4535-809c-6c35769b3bc0)

The same thing happens with the latest version. During load testing, requests start failing at a concurrency of 7, and at a concurrency of 16 all of them fail.

```
2024-09-25 02:58:02,170 xinference.api.restful_api 1 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=38154) during chat.
2024-09-25 02:58:02,181 xinference.api.restful_api 1 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=38166) during chat.
2024-09-25 02:58:02,189...
```

```
Error for prompt with length 5520:
Traceback (most recent call last):
  File "/opt/inference/benchmark/benchmark_runner.py", line 151, in send_request
    data = json.loads(chunk)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py",...
```
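The traceback points at `json.loads(chunk)` being called on a raw network chunk. Streamed responses can split or merge server-sent events arbitrarily across chunks, so parsing each chunk directly will fail exactly when responses get long, which matches the symptom. A minimal sketch of the buffered parsing that avoids this, assuming an SSE-style `data:` line format (the function name and stream shape here are illustrative, not the actual benchmark_runner.py code):

```python
import json


def iter_sse_json(chunks):
    """Yield parsed JSON payloads from a stream of SSE text chunks.

    Network chunks may contain a partial event or several events at once,
    so we buffer until a newline and only parse complete "data:" lines.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if not line.startswith("data:"):
                continue  # skip keep-alives, comments, blank lines
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # OpenAI-style stream terminator
                return
            yield json.loads(payload)


# A JSON object split across two chunks parses fine once buffered:
events = list(iter_sse_json(['data: {"a"', ': 1}\ndata: {"b": 2}\n',
                             'data: [DONE]\n']))
```

This is only a sketch of the failure mode; whether benchmark_runner.py needs exactly this fix depends on how its HTTP client delivers chunks.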

> Can you reproduce it with our benchmark?

Yes, that is what I was using: benchmark/benchmark_serving.py as provided by xinference, version 0.15.2.

Is there a fix for this issue? Whenever the output exceeds 1k tokens, the concurrent benchmark runs into this problem.

It happens with both sglang and vllm: when outputs are long (a single request producing 800 tokens), the load test fails even at a concurrency of 2.