linqingxu
TTFT latency for long context (10k), sglang 0.3.5, GPU: Radeon RX 7900 XTX. batch_size=8: mean TTFT 3496.12 ms; batch_size=16: mean TTFT 12495.6 ms.
The same thing happens with the latest version. During load testing, requests start failing once concurrency reaches 7, and at concurrency 16 every request fails.
2024-09-25 02:58:02,170 xinference.api.restful_api 1 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=38154) during chat.
2024-09-25 02:58:02,181 xinference.api.restful_api 1 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=38166) during chat.
2024-09-25 02:58:02,189...
Error for prompt with length 5520: Traceback (most recent call last):
  File "/opt/inference/benchmark/benchmark_runner.py", line 151, in send_request
    data = json.loads(chunk)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py",...
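The traceback shows `json.loads(chunk)` being called on a raw network chunk. Under concurrent load, a streaming response chunk can end mid-JSON-object, which raises `JSONDecodeError` even though the server's output is fine. A minimal sketch of buffered, newline-delimited parsing that tolerates such splits (the function name and the SSE `data:` framing here are assumptions for illustration, not xinference's actual code):

```python
import json

def iter_json_lines(chunks):
    """Accumulate raw stream chunks and yield complete JSON records.

    A chunk boundary may fall inside a JSON object; buffering until a
    full newline-terminated line arrives avoids decoding partial data.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if line.startswith("data:"):       # strip SSE framing, if present
                line = line[len("data:"):].strip()
            if not line or line == "[DONE]":   # skip keep-alives / terminator
                continue
            yield json.loads(line)

# A payload split across two chunks, mid-object:
chunks = ['data: {"text": "hel', 'lo"}\n', 'data: [DONE]\n']
print(list(iter_json_lines(chunks)))  # [{'text': 'hello'}]
```

Decoding each chunk in isolation, as the traceback suggests the benchmark does, fails exactly when load increases and chunk boundaries stop lining up with record boundaries.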
> Can you reproduce it with our benchmark as well?

Yes, I used the benchmark/benchmark_serving.py that xinference provides, version 0.15.2.
Is there a fix for this? The concurrent benchmark hits this error whenever the output exceeds 1k tokens.
It happens with both sglang and vllm: with longer outputs (around 800 tokens per request), the load test fails even at a concurrency of 2.