
Failed to run benchmark scripts against the endpoint

Jeffwan opened this issue 9 months ago · 6 comments

🐛 Describe the bug

python3 benchmark_serving.py --backend vllm  --model deepseek-ai/deepseek-r1 --trust-remote-code --served-model-name deepseek-r1-671b --base-url http://localhost:8888 --endpoint /v1/completions --num-prompts 100 --request-rate 2 --metric_percentiles '50,90,95,99' --goodput ttft:1000 tpot:100 --max-concurrency 200 --random-input-len 2048 --random-output-len 200 --dataset-name random --ignore-eos 
Starting initial single prompt test run...
RequestFuncOutput(generated_text='', success=False, latency=0.0, output_tokens=0, ttft=0.0, itl=[], tpot=0.0, prompt_len=2048, error='Bad Request')
Traceback (most recent call last):
  File "/Users/bytedance/workspace/vllm/benchmarks/benchmark_serving.py", line 1315, in <module>
    main(args)
  File "/Users/bytedance/workspace/vllm/benchmarks/benchmark_serving.py", line 951, in main
    benchmark_result = asyncio.run(
  File "/Users/bytedance/.pyenv/versions/3.10.10/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/bytedance/.pyenv/versions/3.10.10/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/bytedance/workspace/vllm/benchmarks/benchmark_serving.py", line 602, in benchmark
    raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Bad Request

Gateway logs:

I0303 00:53:49.475583       1 gateway.go:221]
I0303 00:53:49.475604       1 gateway.go:222] "-- In RequestHeaders processing ..." requestID="4cb7758a-aa7c-49e5-a6d0-8243aba62a19"
I0303 00:53:49.475949       1 gateway.go:287] "-- In RequestBody processing ..." requestID="4cb7758a-aa7c-49e5-a6d0-8243aba62a19"
I0303 00:53:49.476224       1 gateway.go:388] "request start" requestID="4cb7758a-aa7c-49e5-a6d0-8243aba62a19" model="deepseek-r1-671b" routingStrategy="random" targetPodIP="192.168.0.74:8000"
I0303 00:53:49.477602       1 gateway.go:407] "-- In ResponseHeaders processing ..." requestID="4cb7758a-aa7c-49e5-a6d0-8243aba62a19"
I0303 00:53:49.477827       1 gateway.go:440] "-- In ResponseBody processing ..." requestID="4cb7758a-aa7c-49e5-a6d0-8243aba62a19" endOfSteam=false
I0303 00:53:49.477858       1 gateway.go:440] "-- In ResponseBody processing ..." requestID="4cb7758a-aa7c-49e5-a6d0-8243aba62a19" endOfSteam=false
I0303 00:53:49.477869       1 gateway.go:440] "-- In ResponseBody processing ..." requestID="4cb7758a-aa7c-49e5-a6d0-8243aba62a19" endOfSteam=true

192.168.0.74 is the head pod, but no request is reaching the engine side. Could this be a streaming issue?
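To isolate where the Bad Request originates, a minimal check (a sketch only; the gateway URL and pod IP are taken from the command and logs above) is to send the same streaming completion request to the gateway and to the engine pod directly:

```python
# Minimal isolation test: send the same /v1/completions request to the
# gateway and directly to the engine pod, to see which hop returns the
# Bad Request. Sketch only; endpoints are taken from the logs above.
import requests

payload = {
    "model": "deepseek-r1-671b",
    "prompt": "hello",
    "max_tokens": 16,
    "stream": True,       # benchmark_serving.py consumes a streamed response
    "ignore_eos": True,
}

for base in ("http://localhost:8888", "http://192.168.0.74:8000"):
    resp = requests.post(f"{base}/v1/completions", json=payload, stream=True)
    print(base, resp.status_code)
    if resp.status_code != 200:
        print(resp.text)  # surface the gateway/engine error body
    else:
        print(next(resp.iter_lines(), b"<no chunk>"))  # first SSE chunk
```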

Steps to Reproduce

deepseek-r1.yaml

Expected behavior

The benchmark should run successfully.

Environment

AIBrix 0.2.0

Jeffwan · Mar 03 '25

Might be related to #757. I discovered that issue when using SGLang's bench_serving, which should be quite similar to vLLM's benchmark_serving.

gau-nernst · Mar 03 '25

> 🐛 Describe the bug
>
> python3 benchmark_serving.py --backend vllm  --model deepseek-ai/deepseek-r1 --trust-remote-code --served-model-name deepseek-r1-671b --base-url http://localhost:8888 --endpoint /v1/completions --num-prompts 100 --request-rate 2 --metric_percentiles '50,90,95,99' --goodput ttft:1000 tpot:100 --max-concurrency 200 --random-input-len 2048 --random-output-len 200 --dataset-name random --ignore-eos

Using this script, I am able to run the benchmark. It requires the fix in PR #794, and since our local cluster has no HTTPRoute, I had to use the routing-strategy header to work around it.
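For reference, the routing-strategy workaround looks roughly like this (a sketch; the header name and values follow the AIBrix routing docs, and your gateway address may differ):

```python
# Sketch: bypass HTTPRoute-based routing by sending an explicit
# routing-strategy header to the AIBrix gateway. The header value is an
# assumption based on the documented strategies (random, least-request,
# throughput, prefix-cache); adjust for your cluster.
import requests

resp = requests.post(
    "http://localhost:8888/v1/completions",
    headers={"routing-strategy": "random"},
    json={"model": "deepseek-r1-671b", "prompt": "hello", "max_tokens": 8},
)
print(resp.status_code, resp.text)
```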

[screenshot attached]

varungup90 · Mar 05 '25

@varungup90 I made the above change, but it doesn't seem to be working. Your script is exactly the same as the one I attached in the issue, right?


Just noticed your test was against https://github.com/vllm-project/aibrix/pull/794. I left some comments there. I think we should completely remove this validation from the gateway side and instead handle error responses from the engines better. We should not make any changes in the gateway that may break OpenAI API compatibility; recently we have had a few cases like this, e.g. streaming, error response handling, and arg validation.

Jeffwan · Mar 05 '25

> I think we should completely remove this validation from the gateway side

This is not validation; it reads the message for the prefix cache. One alternative is to call getRequestMessage only for the prefix-cache routing strategy, but that would still need the same fix.
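To make the alternative concrete, a rough sketch of the proposed flow (illustrative Python; the real gateway is Go, getRequestMessage is its internal helper, and the names here are approximations, not the actual code):

```python
# Illustrative sketch: only parse the request body when the routing
# strategy actually needs the message (prefix-cache), so a body the
# parser cannot handle no longer fails unrelated strategies.
import json
from typing import Optional

def extract_request_message(body: bytes) -> str:
    # Rough getRequestMessage equivalent: pull the prompt (completions)
    # or the concatenated message contents (chat) out of the JSON body.
    data = json.loads(body)
    if "prompt" in data:
        return data["prompt"]
    return " ".join(m.get("content", "") for m in data.get("messages", []))

def message_for_routing(body: bytes, strategy: str) -> Optional[str]:
    if strategy == "prefix-cache":
        return extract_request_message(body)
    return None  # other strategies never touch the body
```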

varungup90 · Mar 05 '25

@varungup90 did you test against the DeepSeek model? I used the latest release-0.2 version and the problem still exists; I tried both HTTPRoute and the custom routing strategy.

[screenshot attached]

INFO 03-09 11:49:41 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
Namespace(backend='vllm', base_url='http://localhost:8888', host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=200, model='deepseek-ai/deepseek-r1', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=100, logprobs=None, request_rate=2.0, burstiness=1.0, seed=0, trust_remote_code=True, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=True, percentile_metrics='ttft,tpot,itl', metric_percentiles='50,90,95,99', goodput=['ttft:1000', 'tpot:100'], sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=2048, random_output_len=200, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name='deepseek-r1-671b', lora_modules=None)
Starting initial single prompt test run...
Traceback (most recent call last):
  File "/Users/bytedance/workspace/vllm/benchmarks/benchmark_serving.py", line 1314, in <module>
    main(args)
  File "/Users/bytedance/workspace/vllm/benchmarks/benchmark_serving.py", line 950, in main
    benchmark_result = asyncio.run(
  File "/Users/bytedance/.pyenv/versions/3.10.10/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/bytedance/.pyenv/versions/3.10.10/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/bytedance/workspace/vllm/benchmarks/benchmark_serving.py", line 601, in benchmark
    raise ValueError(
ValueError: Initial test run failed - Please make sure benchmark arguments are correctly specified. Error: Never received a valid chunk to calculate TTFT.This response will be marked as failed!
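For context, the benchmark client derives TTFT from the first successfully parsed streamed chunk; if every chunk fails to parse (for example, because the gateway altered the stream), there is no TTFT and the request is marked failed, which produces the error above. A simplified sketch of that logic (not vLLM's exact code):

```python
# Simplified sketch: TTFT is the time from sending the request to the
# first valid streamed (SSE) chunk. With no valid chunk, the request is
# marked failed. Not vLLM's exact code.
import time

def measure_ttft(sse_lines, start_time: float) -> float:
    for raw in sse_lines:
        data = raw.removeprefix("data: ").strip()
        if data and data != "[DONE]":
            return time.perf_counter() - start_time  # first valid chunk
    raise ValueError("Never received a valid chunk to calculate TTFT.")
```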

Jeffwan · Mar 09 '25

Varun found it's more likely a data issue. I switched from the random dataset to the ShareGPT dataset: --dataset-name sharegpt --dataset-path /Users/bytedance/Downloads/ShareGPT_V3_unfiltered_cleaned_split.json

Note: --dataset-name sonnet --dataset-path /Users/bytedance/workspace/vllm/benchmarks/sonnet.txt works as well.

The dataset is here:

https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json

The full command is:

python benchmark_serving.py --backend vllm  --model deepseek-ai/deepseek-r1 --trust-remote-code --served-model-name deepseek-r1-671b --base-url http://localhost:8888 --endpoint /v1/completions --num-prompts 100 --request-rate 2 --metric_percentiles '50,90,95,99' --goodput ttft:1000 tpot:100 --max-concurrency 200 --random-input-len 2048 --random-output-len 200 --dataset-name sharegpt --dataset-path /Users/bytedance/Downloads/ShareGPT_V3_unfiltered_cleaned_split.json --ignore-eos

At least we can now confirm that the previous code fix works fine (streaming + completions API). The special-character issue needs more investigation in https://github.com/vllm-project/aibrix/issues/832; let's track it separately.
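For context on why random triggers this while sharegpt and sonnet do not: the random dataset builds prompts by decoding randomly sampled token IDs, which can produce unusual unicode or special characters that a strict body parser may reject. A minimal sketch of the effect (simplified, not vLLM's exact sampling code; gpt2 is used here only as a lightweight tokenizer):

```python
# Minimal sketch of why "random" prompts can contain problematic text:
# decoding arbitrary token IDs often yields odd unicode or control-like
# characters. Simplified; not vLLM's exact sampling code.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
token_ids = [random.randrange(tokenizer.vocab_size) for _ in range(32)]
print(repr(tokenizer.decode(token_ids)))
```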

Jeffwan · Mar 09 '25

This is fixed now.

varungup90 · Jun 02 '25