Current vidur backend support
We’re attempting to reproduce the simulation results and observed that when comparing against vLLM 0.9.1 benchmarks, the P50 latency differs by 700%. May I ask if vLLM v1 is supported by Vidur? If not, which framework and version does Vidur use to reproduce the published results?
Specifically, when running the example command in the README (shown below), which LLM engine should we use to validate the simulation output? Is Vidur’s simulation based on vLLM or Sarathi-Serve?
When using vLLM 0.9.1, the mooncake_conversation_trace.csv trace fails because the total token length exceeds the max_model_len = 8192 limit for Meta-Llama-3-8B. Even after scaling down the token lengths (a rough sketch of that preprocessing follows the command below) and rerunning, the simulated latency still does not match vLLM’s measurements. Which framework does Vidur currently support, and what trace/configuration settings would you recommend for reproducing the results accurately?
```bash
python -m vidur.main \
  --time_limit 10800 \
  --replica_config_model_name meta-llama/Meta-Llama-3-8B \
  --replica_config_device h100 \
  --replica_config_network_device h100_dgx \
  --cluster_config_num_replicas 8 \
  --replica_config_tensor_parallel_size 1 \
  --replica_config_num_pipeline_stages 1 \
  --request_generator_config_type synthetic \
  --synthetic_request_generator_config_num_requests 128 \
  --length_generator_config_type trace \
  --trace_request_length_generator_config_trace_file ./data/processed_traces/mooncake_conversation_trace.csv \
  --interval_generator_config_type poisson \
  --poisson_request_interval_generator_config_qps 8.0 \
  --global_scheduler_config_type round_robin \
  --replica_scheduler_config_type vllm_v1 \
  --vllm_v1_scheduler_config_chunk_size 512 \
  --vllm_v1_scheduler_config_batch_size_cap 512 \
  --cache_config_enable_prefix_caching
```
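Roughly, the kind of preprocessing I mean is sketched below; the column names (`num_prefill_tokens`, `num_decode_tokens`) are assumptions about the trace schema and may need adjusting to whatever mooncake_conversation_trace.csv actually contains:

```python
# Sketch: clip the trace so prefill + decode tokens fit within
# max_model_len = 8192 for Meta-Llama-3-8B. Column names are assumptions;
# adjust them to the actual schema of mooncake_conversation_trace.csv.
import pandas as pd

MAX_MODEL_LEN = 8192

df = pd.read_csv("./data/processed_traces/mooncake_conversation_trace.csv")

total = df["num_prefill_tokens"] + df["num_decode_tokens"]
over = total > MAX_MODEL_LEN
print(f"{over.sum()} of {len(df)} requests exceed max_model_len")

# Drop requests that do not fit (shrinking oversized prompts proportionally
# would be an alternative, but changes the workload shape more subtly).
clipped = df[~over].copy()
clipped.to_csv(
    "./data/processed_traces/mooncake_conversation_trace_clipped.csv", index=False
)
```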
Hi @nba556677go,
There is a `vidur` branch in the sarathi-serve repo; that would be the closest baseline for Vidur. The next closest is the main branch. vLLM has undergone a tremendous amount of development since Vidur was released in mid-2024. Still, a 700% error means something fundamental is wrong. Things to look out for:
- Try disabling prefix caching and reconciling the two systems first. To reconcile with prefix caching enabled, Vidur would need to support actual token IDs in requests, which it doesn’t today: that is what vLLM (or any real inference system) uses, whereas Vidur currently only carries hashes of blocks of token IDs in its requests.
- (General advice) Start with static workloads where all requests arrive right at the start. Then proceed to experiments where requests arrive at a given QPS. Note that beyond a certain (capacity) QPS, latencies of both the actual system and Vidur shoot up and it is not possible to reconcile the two. See the guide on how to run Vidur with different workloads. A minimal sketch of a static, prefix-caching-disabled measurement follows this list.
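For the two points above, a rough sketch of what the vLLM side of such a measurement could look like, using vLLM’s offline `LLM` API so that all requests are submitted at once; the prompts, token counts, and batch-level timing are illustrative placeholders, not a prescribed benchmark:

```python
# Sketch: run a small static batch in vLLM with prefix caching disabled and
# time it coarsely, to compare against Vidur's simulated latencies.
# Prompts and sampling settings below are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    max_model_len=8192,
    enable_prefix_caching=False,  # reconcile without prefix caching first
)

prompts = ["Summarize the history of distributed systems."] * 5
sampling = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

# This measures the batch makespan; per-request e2e latencies would come from
# a serving-style benchmark, but this is a quick first-order check.
print(f"batch e2e time: {elapsed:.2f}s for {len(outputs)} requests")
```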
I would try to match Vidur with vLLM v1 today, as it is up to date and has chunked prefill incorporated.
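To compare the two sides, something along these lines can work; the `simulator_output` path and the `request_e2e_time` column name are assumptions that may differ across Vidur versions, so check the actual output files:

```python
# Sketch: compute P50 end-to-end latency from Vidur's per-request output and
# compare it against a measured vLLM value. The output path and the
# "request_e2e_time" column name are assumptions; inspect your output dir.
import glob
import pandas as pd

metrics_file = sorted(glob.glob("simulator_output/*/request_metrics.csv"))[-1]
sim = pd.read_csv(metrics_file)

sim_p50 = sim["request_e2e_time"].quantile(0.5)
measured_p50 = 4.2  # seconds; placeholder for your vLLM measurement

error = abs(sim_p50 - measured_p50) / measured_p50 * 100
print(f"simulated P50 = {sim_p50:.2f}s, measured P50 = {measured_p50:.2f}s, "
      f"error = {error:.0f}%")
```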
It does seem that the QPS needs to be selected carefully. I switched to a static workload with just 5 requests sent to vLLM v1, and the P50 e2e latency error is now 30%.
Do you have a figure for the maximum QPS capacity of H100/A100 on the Meta-Llama-3-8B model?
Also, since running Sarathi-Serve is not useful for us, do you think the 30% error can be mostly reduced, without reconciling token IDs, by
- running the vLLM version closest to Sarathi-Serve (0.5.1?), or
- reprofiling the device? I am using an AWS P5 instance rather than an H100 DGX, and I am not sure whether the overall system stack affects performance.
I am trying to evaluate which approach would be more helpful toward the goal of running Vidur against vLLM v1.