
[Performance][DeepGEMM] Estimate expected_m

varun-sundar-rabindranath opened this issue 1 month ago • 2 comments

Purpose

The DeepGEMM fp8_m_grouped_gemm_nt_masked kernel takes an expected_m parameter. This parameter is a hint to the DeepGEMM kernel indicating the estimated number of tokens per expert. On main we simply set it to the maximum number of tokens an expert can have. This PR updates that estimation logic.

Estimation method: We assume tokens are routed uniformly across experts, compute the expected value of M under that assumption, and round it up to the next multiple of 16.
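The estimation above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name and signature are assumptions:

```python
def estimate_expected_m(num_tokens: int, num_experts: int, topk: int,
                        alignment: int = 16) -> int:
    """Estimate the expected number of tokens per expert.

    Assuming uniform routing, each of the num_tokens * topk
    (token, expert) assignments is equally likely to land on any
    expert, so the expected count per expert is
    num_tokens * topk / num_experts. The estimate is then rounded
    up to the next multiple of `alignment` (16, per this PR).
    """
    # Ceiling division for the expected tokens per expert.
    expected = (num_tokens * topk + num_experts - 1) // num_experts
    # Round up to the next multiple of `alignment`.
    return ((expected + alignment - 1) // alignment) * alignment
```

For example, with 1024 tokens, 128 experts, and topk=8, the expected M is 64; with only 100 tokens the raw expectation (7) rounds up to 16.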

Please take a look at Estimate_m step fn for the "expected_m" step function at different CUDA graph sizes and DP sizes.

Performance

Please take a look at round-16. TL;DR: performance improves by up to 11%.

Test Plan

Server:

```
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 canhazgpu run -g2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
```

lm-eval:

```
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://localhost:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100
```

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.90|±  |0.0302|
|     |       |strict-match    |     5|exact_match|↑  | 0.92|±  |0.0273|

I think it makes sense to just round up to multiples of 16; rounding to a power of 2 could be too aggressive. I'll update the PR to see if that is better.

Done. The results are better.