
[Performance][DeepGEMM] Estimate expected_m

varun-sundar-rabindranath opened this issue 1 month ago • 2 comments

Purpose

The DeepGEMM fp8_m_grouped_gemm_nt_masked kernel takes an expected_m parameter. This parameter is a hint to the DeepGEMM kernel indicating the estimated number of tokens per expert. On main we simply set it to the maximum number of tokens an expert can have. This PR updates that estimation logic.

Estimation method: We assume tokens are routed uniformly across experts, compute the expected value of M under that assumption, and round it up to the next multiple of 16.
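The estimation above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name and signature are assumptions:

```python
def estimate_expected_m(num_tokens: int, num_experts: int, topk: int,
                        alignment: int = 16) -> int:
    """Estimate the expected number of tokens per expert.

    Assuming uniform routing, each of the num_tokens * topk
    (token, expert) assignments is equally likely to land on any
    expert, so the expected count per expert is
    num_tokens * topk / num_experts. The estimate is then rounded
    up to the next multiple of `alignment` (16, per this PR).
    """
    # Ceiling division for the expected tokens per expert.
    expected = (num_tokens * topk + num_experts - 1) // num_experts
    # Round up to the next multiple of `alignment`.
    return ((expected + alignment - 1) // alignment) * alignment
```

For example, with 1024 tokens, 128 experts, and topk=8, the expected M is 64; with only 100 tokens the raw expectation (7) rounds up to 16.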

Please take a look at Estimate_m step fn for the "expected_m" step function at different CUDA graph sizes and DP sizes.

Performance

Please take a look at round-16. TL;DR: performance improves by up to 11%.

Test Plan

Server:

```
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 canhazgpu run -g2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
```

lm-eval:

```
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://localhost:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100
```

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.90|±  |0.0302|
|     |       |strict-match    |     5|exact_match|↑  | 0.92|±  |0.0273|

I think it makes sense to just round up to multiples of 16; rounding to a power of 2 could be too aggressive. I'll update the PR to see if that is better.

Done. The results are better.