[Performance][DeepGEMM] Estimate expected_m
## Purpose
The DeepGEMM `fp8_m_grouped_gemm_nt_masked` kernel takes an `expected_m` parameter. This parameter is a hint to the DeepGEMM kernel indicating the estimated number of tokens per expert. On main we simply set it to the maximum number of tokens an expert can have. This PR updates that estimation logic.
Estimation method: we assume tokens are uniformly distributed across experts and round the expected value of M up to the next multiple of 16.
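A minimal sketch of this estimation (the function and parameter names here are illustrative, not the exact vLLM implementation): under a uniform distribution, the expected tokens per expert is `num_tokens * topk / num_experts`, rounded up to a multiple of 16.

```python
import math


def estimate_expected_m(num_tokens: int, topk: int, num_experts: int) -> int:
    """Estimate the expected number of tokens per expert.

    Assumes tokens are routed uniformly at random, so each of the
    num_tokens * topk (token, expert) assignments lands on each expert
    with equal probability. The result is rounded up to the next
    multiple of 16, with a floor of 16, to match the kernel's tiling.
    """
    expected = num_tokens * topk / num_experts
    return max(16, math.ceil(expected / 16) * 16)
```

For example, with 1024 tokens, top-8 routing, and 128 experts, the expected value is 64 tokens per expert, which is already a multiple of 16.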
Please take a look at the "expected_m" step function for different CUDA graph sizes and DP sizes.
## Performance
Please take a look at the round-16 results. TL;DR: performance improves by up to 11%.
## Test Plan
Serve:

```
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 canhazgpu run -g2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
```
lm-eval:

```
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://localhost:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100
```
## Test Result
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ | 0.90|± |0.0302|
| | |strict-match | 5|exact_match|↑ | 0.92|± |0.0273|
I think it makes sense to just round up to multiples of 16. Rounding up to a power of 2 could be too aggressive. I'll update the PR and see whether that is better.
Done. The results are better.