Track expert selection metrics
Purpose
The goal is to track expert imbalance and make those metrics available in Prometheus.
How to use
Set VLLM_COLLECT_EXPERT_USAGE_HISTOGRAM=1 to enable this feature.
Make sure that PROMETHEUS_MULTIPROC_DIR is also set, otherwise the metrics will not be reported correctly.
The moe_expert_selection metric will then be available in Prometheus at runtime.
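For concreteness, a minimal sketch of setting this up from Python before launching the server. The environment variable and metric names come from this PR's description; the launch command and port in the comment are assumptions:

```python
import os
import tempfile

# Enable expert-usage histogram collection (flag from this PR).
os.environ["VLLM_COLLECT_EXPERT_USAGE_HISTOGRAM"] = "1"

# prometheus_client multiprocess mode needs a writable directory,
# otherwise the histogram will not show up in /metrics.
os.environ["PROMETHEUS_MULTIPROC_DIR"] = tempfile.mkdtemp()

print(os.environ["VLLM_COLLECT_EXPERT_USAGE_HISTOGRAM"])
```

With these set, launch the server as usual (e.g. `vllm serve <model>`) and scrape the server's `/metrics` endpoint for `moe_expert_selection`.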
Performance considerations
Expect at most ~2% end-to-end overhead when running with this enabled; GPU-side overhead is negligible. Note that this PR does not enable anything by default, so baseline performance is untouched.
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Ryp.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
We just merged EPLB from @abmfy (cc @WoosukKwon). Please rebase — we would love to expose these core metrics!
Ready for review!
Sorry for the delay in reviewing. After https://github.com/vllm-project/vllm/pull/20562 is merged, this could be enabled using that config instead of with an environment variable.
@Ryp hi, can you provide an example of the metric as returned by the /metrics API?
Does this method support tp>1? When tp>1, is cross-rank aggregation of the expert load counts required? Can it also record per-physical-expert load? If physical-expert load is reported, a mapping from physical experts back to logical experts may also be needed.
Hello, is this still active? I believe EPLB could leverage the expert selection metrics collection from this PR to avoid duplicating efforts. Should we consider refactoring it to make it compatible with EPLB?
Hello, I'm also interested in EPLB metrics. Does this PR currently support TP > 1? When TP > 1, an all-gather communication is required.
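To illustrate what such an aggregation could look like, here is a hypothetical sketch (not this PR's actual implementation) that sums per-expert selection counts across ranks with an all-reduce. The function name and the single-process gloo demo are illustrative assumptions:

```python
import os
import torch
import torch.distributed as dist

def aggregate_expert_counts(local_counts: torch.Tensor) -> torch.Tensor:
    """Sum per-expert token counts across all ranks in the group.

    With TP > 1, each rank contributes its local counts and receives
    the global sum; with a single rank this is a no-op.
    """
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(local_counts, op=dist.ReduceOp.SUM)
    return local_counts

if __name__ == "__main__":
    # Single-process demo (world_size=1) using the gloo backend.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    counts = torch.tensor([3, 1, 0, 4])  # tokens routed to each of 4 experts
    print(aggregate_expert_counts(counts).tolist())
    dist.destroy_process_group()
```

In a real TP setup, the summed counts would then be recorded into the Prometheus histogram by a single rank to avoid double counting.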
@mickaelseznec will take over this PR; updates will come from him. Thanks!
Expert selection tracking is moved to https://github.com/vllm-project/vllm/pull/27105.