feat: plot vllm internal metrics to the wandb log
What does this PR do?
- Builds upon [#1534] by tracking two additional vLLM metrics: kv_cache_usage_perc and generation_tokens.
- Adds W&B plotting for all vLLM metrics introduced in this PR and the previous one.
Issues
List issues that this PR closes (syntax):
Usage
- You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
- [ ] Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.
Additional Information
- ...
Summary by CodeRabbit
- New Features
  - New vLLM metrics tracked and logged: KV cache usage percentage, preemption count, and generation tokens are now collected and visible in monitoring dashboards when metrics logging is configured.
  - Per-worker timeline visualization displays granular per-worker metric data as individual series on shared plots with a synchronized time axis.
📝 Walkthrough
The changes extend vLLM metrics collection to include three new metrics (kv_cache_usage_perc, num_preemptions, generation_tokens), add a per-worker timeline visualization utility to the logger, create a helper function to log these metrics to wandb, and integrate the logging into the GRPO training loop when configured.
Changes
| Cohort / File(s) | Summary |
|---|---|
| vLLM metrics collection<br/>nemo_rl/models/generation/vllm/vllm_generation.py, nemo_rl/models/generation/vllm/vllm_worker_async.py | Added support for three new vLLM metrics: kv_cache_usage_perc (float), num_preemptions (int), and generation_tokens (int). Extended metrics initialization, accumulation, and return types to include these new fields. |
| Logging utilities<br/>nemo_rl/algorithms/utils.py | Added new public function log_vllm_metrics_to_wandb that iterates over vLLM metrics and logs them per-worker via the Logger instance. |
| Logger enhancement<br/>nemo_rl/utils/logger.py | Added new public method log_plot_per_worker_timeline_metrics that constructs per-worker time-series plots and delegates logging to all configured backends. |
| GRPO training integration<br/>nemo_rl/algorithms/grpo.py | Imported log_vllm_metrics_to_wandb and added conditional calls to log vLLM metrics to wandb when configured in both synchronous and asynchronous training paths. |
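The logging utility and the logger method summarized in the table can be pictured with a minimal sketch. This is not the PR's code: the metric layout (metric name, then worker index, then a list of samples), the `interval_s` sampling interval, the `vllm_metrics` name prefix, the `SketchLogger` and `RecordingBackend` classes, and every signature below are assumptions made for illustration. Only the idea of iterating over metrics, building one series per worker on a shared time axis, and delegating to each backend's `log_plot` comes from the summary above.

```python
from typing import Dict, List, Tuple

# Assumed layout: metric name -> worker index -> samples in collection order.
VllmMetrics = Dict[str, Dict[int, List[float]]]


class RecordingBackend:
    """Stand-in for a real backend such as W&B; it only records plot requests."""

    def __init__(self) -> None:
        self.plots: List[Tuple[str, dict, int]] = []

    def log_plot(self, name: str, series: dict, step: int) -> None:
        self.plots.append((name, series, step))


class SketchLogger:
    """Hypothetical logger that fans one plot out to every configured backend."""

    def __init__(self, backends) -> None:
        self.backends = list(backends)

    def log_plot_per_worker_timeline_sketch(self, per_worker, interval_s, name, step):
        # One series per worker; every series uses the same time axis.
        series = {}
        for worker_idx, samples in sorted(per_worker.items()):
            times = [i * interval_s for i in range(len(samples))]
            series[f"worker_{worker_idx}"] = list(zip(times, samples))
        for backend in self.backends:
            backend.log_plot(name, series, step)


def log_vllm_metrics_sketch(metrics: VllmMetrics, interval_s, logger, step,
                            prefix="vllm_metrics"):
    # Emit one per-worker timeline plot for each collected metric.
    for metric_name, per_worker in metrics.items():
        logger.log_plot_per_worker_timeline_sketch(
            per_worker, interval_s, f"{prefix}/{metric_name}", step
        )


backend = RecordingBackend()
logger = SketchLogger([backend])
example_metrics = {
    "kv_cache_usage_perc": {0: [0.10, 0.35, 0.50], 1: [0.20, 0.40]},
    "generation_tokens": {0: [128, 256, 512], 1: [100, 300]},
}
log_vllm_metrics_sketch(example_metrics, interval_s=0.5, logger=logger, step=1)
```

In this sketch a real backend would turn each `series` entry into one line on a shared plot; the recording backend only makes the fan-out easy to inspect.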
Sequence Diagram
sequenceDiagram
participant GRPO as GRPO Training
participant VllmWorker as VllmAsyncGenerationWorker
participant MetricsUtil as metrics_utils
participant Logger as Logger
participant Wandb as W&B Backend
GRPO->>VllmWorker: Collect vLLM metrics
VllmWorker->>VllmWorker: Track kv_cache_usage_perc,<br/>num_preemptions,<br/>generation_tokens
VllmWorker->>GRPO: get_vllm_logger_metrics()
GRPO->>MetricsUtil: log_vllm_metrics_to_wandb()
MetricsUtil->>Logger: log_plot_per_worker_timeline_metrics()
Logger->>Logger: Construct per-worker<br/>time-series plots
Logger->>Wandb: log_plot()
Wandb->>Wandb: Visualize metrics
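The worker-side half of the diagram (tracking the three metrics until the training loop asks for them) might look roughly like the sketch below. It assumes the worker periodically receives a snapshot dictionary with the three metric names; the `WorkerMetricsTimeline` class, its methods, and the snapshot format are hypothetical and are not taken from the PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Number = Union[int, float]


@dataclass
class WorkerMetricsTimeline:
    """Hypothetical per-worker accumulator for the three metrics in the diagram."""

    kv_cache_usage_perc: List[float] = field(default_factory=list)
    num_preemptions: List[int] = field(default_factory=list)
    generation_tokens: List[int] = field(default_factory=list)

    def record(self, snapshot: Dict[str, Number]) -> None:
        # Append one sample per metric, cast to the types listed in the summary.
        self.kv_cache_usage_perc.append(float(snapshot["kv_cache_usage_perc"]))
        self.num_preemptions.append(int(snapshot["num_preemptions"]))
        self.generation_tokens.append(int(snapshot["generation_tokens"]))

    def as_dict(self) -> Dict[str, list]:
        # Return copies so callers cannot mutate the accumulated timeline.
        return {
            "kv_cache_usage_perc": list(self.kv_cache_usage_perc),
            "num_preemptions": list(self.num_preemptions),
            "generation_tokens": list(self.generation_tokens),
        }

    def clear(self) -> None:
        # Reset between training steps so each plot covers a single step.
        self.kv_cache_usage_perc.clear()
        self.num_preemptions.clear()
        self.generation_tokens.clear()


timeline = WorkerMetricsTimeline()
timeline.record({"kv_cache_usage_perc": 0.25, "num_preemptions": 0, "generation_tokens": 64})
timeline.record({"kv_cache_usage_perc": 0.5, "num_preemptions": 1, "generation_tokens": 192})
collected = timeline.as_dict()
timeline.clear()
```

Clearing after each read is one possible design choice; keeping a running history across steps would also work, at the cost of plots that grow over the whole run.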
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~12 minutes
- The changes follow consistent, repetitive patterns—the same three metrics are added across multiple files in similar ways
- Most edits are additive (new functions, new attributes, new logging calls) with no complex logic changes
- The new utility functions are straightforward with predictable control flow
- Review focus: verify consistency of metric naming and types across all collection and logging points, confirm proper integration in both sync and async training paths
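A reviewer could make the naming and type check from the last bullet mechanical with a small helper such as the one sketched here. The expected names and types are the ones listed in the Changes table; the payload layout (metric name, then worker index, then samples) and the `find_metric_problems` helper are assumptions for illustration and do not exist in the PR.

```python
from typing import Dict, List

# Names and types as listed in the Changes table.
EXPECTED_TYPES = {
    "kv_cache_usage_perc": float,
    "num_preemptions": int,
    "generation_tokens": int,
}


def find_metric_problems(per_worker_metrics: Dict[str, Dict[int, list]]) -> List[str]:
    """Return human-readable problems; an empty list means the payload is consistent."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in per_worker_metrics:
            problems.append(f"missing metric: {name}")
            continue
        for worker_idx, samples in per_worker_metrics[name].items():
            for value in samples:
                # bool is a subclass of int, so reject it explicitly.
                if isinstance(value, bool) or not isinstance(value, expected):
                    problems.append(
                        f"{name}[worker {worker_idx}]: expected "
                        f"{expected.__name__}, got {type(value).__name__}"
                    )
                    break  # one report per worker is enough
    return problems


good_payload = {
    "kv_cache_usage_perc": {0: [0.1, 0.2]},
    "num_preemptions": {0: [0, 1]},
    "generation_tokens": {0: [10, 20]},
}
bad_payload = {
    "kv_cache_usage_perc": {0: [0.1]},
    "generation_tokens": {0: [10.0]},
}
good_problems = find_metric_problems(good_payload)
bad_problems = find_metric_problems(bad_payload)
```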
Possibly related PRs
- PR #1534: Continues and extends the same per-worker vLLM metrics collection and visualization work, including similar changes to vLLM worker metrics and logging utilities integration.
Suggested labels
enhancement
Suggested reviewers
- terrykong
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Test Results For Major Changes | ⚠️ Warning | PR introduces major new features (new public methods, functions, and attributes for vLLM metrics plotting) but lacks test results or testing documentation. | Add test results and testing information documenting unit tests, integration tests, and verification that no regressions were introduced. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: adding functionality to plot vLLM internal metrics to wandb, which aligns with all the file modifications across the codebase. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
Hi @terrykong, this is the PR that enables the wandb plots I shared today. Could you please review it?