feat(metrics): add task completion latency tracking and reporting
# Add Transfer Task Latency Distribution Metrics

## Summary
This PR adds histogram-based latency distribution tracking for transfer tasks in the Transfer Engine metrics system.
## Motivation
Understanding task completion latency distribution is crucial for:
- Performance tuning and bottleneck identification
- SLA monitoring and analysis
- Detecting outliers and tail latencies
Previously, only throughput metrics were available. This enhancement provides detailed visibility into task-level performance.
## Changes

### Core Implementation

Modified Files:
- mooncake-transfer-engine/include/transfer_engine.h
- mooncake-transfer-engine/src/transfer_engine.cpp
Key Components:

- **Latency Tracking**
  - Record task start time on first `getTransferStatus()` call
  - Calculate and record completion latency when the task completes
  - Use `ylt::metric::histogram_t` for efficient distribution tracking
- **Histogram Buckets**
  - Fine-grained buckets (10μs - 10s) covering sub-millisecond to multi-second latencies
  - 17 boundary values creating 18 buckets optimized for high-performance scenarios
- **Interval-based Reporting**
  - Snapshot mechanism for computing per-interval statistics
  - Avoids cumulative drift; shows the actual distribution within each reporting period
  - Aligns with Prometheus-style metrics design (monotonic counters + rate calculation)
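The tracking flow described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual Mooncake code: `TaskLatencyTracker`, `onFirstPoll`, and the hand-rolled bucket vector are illustrative stand-ins for the real `transfer_engine.cpp` logic and `ylt::metric::histogram_t`.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Illustrative sketch of the flow: record start time on first poll,
// observe the elapsed latency into a histogram bucket on completion.
class TaskLatencyTracker {
 public:
  using Clock = std::chrono::steady_clock;

  explicit TaskLatencyTracker(std::vector<double> boundaries_us)
      : boundaries_us_(std::move(boundaries_us)),
        bucket_counts_(boundaries_us_.size() + 1, 0) {}

  // Called on the first getTransferStatus() for a task: record start time.
  void onFirstPoll(uint64_t task_id) {
    std::lock_guard<std::mutex> lk(mu_);
    start_times_.emplace(task_id, Clock::now());
  }

  // Called when the task completes: observe latency, then drop the
  // timestamp (the "automatic cleanup" mentioned in the PR).
  void onComplete(uint64_t task_id) {
    std::lock_guard<std::mutex> lk(mu_);
    auto it = start_times_.find(task_id);
    if (it == start_times_.end()) return;
    double us = std::chrono::duration<double, std::micro>(
                    Clock::now() - it->second).count();
    ++bucket_counts_[bucketFor(us)];
    start_times_.erase(it);  // cleanup completed task timestamp
  }

  // Bucket index: first boundary >= value, else the overflow bucket.
  std::size_t bucketFor(double us) const {
    for (std::size_t i = 0; i < boundaries_us_.size(); ++i)
      if (us <= boundaries_us_[i]) return i;
    return boundaries_us_.size();
  }

  const std::vector<uint64_t>& counts() const { return bucket_counts_; }

 private:
  std::vector<double> boundaries_us_;
  std::vector<uint64_t> bucket_counts_;
  std::unordered_map<uint64_t, Clock::time_point> start_times_;
  std::mutex mu_;
};
```

Note how the erase on completion bounds the size of the timestamp map by the number of in-flight tasks.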
### Output Format

```
[Metrics] Transfer Engine Stats (over last 5s):
  Throughput: 22480.90 MB/s | Latency Distribution (count=56215): 0-10μs:0.2%, 10-20μs:1.6%, 20-50μs:96.5%, 50-100μs:0.9%, 100-200μs:0.8%
```
Features:
- Only shows buckets with ≥0.1% to reduce noise
- Displays task count and percentage distribution
- Single-line format for easy parsing and monitoring
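A formatter with the properties listed above might look like the sketch below. It is a hypothetical reconstruction (the function name and signature are not from the PR), showing the ≥0.1% noise filter and the single-line percentage layout.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Given per-interval bucket labels and counts, emit one line showing the
// total task count and only the buckets holding >= 0.1% of tasks.
std::string formatLatencyLine(const std::vector<std::string>& labels,
                              const std::vector<uint64_t>& counts) {
  uint64_t total = 0;
  for (uint64_t c : counts) total += c;
  std::ostringstream out;
  out << "Latency Distribution (count=" << total << "):";
  if (total == 0) return out.str();
  bool first = true;
  for (std::size_t i = 0; i < labels.size() && i < counts.size(); ++i) {
    double pct = 100.0 * static_cast<double>(counts[i]) / total;
    if (pct < 0.1) continue;  // suppress noise buckets
    out << (first ? " " : ", ") << labels[i] << ":";
    out.precision(1);
    out << std::fixed << pct << "%";
    first = false;
  }
  return out.str();
}
```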
## Usage

```bash
# Enable metrics (required)
export MC_TE_METRIC=1

# Optional: customize interval (default: 5s)
export MC_TE_METRIC_INTERVAL_SECONDS=10
```
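One plausible way the engine could consume these two variables is sketched below; the helper names are hypothetical and the actual parsing in `transfer_engine.cpp` may differ (e.g. in how non-"1" values or malformed intervals are treated).

```cpp
#include <cstdlib>
#include <string>

// Metrics are opt-in: enabled only when MC_TE_METRIC is set to "1".
bool metricsEnabled() {
  const char* v = std::getenv("MC_TE_METRIC");
  return v != nullptr && std::string(v) == "1";
}

// Reporting interval in seconds, defaulting to 5 when the variable is
// unset or not a positive integer.
int metricIntervalSeconds() {
  const char* v = std::getenv("MC_TE_METRIC_INTERVAL_SECONDS");
  if (v == nullptr) return 5;
  int n = std::atoi(v);
  return n > 0 ? n : 5;  // fall back to the default on bad input
}
```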
## Design Decisions

- **Histogram over avg/max**: Distribution provides the complete picture; avg can be skewed by tail latencies, max is unstable
- **Snapshot mechanism**: `histogram_t` doesn't support reset; snapshot-based interval calculation is standard practice (Prometheus)
- **Bucket granularity**: Focused on sub-ms to catch high-performance variations, with coarser bins for the tail
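The snapshot mechanism can be illustrated with a minimal sketch (class and method names are hypothetical): since the histogram's bucket counters are cumulative and cannot be reset, each reporting tick diffs the current counts against the previous snapshot to recover the within-interval distribution, analogous to Prometheus `rate()` over monotonic counters.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Keeps the previous cumulative bucket counts and returns per-interval
// deltas on each reporting tick.
class IntervalSnapshot {
 public:
  std::vector<uint64_t> tick(const std::vector<uint64_t>& cumulative) {
    if (prev_.size() != cumulative.size())
      prev_.assign(cumulative.size(), 0);  // first tick: diff against zero
    std::vector<uint64_t> delta(cumulative.size());
    for (std::size_t i = 0; i < cumulative.size(); ++i)
      delta[i] = cumulative[i] - prev_[i];  // counters are monotonic
    prev_ = cumulative;
    return delta;
  }

 private:
  std::vector<uint64_t> prev_;
};
```

Because the underlying counters only ever grow, the delta is always well-defined and no reset of the histogram is needed.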
## Performance Impact

- Minimal overhead: only enabled with the `WITH_METRICS` compile flag
- Lock-free histogram updates
- Automatic cleanup of completed task timestamps
## Testing Instructions

```bash
# Build with metrics
cmake -DWITH_METRICS=ON ..
make

# Run and observe output
export MC_TE_METRIC=1
./your_test_binary
```
Expected output will show latency distribution every 5 seconds alongside throughput metrics.
## Type of Change
- [ ] Bug fix
- [x] New feature
- [x] Transfer Engine
- [ ] Mooncake Store
- [ ] Mooncake EP
- [ ] Integration
- [ ] P2P Store
- [ ] Python Wheel
- [ ] Breaking change
- [ ] CI/CD
- [ ] Documentation update
- [ ] Other
## How Has This Been Tested?

- ✅ Compiles with `WITH_METRICS` enabled/disabled
- ✅ No memory leaks (completed tasks are cleaned up)
- ✅ Thread-safe concurrent task tracking
- ✅ Accurate distribution reporting under load
## Checklist
- [x] I have performed a self-review of my own code.
- [x] I have updated the documentation.
- [x] I have added tests to prove my changes are effective.