feat(metrics): add task completion latency tracking and reporting
# Add Transfer Task Latency Distribution Metrics

## Summary
This PR adds histogram-based latency distribution tracking for transfer tasks in the Transfer Engine metrics system.
## Motivation
Understanding task completion latency distribution is crucial for:
- Performance tuning and bottleneck identification
- SLA monitoring and analysis
- Detecting outliers and tail latencies
Previously, only throughput metrics were available. This enhancement provides detailed visibility into task-level performance.
## Changes

### Core Implementation

Modified Files:
- mooncake-transfer-engine/include/transfer_engine.h
- mooncake-transfer-engine/src/transfer_engine.cpp
Key Components:

- **Latency Tracking**
  - Record task start time on first `getTransferStatus()` call
  - Calculate and record completion latency when the task completes
  - Use `ylt::metric::histogram_t` for efficient distribution tracking
- **Histogram Buckets**
  - Fine-grained buckets (10μs - 10s) covering sub-millisecond to multi-second latencies
  - 17 boundary values creating 18 buckets optimized for high-performance scenarios
- **Interval-based Reporting**
  - Snapshot mechanism for computing per-interval statistics
  - Avoids cumulative drift; shows the actual distribution within each reporting period
  - Aligns with Prometheus-style metrics design (monotonic counters + rate calculation)
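The tracking flow described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual Mooncake code: `TaskLatencyTracker`, `onFirstPoll`, and the hand-rolled bucket vector are illustrative stand-ins for the real `transfer_engine.cpp` logic and `ylt::metric::histogram_t`.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Illustrative sketch of the flow: record start time on first poll,
// observe the elapsed latency into a histogram bucket on completion.
class TaskLatencyTracker {
 public:
  using Clock = std::chrono::steady_clock;

  explicit TaskLatencyTracker(std::vector<double> boundaries_us)
      : boundaries_us_(std::move(boundaries_us)),
        bucket_counts_(boundaries_us_.size() + 1, 0) {}

  // Called on the first getTransferStatus() for a task: record start time.
  void onFirstPoll(uint64_t task_id) {
    std::lock_guard<std::mutex> lk(mu_);
    start_times_.emplace(task_id, Clock::now());
  }

  // Called when the task completes: observe latency, then drop the
  // timestamp (the "automatic cleanup" mentioned in the PR).
  void onComplete(uint64_t task_id) {
    std::lock_guard<std::mutex> lk(mu_);
    auto it = start_times_.find(task_id);
    if (it == start_times_.end()) return;
    double us = std::chrono::duration<double, std::micro>(
                    Clock::now() - it->second).count();
    ++bucket_counts_[bucketFor(us)];
    start_times_.erase(it);  // cleanup completed task timestamp
  }

  // Bucket index: first boundary >= value, else the overflow bucket.
  std::size_t bucketFor(double us) const {
    for (std::size_t i = 0; i < boundaries_us_.size(); ++i)
      if (us <= boundaries_us_[i]) return i;
    return boundaries_us_.size();
  }

  const std::vector<uint64_t>& counts() const { return bucket_counts_; }

 private:
  std::vector<double> boundaries_us_;
  std::vector<uint64_t> bucket_counts_;
  std::unordered_map<uint64_t, Clock::time_point> start_times_;
  std::mutex mu_;
};
```

Note how the erase on completion bounds the size of the timestamp map by the number of in-flight tasks.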
### Output Format

```
[Metrics] Transfer Engine Stats (over last 5s):
  Throughput: 22480.90 MB/s | Latency Distribution (count=56215): 0-10μs:0.2%, 10-20μs:1.6%, 20-50μs:96.5%, 50-100μs:0.9%, 100-200μs:0.8%
```
Features:
- Only shows buckets with ≥0.1% to reduce noise
- Displays task count and percentage distribution
- Single-line format for easy parsing and monitoring
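A formatter with the properties listed above might look like the sketch below. It is a hypothetical reconstruction (the function name and signature are not from the PR), showing the ≥0.1% noise filter and the single-line percentage layout.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Given per-interval bucket labels and counts, emit one line showing the
// total task count and only the buckets holding >= 0.1% of tasks.
std::string formatLatencyLine(const std::vector<std::string>& labels,
                              const std::vector<uint64_t>& counts) {
  uint64_t total = 0;
  for (uint64_t c : counts) total += c;
  std::ostringstream out;
  out << "Latency Distribution (count=" << total << "):";
  if (total == 0) return out.str();
  bool first = true;
  for (std::size_t i = 0; i < labels.size() && i < counts.size(); ++i) {
    double pct = 100.0 * static_cast<double>(counts[i]) / total;
    if (pct < 0.1) continue;  // suppress noise buckets
    out << (first ? " " : ", ") << labels[i] << ":";
    out.precision(1);
    out << std::fixed << pct << "%";
    first = false;
  }
  return out.str();
}
```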
## Usage

```bash
# Enable metrics (required)
export MC_TE_METRIC=1

# Optional: customize interval (default: 5s)
export MC_TE_METRIC_INTERVAL_SECONDS=10
```
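One plausible way the engine could consume these two variables is sketched below; the helper names are hypothetical and the actual parsing in `transfer_engine.cpp` may differ (e.g. in how non-"1" values or malformed intervals are treated).

```cpp
#include <cstdlib>
#include <string>

// Metrics are opt-in: enabled only when MC_TE_METRIC is set to "1".
bool metricsEnabled() {
  const char* v = std::getenv("MC_TE_METRIC");
  return v != nullptr && std::string(v) == "1";
}

// Reporting interval in seconds, defaulting to 5 when the variable is
// unset or not a positive integer.
int metricIntervalSeconds() {
  const char* v = std::getenv("MC_TE_METRIC_INTERVAL_SECONDS");
  if (v == nullptr) return 5;
  int n = std::atoi(v);
  return n > 0 ? n : 5;  // fall back to the default on bad input
}
```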
## Design Decisions

- **Histogram over avg/max**: Distribution provides the complete picture; avg can be skewed by tail latencies, max is unstable
- **Snapshot mechanism**: `histogram_t` doesn't support reset; snapshot-based interval calculation is standard practice (Prometheus)
- **Bucket granularity**: Focused on sub-ms to catch high-performance variations, with coarser bins for the tail
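The snapshot mechanism can be illustrated with a minimal sketch (class and method names are hypothetical): since the histogram's bucket counters are cumulative and cannot be reset, each reporting tick diffs the current counts against the previous snapshot to recover the within-interval distribution, analogous to Prometheus `rate()` over monotonic counters.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Keeps the previous cumulative bucket counts and returns per-interval
// deltas on each reporting tick.
class IntervalSnapshot {
 public:
  std::vector<uint64_t> tick(const std::vector<uint64_t>& cumulative) {
    if (prev_.size() != cumulative.size())
      prev_.assign(cumulative.size(), 0);  // first tick: diff against zero
    std::vector<uint64_t> delta(cumulative.size());
    for (std::size_t i = 0; i < cumulative.size(); ++i)
      delta[i] = cumulative[i] - prev_[i];  // counters are monotonic
    prev_ = cumulative;
    return delta;
  }

 private:
  std::vector<uint64_t> prev_;
};
```

Because the underlying counters only ever grow, the delta is always well-defined and no reset of the histogram is needed.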
## Performance Impact

- Minimal overhead: only enabled with the `WITH_METRICS` compile flag
- Lock-free histogram updates
- Automatic cleanup of completed task timestamps
## Testing Instructions

```bash
# Build with metrics
cmake -DWITH_METRICS=ON ..
make

# Run and observe output
export MC_TE_METRIC=1
./your_test_binary
```
Expected output will show latency distribution every 5 seconds alongside throughput metrics.
## Type of Change
- [ ] Bug fix
- [x] New feature
- [x] Transfer Engine
- [ ] Mooncake Store
- [ ] Mooncake EP
- [ ] Integration
- [ ] P2P Store
- [ ] Python Wheel
- [ ] Breaking change
- [ ] CI/CD
- [ ] Documentation update
- [ ] Other
## How Has This Been Tested?

- ✅ Compiles with `WITH_METRICS` enabled/disabled
- ✅ No memory leaks (completed tasks are cleaned up)
- ✅ Thread-safe concurrent task tracking
- ✅ Accurate distribution reporting under load
## Checklist
- [x] I have performed a self-review of my own code.
- [x] I have updated the documentation.
- [x] I have added tests to prove my changes are effective.