
feat: add PCIe Relaxed Ordering (RO) support and RDMA traffic class (TC) control to improve ordering flexibility and queue-level QoS

Open 1998zxn opened this issue 2 weeks ago • 6 comments

Description

This PR optimizes Mooncake’s performance in the 2P1D scenario by introducing two main improvements:

  1. Relaxed Ordering (RO) support to improve PCIe out-of-order handling
  2. RDMA queue selection via environment variable to improve queue-level QoS under burst traffic

These changes effectively reduce KV Cache transfer time, thereby lowering overall TTFT (Time-To-First-Token) latency.

Background

In our deployment scenario using SGLang DeepSeek v3 with 2P1D configuration:

  • P nodes use tp8, pp2 parallel strategy
  • D nodes use tp8 parallel strategy

We observed that KV Cache transfers could account for up to 23% of the total TTFT. The reasons are:

  1. Mooncake does not enable Relaxed Ordering by default, so PCIe transactions fall back to strict ordering and out-of-order completion efficiency is lost. This also addresses the issue discussed in #39.
  2. In 2P1D burst traffic scenarios, RDMA queue scheduling can cause congestion, affecting transfer performance.

By enabling these two features, we reduced TTFT from 650ms to ~585ms, and KV Cache transfer time dropped to 15% of TTFT, showing significant performance improvements.
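As a rough sanity check on the figures above (assuming the quoted percentages apply to the quoted TTFT values), the implied KV Cache transfer times can be back-computed:

```python
# Reported figures from the PR description.
ttft_before_ms = 650
ttft_after_ms = 585
kv_share_before = 0.23   # KV Cache transfer share of TTFT before
kv_share_after = 0.15    # ... and after the change

kv_before_ms = ttft_before_ms * kv_share_before  # ~149.5 ms
kv_after_ms = ttft_after_ms * kv_share_after     # ~87.8 ms

print(f"KV transfer before: {kv_before_ms:.1f} ms")
print(f"KV transfer after:  {kv_after_ms:.1f} ms")
print(f"Transfer time saved: {kv_before_ms - kv_after_ms:.1f} ms")
print(f"TTFT reduction:      {ttft_before_ms - ttft_after_ms} ms")
```

So roughly 62 ms of the 65 ms TTFT reduction is attributable to faster KV Cache transfers, consistent with the claim that these transfers were the dominant optimization target.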

Implementation Details

  1. Relaxed Ordering (RO) Support
  • Detects whether the hardware supports RO and enables it when available
  • Added logging to indicate whether RO is enabled at runtime, improving observability
  2. RDMA Queue Selection and QoS
  • Adds an environment variable to select RDMA queues
  • Added logging to show traffic-class details, helping verify correct behavior

Fully backward compatible; scenarios not using these new features remain unaffected with no regression risk.
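The detect-and-fall-back behavior described in item 1 can be sketched as follows. This is a Python simulation of the flag handling, not the PR's actual C++ code; the constants mirror rdma-core's `<infiniband/verbs.h>`, where `IBV_ACCESS_RELAXED_ORDERING` sits in the optional access-flag range and `ibv_reg_mr_iova2` silently drops optional flags the device cannot honor:

```python
# Illustrative constants mirroring rdma-core's access-flag scheme;
# the real values live in <infiniband/verbs.h>.
IBV_ACCESS_LOCAL_WRITE = 1 << 0
IBV_ACCESS_REMOTE_WRITE = 1 << 1
IBV_ACCESS_REMOTE_READ = 1 << 2
IBV_ACCESS_RELAXED_ORDERING = 1 << 20  # optional-flag range starts here

def effective_access_flags(requested: int, device_supports_ro: bool) -> int:
    """Drop the RO bit when the device does not support it, so memory
    registration succeeds (graceful fallback) instead of failing."""
    if requested & IBV_ACCESS_RELAXED_ORDERING and not device_supports_ro:
        requested &= ~IBV_ACCESS_RELAXED_ORDERING
    return requested

base = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ
with_ro = base | IBV_ACCESS_RELAXED_ORDERING

print(effective_access_flags(with_ro, device_supports_ro=True) == with_ro)   # True
print(effective_access_flags(with_ro, device_supports_ro=False) == base)     # True
```

Because unsupported optional flags are masked rather than rejected, callers can request RO unconditionally, which is what makes the feature backward compatible on hardware without RO support.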

Type of Change

  • Types
    • [ ] Bug fix
    • [*] New feature
      • [*] Transfer Engine
      • [ ] Mooncake Store
      • [ ] Mooncake EP
      • [ ] Integration
      • [ ] P2P Store
      • [ ] Python Wheel
    • [ ] Breaking change
    • [ ] CI/CD
    • [ ] Documentation update
    • [ ] Other

Checklist

  • [*] I have performed a self-review of my own code.
  • [ ] I have updated the documentation.
  • [ ] I have added tests to prove my changes are effective.

1998zxn avatar Nov 18 '25 11:11 1998zxn

Summary of Changes

Hello @1998zxn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes Mooncake's performance, particularly in 2P1D deployment scenarios, by targeting bottlenecks in KV Cache transfers. It achieves this through two key enhancements: enabling PCIe Relaxed Ordering to improve data transfer flexibility and implementing RDMA traffic class control for better queue-level QoS under high-traffic conditions. These changes are designed to collectively reduce Time-To-First-Token (TTFT) latency by optimizing the underlying data transport mechanisms.

Highlights

  • PCIe Relaxed Ordering (RO) Support: This PR introduces support for PCIe Relaxed Ordering, which dynamically detects hardware capabilities and enables RO to improve PCIe out-of-order packet handling efficiency. This optimization is crucial for reducing KV Cache transfer times.
  • RDMA Traffic Class (TC) Control: A new mechanism has been added to allow users to control RDMA queue selection via the MoonCake_TC_DEFAULT environment variable. This enables queue-level Quality of Service (QoS) to mitigate congestion during burst traffic scenarios, further enhancing transfer performance.

gemini-code-assist[bot] avatar Nov 18 '25 11:11 gemini-code-assist[bot]

@1998zxn Is there any performance result about how both options affect TTFT?

alogfans avatar Nov 19 '25 12:11 alogfans

When IBV_ACCESS_RELAXED_ORDERING is set, RDMA write-after-write message ordering is no longer guaranteed; I'm not sure whether this impacts us.

staryxchen avatar Nov 19 '25 15:11 staryxchen

> When IBV_ACCESS_RELAXED_ORDERING is set, RDMA write-after-write message ordering is no longer guaranteed; I'm not sure whether this impacts us.

@staryxchen Based on our observations, many vendors forcibly enable force_relax at the NIC FW level to improve performance. From our current benchmarking results, it appears that this has no negative impact on Mooncake, while the performance gains can be quite significant. https://docs.nvidia.com/grace-perf-tuning-guide/optimizing-io.html

And I'd suggest adding a configuration option that allows us to explicitly enable or disable this relaxed-ordering behavior, or to let it be enabled automatically depending on the environment.

Aleda avatar Nov 20 '25 09:11 Aleda

> > When IBV_ACCESS_RELAXED_ORDERING is set, RDMA write-after-write message ordering is no longer guaranteed; I'm not sure whether this impacts us.
>
> @staryxchen Based on our observations, many vendors forcibly enable force_relax at the NIC FW level to improve performance. From our current benchmarking results, it appears that this has no negative impact on Mooncake, while the performance gains can be quite significant. https://docs.nvidia.com/grace-perf-tuning-guide/optimizing-io.html
>
> And I'd suggest adding a configuration option that allows us to explicitly enable or disable this relaxed-ordering behavior, or to let it be enabled automatically depending on the environment.

A configuration option is better, and I'd prefer it disabled by default. Users who are not concerned with these settings will then be unaffected, while performance-focused users can enable it as needed (they are more likely to fully grasp the implications of this feature).

staryxchen avatar Nov 20 '25 10:11 staryxchen

@alogfans Could you give some suggestions?

stmatengss avatar Dec 02 '25 15:12 stmatengss

> @1998zxn Is there any performance result about how both options affect TTFT?

In the scenario where RO is disabled and only TC is adjusted, we configured different congestion control policies for different TCs. With ECN-based congestion control, the worst observed TTFT was around 650 ms; with PFC, TTFT could be reduced to approximately 610 ms.

As for the scenario where RO is enabled along with TC adjustments, we have not yet run more detailed tests. However, we expect that enabling RO with PFC congestion control should perform no worse than enabling RO with other congestion control strategies.

1998zxn avatar Dec 03 '25 03:12 1998zxn