[Cherry-Pick][RL] R3 Support RDMA Store(#5467)

Open gongshaotian opened this issue 2 weeks ago • 2 comments

Motivation

:bulb: If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Performance comparison between RoutingStoreLocal and RoutingStoreRDMA:

1. paddle.load overhead:
   Number of successfully `get` files: 37/37
   Mean overhead: 0.0395 s
   Min overhead: 0.0127 s
   Max overhead: 0.1590 s
   Total overhead: 1.4610 s

2. paddle.save overhead:
   Number of successfully `save` files: 37/37
   Mean overhead: 0.0872 s
   Min overhead: 0.0692 s
   Max overhead: 0.1226 s
   Total overhead: 3.2273 s

3. p2pstore.put overhead:
   Number of successfully `put` files: 37/37
   Mean overhead: 0.0073 s
   Min overhead: 0.0063 s
   Max overhead: 0.0076 s
   Total overhead: 0.2691 s

4. p2pstore.get overhead:
   Number of successfully `get` files: 37/37
   Mean overhead: 0.0027 s
   Min overhead: 0.0027 s
   Max overhead: 0.0029 s
   Total overhead: 0.1008 s
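
To read the four measurements side by side, the reported means and totals can simply be divided. The sketch below assumes that `paddle.save`/`paddle.load` represent the RoutingStoreLocal path and `p2pstore.put`/`p2pstore.get` the RoutingStoreRDMA path; the PR description does not state this pairing explicitly. The `OVERHEAD` dictionary and the `speedup` helper are illustrative names, not part of FastDeploy.

```python
# Figures copied from the measurements above (seconds, 37 files each).
# ASSUMPTION: paddle.save/load correspond to RoutingStoreLocal and
# p2pstore.put/get to RoutingStoreRDMA; the PR does not spell this out.
OVERHEAD = {
    "paddle.load": {"mean": 0.0395, "total": 1.4610},
    "paddle.save": {"mean": 0.0872, "total": 3.2273},
    "p2pstore.put": {"mean": 0.0073, "total": 0.2691},
    "p2pstore.get": {"mean": 0.0027, "total": 0.1008},
}


def speedup(baseline: str, candidate: str, key: str = "mean") -> float:
    """Return how many times smaller the candidate's overhead is."""
    return OVERHEAD[baseline][key] / OVERHEAD[candidate][key]


if __name__ == "__main__":
    pairs = (("paddle.save", "p2pstore.put"), ("paddle.load", "p2pstore.get"))
    for base, cand in pairs:
        print(
            f"{base} -> {cand}: "
            f"{speedup(base, cand):.1f}x (mean), "
            f"{speedup(base, cand, 'total'):.1f}x (total)"
        )
```

Under that pairing, the write path comes out roughly 12x cheaper and the read path roughly 14.5x cheaper, on both the mean and the total figures.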

develop PR: https://github.com/PaddlePaddle/FastDeploy/pull/5467

exp_0908 PR: https://github.com/PaddlePaddle/FastDeploy/pull/5454

Modifications

Add RoutingStoreRDMA, which uses P2P communication to transmit routing data.

  • Routing data is stored in the WorkerProcess that hosts the RoutingStoreRDMA instance and is not actively released.
  • The p2pstore dependency library and `RoutingStoreRDMA` can only be used in PaddlePaddle RLHF.
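
The lifetime note in the first bullet can be pictured with a small stand-in. This is a sketch only: `RoutingStoreSketch`, `InProcessStoreSketch`, and `resident_bytes` are hypothetical names invented for illustration, a plain dictionary replaces the real p2pstore transport, and nothing here reflects the actual `RoutingStoreRDMA` implementation. It only shows what "stored in the process and not actively released" means for memory.

```python
from abc import ABC, abstractmethod
from typing import Dict


class RoutingStoreSketch(ABC):
    """Hypothetical put/get interface; NOT FastDeploy's actual class."""

    @abstractmethod
    def put(self, key: str, routing: bytes) -> None:
        ...

    @abstractmethod
    def get(self, key: str) -> bytes:
        ...


class InProcessStoreSketch(RoutingStoreSketch):
    """Stand-in for an RDMA-backed store (assumption, for illustration).

    Entries live in the owning process's memory and are never evicted,
    mirroring the "not actively released" behaviour described above.
    """

    def __init__(self) -> None:
        self._entries: Dict[str, bytes] = {}

    def put(self, key: str, routing: bytes) -> None:
        self._entries[key] = routing

    def get(self, key: str) -> bytes:
        return self._entries[key]

    def resident_bytes(self) -> int:
        """Total payload size currently held by this process."""
        return sum(len(value) for value in self._entries.values())


if __name__ == "__main__":
    store = InProcessStoreSketch()
    for step in range(3):
        store.put(f"request-{step}", bytes(1024))
        print(step, store.resident_bytes())
```

In this stand-in the resident size only grows as more entries are put, since reads do not remove anything. If the real store behaves the same way, memory in the worker would be bounded only by the process lifetime.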

Usage or Command

Add new parameters for RoutingReplayConfig:

--routing-replay-config '{"enable_routing_replay":true, "routing_store_type":"rdma", "rdma_store_server":"zmq://x.x.x.x:5765,x.x.x.x:5766"}' 
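
Hand-writing the JSON inside shell quotes is easy to get wrong, so one option is to generate the argument programmatically. The following is a minimal sketch using only the Python standard library; the three keys and the placeholder addresses are taken from the command above, while the `build_flag` helper is a hypothetical name. Which launch command consumes this flag is not shown in the PR, so the sketch stops at producing the argument string.

```python
import json
import shlex

# Keys and values copied from the PR description; the x.x.x.x addresses are
# placeholders there too and must be replaced with real endpoints.
routing_replay_config = {
    "enable_routing_replay": True,
    "routing_store_type": "rdma",
    "rdma_store_server": "zmq://x.x.x.x:5765,x.x.x.x:5766",
}


def build_flag(config: dict) -> str:
    """Serialize the config to compact JSON and quote it for a POSIX shell.

    Hypothetical helper for illustration; not part of FastDeploy.
    """
    payload = json.dumps(config, separators=(",", ":"))
    return f"--routing-replay-config {shlex.quote(payload)}"


if __name__ == "__main__":
    print(build_flag(routing_replay_config))
```

Round-tripping the result through `shlex.split` and `json.loads` is a quick way to confirm that the quoting survives the shell unchanged.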

Accuracy Tests

Checklist

  • [x] Add at least one tag in the PR title.
    • Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • [x] Format your code, run pre-commit before commit.
  • [x] Add unit tests. If there are no unit tests, please state the reason in this PR.
  • [x] Provide accuracy results.
  • [x] If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

gongshaotian, Dec 09 '25 13:12

Thanks for your contribution!

paddle-bot[bot], Dec 09 '25 13:12

Codecov Report

:x: Patch coverage is 32.43243% with 25 lines in your changes missing coverage. Please review.

:warning: Please upload report for BASE (release/2.4@53158b7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...model_executor/layers/moe/routing_indices_cache.py | 28.12% | 23 Missing :warning: |
| fastdeploy/worker/gpu_model_runner.py | 0.00% | 2 Missing :warning: |

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.4    #5468   +/-   ##
==============================================
  Coverage               ?   58.99%           
==============================================
  Files                  ?      327           
  Lines                  ?    40662           
  Branches               ?     6177           
==============================================
  Hits                   ?    23989           
  Misses                 ?    14811           
  Partials               ?     1862           

| Flag | Coverage Δ |
|---|---|
| GPU | 58.99% <32.43%> (?) |

codecov-commenter, Dec 09 '25 14:12