
[Feature] [PD Disaggregation] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports

Open liyonghua0910 opened this issue 3 weeks ago • 2 comments

Motivation

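This PR has two goals:

  1. Simplify the deployment flow and parameters for PD disaggregation, including port configuration, RDMA NIC detection, and related environment-variable setup, so that deployment reduces to: [start Router] → [start P & D instances] → [deployment complete]. The simplified launch arguments are also expected to apply to centralized deployment and multi-TP/DP deployment, and to stay compatible with launching multi-DP services through either APIServer or MultiAPIServer.
  2. Refactor how port numbers are configured, processed, and used in the current code. During argument initialization, if the user does not specify a port, an available port is found automatically; this must work for both the online service and the offline interface. In multi-DP deployments, the ports needed by each DP are split out at config initialization rather than ad hoc at the point of use. Configuration should be as static and read-only as possible, with minimal changes at runtime.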
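
As an illustration of the "find an available port automatically" goal, here is a minimal sketch of one common technique: binding to port 0 so the operating system picks an unused port. The function name `find_free_ports` is hypothetical and is not taken from this PR; the utility functions added in fastdeploy/utils.py may work differently.

```python
import socket
from contextlib import ExitStack


def find_free_ports(count: int, host: str = "127.0.0.1") -> list[int]:
    """Return `count` distinct TCP ports that were free when this was called.

    Hypothetical helper for illustration only, not the PR's implementation.
    """
    ports: list[int] = []
    # Keep every probe socket open until all ports are collected, so the OS
    # cannot hand out the same port twice within one call.
    with ExitStack() as stack:
        for _ in range(count):
            sock = stack.enter_context(
                socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            )
            sock.bind((host, 0))  # port 0 asks the OS for any unused port
            ports.append(sock.getsockname()[1])
    return ports


if __name__ == "__main__":
    # For example, two API server ports for a DP=2 deployment.
    print(find_free_ports(2))
```

Note that this approach has an inherent race: once the probe sockets are closed, another process could claim a port before the service binds it, so a real implementation would likely still need to handle bind failures at startup.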

Modifications

  • ArgsUtils
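    • Add the post_init_all_ports argument post-processing and validation step. When EngineArgs is initialized, it checks that the number of ports passed for each port type is correct. If the user passes no ports, the required number of ports is allocated automatically.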
  • FDConfig
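    • Remove the old, non-standard type-conversion logic. At config initialization, the parse_ports() method uniformly converts all *_port variables to list[int].
    • 🌟 Add local_* port variables: local_engine_worker_queue_port (int), local_rdma_comm_ports (list[int]), local_pd_comm_port (int), and local_cache_queue_port (int). In DP/EP scenarios they refer to the ports used by the current DP; non-DP/EP scenarios also use the local_* port variables for consistency.
    • 🌟 Add the postprocess_devices_and_ports config post-processing step. In FDConfig.postprocess, it assigns all local_* port variables by slicing out the ports needed by the current DP. Modifying the FDConfig object from modules other than config.py is discouraged.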
  • MultiAPIServer
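    • Add an argument check: if the user passes no ports, the required number of ports is allocated automatically;
    • If the user passes the wrong number of ports, the required number of ports is reallocated
    • Set the FD_ENABLE_MULTI_API_SERVER environment variable by default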
  • Cache
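    • Rename the engine_pid parameter accepted by CacheTransferManager & CacheMessager to ipc_suffix, which better matches its meaning
    • 🌟 Add initialization code to RDMACacheTransfer that automatically sets the KVCACHE_RDMA_NICS and KVCACHE_GDRCOPY_FLUSH_ENABLE environment variables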
  • CommonEngine
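    • Remove some of the port-list slicing and type-conversion logic (now handled in the config layer)
    • Change which llm_logger object is used: in DP scenarios, logs should be written to the _dprank*.log files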
  • Utils
    • Add utility functions for checking ports, parsing ports, and automatically finding available ports
  • Examples
    • Simplify the launch commands in start_v1_dp2.sh and start_v1_tp1.sh, add a new example start_v1_tp2.sh, and improve the helper functions in utils.sh
  • Others
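    • 🌟 Update every use of a port variable to the corresponding local_* port variable
    • Support launching multiple DPs through the api server
      • Remove the EP restriction from some of the DP logic
      • When DP0 creates DP1-N, deep-copy DP0's current cfg for each DP, so that config modifications made inside DP1-N do not interfere with one another
      • When ExpertService is initialized, rewrite part of the current DP's configuration according to local_data_parallel_id
      • 🌟 Have each DP create its own EngineCacheQueue service instead of all DPs sharing one, aligning with the EngineWorkerQueue architecture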
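
The port handling described above (normalize every *_port value to list[int], then slice out the current DP's share once at config time) can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: `parse_ports_sketch`, `LocalPorts`, and `slice_local_ports` are hypothetical names, and the layout assumed here (one queue/comm port per DP rank, `tp_size` RDMA ports per DP rank) is an assumption that the PR description does not state.

```python
from dataclasses import dataclass


def parse_ports_sketch(value) -> list[int]:
    """Normalize None, an int, a comma-separated string, or a list to list[int]."""
    if value is None:
        return []
    if isinstance(value, int):
        return [value]
    if isinstance(value, str):
        return [int(p) for p in value.split(",") if p.strip()]
    return [int(p) for p in value]


@dataclass(frozen=True)
class LocalPorts:
    """Ports owned by a single DP rank (hypothetical container)."""
    engine_worker_queue_port: int
    cache_queue_port: int
    pd_comm_port: int
    rdma_comm_ports: list[int]


def slice_local_ports(dp_rank, dp_size, tp_size, engine_worker_queue_port,
                      cache_queue_port, pd_comm_port, rdma_comm_ports) -> LocalPorts:
    """Slice out the ports for `dp_rank`, assuming one port per DP rank for the
    queue/comm ports and `tp_size` RDMA ports per DP rank (an assumption)."""
    per_dp = {
        "engine_worker_queue_port": parse_ports_sketch(engine_worker_queue_port),
        "cache_queue_port": parse_ports_sketch(cache_queue_port),
        "pd_comm_port": parse_ports_sketch(pd_comm_port),
    }
    for name, ports in per_dp.items():
        if len(ports) != dp_size:
            raise ValueError(f"{name}: expected {dp_size} ports, got {len(ports)}")
    rdma = parse_ports_sketch(rdma_comm_ports)
    if len(rdma) != dp_size * tp_size:
        raise ValueError(
            f"rdma_comm_ports: expected {dp_size * tp_size} ports, got {len(rdma)}"
        )
    return LocalPorts(
        engine_worker_queue_port=per_dp["engine_worker_queue_port"][dp_rank],
        cache_queue_port=per_dp["cache_queue_port"][dp_rank],
        pd_comm_port=per_dp["pd_comm_port"][dp_rank],
        rdma_comm_ports=rdma[dp_rank * tp_size:(dp_rank + 1) * tp_size],
    )


if __name__ == "__main__":
    # DP=2, TP=2: rank 1 takes the second entry of each per-DP list and
    # the second pair of RDMA ports.
    print(slice_local_ports(1, 2, 2, "8000,8001", [8100, 8101],
                            "8200,8201", "8300,8301,8302,8303"))
```

Doing the slicing once and storing the result in a frozen object reflects the stated goal of keeping configuration static and read-only, so that downstream modules read their local ports instead of re-deriving them.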

Usage or Command

bash examples/splitwise/start_v1_dp2.sh

Accuracy Tests

$ bash examples/splitwise/start_v1_dp2.sh
ROUTER_PORT: 8274
nohup: redirecting stderr to stdout
P_SERVER_PORTS: 8629,8631
nohup: redirecting stderr to stdout
-------- WAIT FOR HEALTH --------
Port 8629: [OK]   200
Port 8631: [OK]   200
All services are ready!    [38s]
---------------------------------
D_SERVER_PORTS: 8534,8535
nohup: redirecting stderr to stdout
-------- WAIT FOR HEALTH --------
Port 8534: [OK]   200
Port 8535: [OK]   200
All services are ready!    [37s]
---------------------------------
{"id":"chatcmpl-3dcecc9a-cf59-45a1-876a-e66535c6ef23","object":"chat.completion","created":1765535525,"model":"/root/paddlejob/workspace/env_run/output/models/ERNIE-4.5-0.3B-Paddle/","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?","multimodal_content":null,"reasoning_content":null,"audio_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"draft_logprobs":null,"prompt_logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":8,"total_tokens":18,"completion_tokens":10,"prompt_tokens_details":{"cached_tokens":0,"image_tokens":0,"video_tokens":0},"completion_tokens_details":{"reasoning_tokens":0,"image_tokens":0}}}

Checklist

  • [x] Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • [x] Format your code, run pre-commit before commit.
  • [x] Add unit tests. Please write the reason in this PR if no unit tests.
  • [x] Provide accuracy results.
  • [x] If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

liyonghua0910 avatar Dec 06 '25 05:12 liyonghua0910

Thanks for your contribution!

paddle-bot[bot] avatar Dec 06 '25 05:12 paddle-bot[bot]

Codecov Report

:x: Patch coverage is 68.51852% with 85 lines in your changes missing coverage. Please review.
:warning: Please upload report for BASE (develop@c9b47f9).

Files with missing lines | Patch % | Lines
fastdeploy/entrypoints/openai/multi_api_server.py | 53.22% | 14 Missing and 15 partials :warning:
fastdeploy/engine/common_engine.py | 58.33% | 18 Missing and 2 partials :warning:
fastdeploy/utils.py | 64.51% | 5 Missing and 6 partials :warning:
...he_manager/transfer_factory/rdma_cache_transfer.py | 77.77% | 5 Missing and 3 partials :warning:
fastdeploy/engine/args_utils.py | 81.08% | 4 Missing and 3 partials :warning:
fastdeploy/engine/engine.py | 42.85% | 2 Missing and 2 partials :warning:
fastdeploy/engine/expert_service.py | 83.33% | 0 Missing and 2 partials :warning:
fastdeploy/entrypoints/openai/api_server.py | 0.00% | 2 Missing :warning:
fastdeploy/worker/worker_process.py | 33.33% | 1 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5415   +/-   ##
==========================================
  Coverage           ?   62.08%           
==========================================
  Files              ?      329           
  Lines              ?    41287           
  Branches           ?     6295           
==========================================
  Hits               ?    25633           
  Misses             ?    13701           
  Partials           ?     1953           
Flag Coverage Δ
GPU 62.08% <68.51%> (?)

codecov-commenter avatar Dec 06 '25 06:12 codecov-commenter