FastDeploy [Feature] [PD Disaggregation] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports

Motivation

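This PR has two goals:

- Simplify the deployment flow and parameters for PD disaggregation, including port configuration, RDMA NIC detection, and related environment variable setup, so that deployment reduces to [start Router] → [start P & D instances] → [deployment complete]. The simplified launch arguments are also intended to apply to centralized deployment and multi-TP/DP deployment, and to stay compatible with launching multi-DP services through either APIServer or MultiAPIServer.
- Refactor how port-related configuration is processed and used. During argument initialization, if the user does not specify a port, an available one is found automatically; this must work for both the online service and the offline interface. In multi-DP deployments, the ports each DP needs are split out at config initialization rather than on the fly at the point of use. Configuration should be as static and read-only as possible, with fewer runtime modifications.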

Modifications

ArgsUtils
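- Add a post_init_all_ports argument post-processing and validation step. When EngineArgs is initialized, it checks that the number of ports the user passed for each port type is correct. If the user passed no ports, the required number of ports is allocated automatically.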
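As a rough illustration of the validate-or-allocate behavior described above, here is a minimal sketch. The helper names `pick_free_ports` and `resolve_ports` are hypothetical and are not the functions added in this PR; raising on a count mismatch is also an assumption about what "check" means here.

```python
import socket
from typing import List, Optional


def pick_free_ports(count: int) -> List[int]:
    """Ask the OS for `count` distinct free TCP ports by binding to port 0."""
    sockets, ports = [], []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("127.0.0.1", 0))  # port 0 lets the kernel choose a free port
            sockets.append(s)  # keep it open so the next bind gets a different port
            ports.append(s.getsockname()[1])
    finally:
        for s in sockets:
            s.close()
    return ports


def resolve_ports(name: str, user_ports: Optional[List[int]], required: int) -> List[int]:
    """Hypothetical helper: validate user-supplied ports, or allocate them."""
    if not user_ports:
        # Nothing was passed: allocate exactly as many ports as are needed.
        return pick_free_ports(required)
    if len(user_ports) != required:
        # Assumption: a wrong count is reported as an error at this layer.
        raise ValueError(f"{name}: expected {required} port(s), got {len(user_ports)}")
    return list(user_ports)
```

Ports chosen this way are released before the service binds them, so another process could in principle claim one in between; a real implementation would need to tolerate that race.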
FDConfig
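- Remove the old, non-standard type conversion logic. At config initialization, parse_ports() uniformly converts the *_port variables to list[int].
- 🌟 Add local_* port variables: local_engine_worker_queue_port (int), local_rdma_comm_ports (list[int]), local_pd_comm_port (int), and local_cache_queue_port (int). In DP/EP scenarios they refer to the ports used by the current DP; non-DP/EP scenarios use the local_* variables as well.
- 🌟 Add a postprocess_devices_and_ports config post-processing step. In FDConfig.postprocess, it assigns all local_* port variables by slicing out the ports the current DP needs. Modifying the FDConfig object from modules other than config.py is discouraged.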
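The normalization and per-DP slicing described above might look roughly like the sketch below. Both `parse_ports_sketch` and `ports_for_dp` are hypothetical stand-ins, not the actual parse_ports() or postprocess_devices_and_ports code, and the contiguous, equal-sized layout of ports per DP rank is an assumption.

```python
from typing import List, Union

PortSpec = Union[None, int, str, List[Union[int, str]]]


def parse_ports_sketch(value: PortSpec) -> List[int]:
    """Normalize an int, a comma-separated string, or a list into list[int]."""
    if value is None:
        return []
    if isinstance(value, int):
        return [value]
    if isinstance(value, str):
        return [int(p) for p in value.split(",") if p.strip()]
    return [int(p) for p in value]


def ports_for_dp(ports: List[int], dp_rank: int, dp_size: int) -> List[int]:
    """Slice out the ports owned by one DP rank (assumed contiguous layout)."""
    if dp_size <= 0 or len(ports) % dp_size != 0:
        raise ValueError(f"{len(ports)} port(s) cannot be split evenly across {dp_size} DP rank(s)")
    per_rank = len(ports) // dp_size
    return ports[dp_rank * per_rank:(dp_rank + 1) * per_rank]


# Example: two DP ranks, one worker-queue port and two RDMA ports per rank.
queue_ports = parse_ports_sketch("8629,8631")
rdma_ports = parse_ports_sketch([9000, 9001, 9002, 9003])
local_queue_port = ports_for_dp(queue_ports, dp_rank=1, dp_size=2)[0]
local_rdma_ports = ports_for_dp(rdma_ports, dp_rank=1, dp_size=2)
```

Doing the split once at config time means downstream modules only ever read the local_* values and never need to index into the full list themselves.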
MultiAPIServer
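- Add an argument validation step: if the user passes no ports, the required number of ports is allocated automatically.
- If the user passes the wrong number of ports, the required number of ports is reallocated.
- Set the FD_ENABLE_MULTI_API_SERVER environment variable by default.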
Cache
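- Rename the engine_pid parameter accepted by CacheTransferManager & CacheMessager to ipc_suffix, which better matches its meaning.
- 🌟 Add initialization code to RDMACacheTransfer that automatically sets the KVCACHE_RDMA_NICS and KVCACHE_GDRCOPY_FLUSH_ENABLE environment variables.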
CommonEngine
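- Remove some of the port list slicing and type conversion logic (now handled in the config layer).
- Change which llm_logger object is used: in DP scenarios, logs should be written to the _dprank*.log files.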
Utils
- Add utility functions for checking ports, parsing ports, and automatically finding available ports.
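For readers unfamiliar with this kind of utility, one common way to probe a port and scan for free ones is sketched below. The function names, the probe-by-bind approach, and the default range 8000-9000 are all assumptions for illustration and do not reflect the actual signatures in fastdeploy/utils.py.

```python
import socket
from typing import List


def is_port_available(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # The port is in use, or binding is not permitted.
            return False
    return True


def find_available_ports(count: int, start: int = 8000, end: int = 9000) -> List[int]:
    """Scan [start, end) and return the first `count` ports that can be bound."""
    found: List[int] = []
    for port in range(start, end):
        if is_port_available(port):
            found.append(port)
            if len(found) == count:
                return found
    raise RuntimeError(f"only {len(found)} free port(s) found in [{start}, {end})")
```

The probe only reflects the state at the moment of the check, so the caller should still handle a bind failure when the service actually starts.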
Examples
- Simplify the launch commands in start_v1_dp2.sh and start_v1_tp1.sh, add a new example start_v1_tp2.sh, and improve the helper functions in utils.sh.
Others
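- 🌟 Update every use of a port variable to the corresponding local_* port variable.
- Support launching multiple DPs through the api server:
  - Remove the EP restriction from some DP logic.
  - When DP0 creates DP1-N, deep-copy DP0's current cfg for each DP, so that config modifications inside DP1-N do not interfere with one another.
  - When ExpertService is initialized, override part of the current DP's configuration according to local_data_parallel_id.
  - 🌟 Have each DP create its own EngineCacheQueue service instead of sharing one across all DPs, matching the EngineWorkerQueue architecture.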
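
The deep-copy and per-DP override steps above can be pictured with a toy config. `ToyConfig`, `make_dp_configs`, and the `data_parallel_size` field are hypothetical and only stand in for FDConfig; the field names local_data_parallel_id and local_engine_worker_queue_port are taken from this description, while indexing one port per DP rank is an assumption.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyConfig:
    """Hypothetical stand-in for FDConfig, reduced to the fields used here."""
    data_parallel_size: int = 1
    local_data_parallel_id: int = 0
    engine_worker_queue_port: List[int] = field(default_factory=list)
    local_engine_worker_queue_port: int = 0


def make_dp_configs(base: ToyConfig) -> List[ToyConfig]:
    """Give every DP rank its own deep copy, then rewrite the per-rank fields."""
    configs = []
    for dp_id in range(base.data_parallel_size):
        cfg = copy.deepcopy(base)  # nested lists are copied too, not shared
        cfg.local_data_parallel_id = dp_id
        cfg.local_engine_worker_queue_port = cfg.engine_worker_queue_port[dp_id]
        configs.append(cfg)
    return configs


base = ToyConfig(data_parallel_size=2, engine_worker_queue_port=[8001, 8002])
dp_configs = make_dp_configs(base)
dp_configs[1].engine_worker_queue_port.append(9999)  # mutation stays local to DP1
```

With a shallow copy the appended port would show up in `base` and in DP0's config as well, which is the kind of cross-DP interference the deep copy is meant to avoid.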

Usage or Command

bash examples/splitwise/start_v1_dp2.sh

Accuracy Tests

$ bash examples/splitwise/start_v1_dp2.sh
ROUTER_PORT: 8274
nohup: redirecting stderr to stdout
P_SERVER_PORTS: 8629,8631
nohup: redirecting stderr to stdout
-------- WAIT FOR HEALTH --------
Port 8629: [OK]   200
Port 8631: [OK]   200
All services are ready!    [38s]
---------------------------------
D_SERVER_PORTS: 8534,8535
nohup: redirecting stderr to stdout
-------- WAIT FOR HEALTH --------
Port 8534: [OK]   200
Port 8535: [OK]   200
All services are ready!    [37s]
---------------------------------
{"id":"chatcmpl-3dcecc9a-cf59-45a1-876a-e66535c6ef23","object":"chat.completion","created":1765535525,"model":"/root/paddlejob/workspace/env_run/output/models/ERNIE-4.5-0.3B-Paddle/","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! How can I assist you today?","multimodal_content":null,"reasoning_content":null,"audio_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"draft_logprobs":null,"prompt_logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":8,"total_tokens":18,"completion_tokens":10,"prompt_tokens_details":{"cached_tokens":0,"image_tokens":0,"video_tokens":0},"completion_tokens_details":{"reasoning_tokens":0,"image_tokens":0}}}

Checklist

[x] Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
[x] Format your code, run pre-commit before commit.
[x] Add unit tests. Please write the reason in this PR if no unit tests.
[x] Provide accuracy results.
[x] If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Dec 06 '25 05:12 liyonghua0910

Thanks for your contribution!

Dec 06 '25 05:12 paddle-bot[bot]

Codecov Report

:x: Patch coverage is 68.51852% with 85 lines in your changes missing coverage. Please review.
:warning: Please upload report for BASE (develop@c9b47f9).

Files with missing lines	Patch %	Lines
fastdeploy/entrypoints/openai/multi_api_server.py	53.22%	14 Missing and 15 partials :warning:
fastdeploy/engine/common_engine.py	58.33%	18 Missing and 2 partials :warning:
fastdeploy/utils.py	64.51%	5 Missing and 6 partials :warning:
...he_manager/transfer_factory/rdma_cache_transfer.py	77.77%	5 Missing and 3 partials :warning:
fastdeploy/engine/args_utils.py	81.08%	4 Missing and 3 partials :warning:
fastdeploy/engine/engine.py	42.85%	2 Missing and 2 partials :warning:
fastdeploy/engine/expert_service.py	83.33%	0 Missing and 2 partials :warning:
fastdeploy/entrypoints/openai/api_server.py	0.00%	2 Missing :warning:
fastdeploy/worker/worker_process.py	33.33%	1 Missing and 1 partial :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #5415   +/-   ##
==========================================
  Coverage           ?   62.08%           
==========================================
  Files              ?      329           
  Lines              ?    41287           
  Branches           ?     6295           
==========================================
  Hits               ?    25633           
  Misses             ?    13701           
  Partials           ?     1953

Flag	Coverage Δ
GPU	`62.08% <68.51%> (?)`

Dec 06 '25 06:12 codecov-commenter