PD-disaggregated deployment of the DeepSeek-R1-FP8 model: the tp=16 prefill service fails at startup
[Gloo] Rank 0 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 0
[Gloo] Rank 1 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 1
[Gloo] Rank 2 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 2
[Gloo] Rank 3 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 3
[Gloo] Rank 4 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 4
[Gloo] Rank 5 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 5
[Gloo] Rank 6 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 6
[Gloo] Rank 7 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 7
INFO 09-30 03:02:12 [manager.py:193] use req queue ChunkedPrefillQueue
INFO 09-30 03:02:14 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
All deep_gemm operations loaded successfully!
INFO 09-30 03:02:15 [__init__.py:216] Automatically detected platform cuda.
WARNING 09-30 03:02:15 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 09-30 03:02:16 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
Process Process-2:9:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/lightllm/lightllm/server/router/model_infer/mode_backend/continues_batch/pd_mode/prefill_node_impl/prefill_kv_move_manager.py", line 233, in _init_env
    manager = PrefillKVMoveManager(args, info_queue, mem_queues)
  File "/lightllm/lightllm/server/router/model_infer/mode_backend/continues_batch/pd_mode/prefill_node_impl/prefill_kv_move_manager.py", line 40, in __init__
    assert self.dp_world_size <= self.node_world_size
AssertionError
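(Note: the failing assert compares dp_world_size with node_world_size. Assuming dp_world_size here is the number of ranks in a single data-parallel group, --tp 16 with the default dp of 1 yields one 16-rank group, while each 8-GPU node has node_world_size = 8, so the check 16 <= 8 fails. This reading is consistent with the dp 16 advice given below.)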
@wenruihua Could you share your launch command?
PD disaggregation is not currently supported in tp 16 mode. You need to enable dp 16.
#pd_prefill_0.sh
export host=10.24.62.3
export pd_master_ip=10.24.62.3
export nccl_host=10.24.62.3
#nvidia-cuda-mps-control -d
LOADWORKER=18 python -m lightllm.server.api_server \
    --model_dir /mnt/model/DeepSeek-R1 \
    --run_mode "prefill" \
    --tp 16 \
    --host $host \
    --port 8019 \
    --nnodes 2 \
    --node_rank 0 \
    --nccl_host $nccl_host \
    --nccl_port 2732 \
    --enable_fa3 \
    --disable_cudagraph \
    --pd_master_ip $pd_master_ip \
    --pd_master_port 8000
#pd_prefill_1.sh
export host=10.24.62.9
export pd_master_ip=10.24.62.3
export nccl_host=10.24.62.3
#nvidia-cuda-mps-control -d
LOADWORKER=18 python -m lightllm.server.api_server \
    --model_dir /mnt/model/DeepSeek-R1 \
    --run_mode "prefill" \
    --tp 16 \
    --host $host \
    --port 8019 \
    --nnodes 2 \
    --node_rank 1 \
    --nccl_host $nccl_host \
    --nccl_port 2732 \
    --enable_fa3 \
    --disable_cudagraph \
    --pd_master_ip $pd_master_ip \
    --pd_master_port 8000
I launched the DeepSeek-R1-FP8 model with tp across 16 GPUs. Do you mean I should change tp16 to dp16?
@wenruihua PD disaggregation cannot currently support tp 16, but dp 16 can do PD disaggregation: use --tp 16 --dp 16, and also set the MOE_MODE=EP environment variable. I don't know which GPUs you are running on; if they are H20, there may also be custom DeepEP adaptation issues.
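For concreteness, here is a minimal sketch of what the node-0 script might look like with that advice applied. It only adds MOE_MODE=EP and --dp 16; every other flag is copied unchanged from the pd_prefill_0.sh posted above, and this exact combination has not been verified here:

#pd_prefill_0.sh (dp 16 sketch, based on the reply above)
export host=10.24.62.3
export pd_master_ip=10.24.62.3
export nccl_host=10.24.62.3
# MOE_MODE=EP enables expert parallelism for the MoE layers, per the reply above
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
    --model_dir /mnt/model/DeepSeek-R1 \
    --run_mode "prefill" \
    --tp 16 \
    --dp 16 \
    --host $host \
    --port 8019 \
    --nnodes 2 \
    --node_rank 0 \
    --nccl_host $nccl_host \
    --nccl_port 2732 \
    --enable_fa3 \
    --disable_cudagraph \
    --pd_master_ip $pd_master_ip \
    --pd_master_port 8000

The node-1 script would presumably differ only in host (10.24.62.9) and --node_rank 1, as in the original pair of scripts.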
That said, in dp 16 mode, prefill latency is not especially good.