
Support disaggregated prefill?

artetaout opened this issue 11 months ago · 8 comments

I saw your code referring to PD disaggregation. Please tell me how to use it.

artetaout avatar Jan 16 '25 08:01 artetaout

demo start args.

pd master

python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
  --run_mode "pd_master" \
  --host $(hostname -i) \
  --port 60011

prefill node

nvidia-cuda-mps-control -d
KV_TRANS_USE_P2P=1 LOADWORKER=1 python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
  --run_mode "prefill" \
  --host $(hostname -i) \
  --port 8017 \
  --tp 4 \
  --nccl_port 2732 \
  --max_total_token_num 400000 \
  --tokenizer_mode fast \
  --pd_master_ip 10.121.4.14 \
  --pd_master_port 60011 \
  --use_dynamic_prompt_cache \
  --max_req_total_len 16000 \
  --running_max_req_size 128 \
  --disable_cudagraph

decode node

nvidia-cuda-mps-control -d
CUDA_VISIBLE_DEVICES=4,5,6,7 KV_TRANS_USE_P2P=1 LOADWORKER=10 python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
  --run_mode "decode" \
  --host $(hostname -i) \
  --port 8118 \
  --nccl_port 12322 \
  --tp 4 \
  --max_total_token_num 400000 \
  --graph_max_len_in_batch 2048 \
  --graph_max_batch_size 16 \
  --tokenizer_mode fast \
  --pd_master_ip 10.121.4.14 \
  --pd_master_port 60011 \
  --use_dynamic_prompt_cache
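
Once all three processes are up, clients only talk to the pd_master port; it routes each request through a prefill node and then a decode node. A minimal smoke test (a sketch assuming the default LightLLM /generate JSON API; adjust the IP to your pd master host):

curl http://10.121.4.14:60011/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 32}}'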

Note that not all models and run modes support PD.

hiworldwzj avatar Jan 23 '25 02:01 hiworldwzj

How is the performance of disaggregated prefill? Does it support multiple P nodes and multiple D nodes(xPyD)?

Dimensionzw avatar Feb 11 '25 07:02 Dimensionzw

When running PD disaggregation with the commands above, the following error is reported:

master log

INFO 02-12 16:56:42 [manager.py:147] recieved req X-Request-Id: X-Session-Id: start_time:2025-02-12 16:56:42 lightllm_req_id:0 
INFO 02-12 16:57:05 [statics_utils.py:24] mean first cost: 388.7448310852051 ms
INFO 02-12 16:57:05 [statics_utils.py:24] mean per token cost: 0.0 ms
INFO 02-12 16:57:35 [statics_utils.py:24] mean first cost: 388.7448310852051 ms
INFO 02-12 16:57:35 [statics_utils.py:24] mean per token cost: 0.0 ms
WARNING 02-12 16:57:43 [manager.py:221] group_request_id: 0 kv move time out err
ERROR 02-12 16:57:43 [manager.py:135] has exception req_id 0 kv move time out, server is busy
WARNING 02-12 16:57:43 [manager.py:304] aborted group_request_id 0
ERROR 02-12 16:57:43 [api_http.py:178] An error occurred: req_id 0 kv move time out, server is busy
ERROR 02-12 16:57:43 [api_http.py:178] Traceback (most recent call last):
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/asyncio/locks.py", line 214, in wait
ERROR 02-12 16:57:43 [api_http.py:178]     await fut
ERROR 02-12 16:57:43 [api_http.py:178] asyncio.exceptions.CancelledError
ERROR 02-12 16:57:43 [api_http.py:178] 
ERROR 02-12 16:57:43 [api_http.py:178] During handling of the above exception, another exception occurred:
ERROR 02-12 16:57:43 [api_http.py:178] 
ERROR 02-12 16:57:43 [api_http.py:178] Traceback (most recent call last):
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 02-12 16:57:43 [api_http.py:178]     return fut.result()
ERROR 02-12 16:57:43 [api_http.py:178] asyncio.exceptions.CancelledError
ERROR 02-12 16:57:43 [api_http.py:178] 
ERROR 02-12 16:57:43 [api_http.py:178] The above exception was the direct cause of the following exception:
ERROR 02-12 16:57:43 [api_http.py:178] 
ERROR 02-12 16:57:43 [api_http.py:178] Traceback (most recent call last):
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/site-packages/lightllm-3.0.0-py3.10.egg/lightllm/server/httpserver_for_pd_master/manager.py", line 219, in fetch_stream
ERROR 02-12 16:57:43 [api_http.py:178]     await asyncio.wait_for(up_status_event.wait(), timeout=60)
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 02-12 16:57:43 [api_http.py:178]     raise exceptions.TimeoutError() from exc
ERROR 02-12 16:57:43 [api_http.py:178] asyncio.exceptions.TimeoutError
ERROR 02-12 16:57:43 [api_http.py:178] 
ERROR 02-12 16:57:43 [api_http.py:178] During handling of the above exception, another exception occurred:
ERROR 02-12 16:57:43 [api_http.py:178] 
ERROR 02-12 16:57:43 [api_http.py:178] Traceback (most recent call last):
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/site-packages/lightllm-3.0.0-py3.10.egg/lightllm/server/api_http.py", line 176, in generate
ERROR 02-12 16:57:43 [api_http.py:178]     return await g_objs.g_generate_func(request, g_objs.httpserver_manager)
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/site-packages/lightllm-3.0.0-py3.10.egg/lightllm/server/api_lightllm.py", line 55, in lightllm_generate
ERROR 02-12 16:57:43 [api_http.py:178]     async for sub_req_id, request_output, metadata, finish_status in results_generator:
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/site-packages/lightllm-3.0.0-py3.10.egg/lightllm/server/httpserver_for_pd_master/manager.py", line 137, in generate
ERROR 02-12 16:57:43 [api_http.py:178]     raise e
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/site-packages/lightllm-3.0.0-py3.10.egg/lightllm/server/httpserver_for_pd_master/manager.py", line 131, in generate
ERROR 02-12 16:57:43 [api_http.py:178]     async for sub_req_id, request_output, metadata, finish_status in results_generator:
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/site-packages/lightllm-3.0.0-py3.10.egg/lightllm/server/httpserver_for_pd_master/manager.py", line 257, in _wait_to_token_package
ERROR 02-12 16:57:43 [api_http.py:178]     async for sub_req_id, out_str, metadata, finish_status in self.fetch_stream(
ERROR 02-12 16:57:43 [api_http.py:178]   File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/site-packages/lightllm-3.0.0-py3.10.egg/lightllm/server/httpserver_for_pd_master/manager.py", line 222, in fetch_stream
ERROR 02-12 16:57:43 [api_http.py:178]     assert False, f"req_id {group_request_id} kv move time out, server is busy"
ERROR 02-12 16:57:43 [api_http.py:178] AssertionError: req_id 0 kv move time out, server is busy
127.0.0.1:49236 - "POST /generate HTTP/1.1" 417
INFO 02-12 16:58:35 [statics_utils.py:24] mean first cost: 388.7448310852051 ms
INFO 02-12 16:58:35 [statics_utils.py:24] mean per token cost: 0.0 ms

prefill node log

INFO 02-12 16:56:43 [manager.py:384] X-Request-Id: X-Session-Id: start_time:2025-02-12 16:56:42 lightllm_req_id:0 first_token_cost:355.682373046875ms total_cost_time:355.70740699768066ms,out_token_counter:1 mean_per_token_cost_time: 0.025033950805664062ms prompt_token_num:4 prompt_cache_len:0 prompt_cache_ratio:0.0 
INFO 02-12 16:56:43 [shm_req_manager.py:117] all shm req has been release ok
INFO 02-12 16:56:43 [rpyc_fix_utils.py:36] change socket buffer from 12582912 12582912 change to 4194304
INFO 02-12 16:56:43 [prefill_trans_process.py:95] trans kv process start, nccl_ip: 127.0.1.1, nccl_port: 20000
INFO 02-12 16:56:46 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
ERROR 02-12 16:56:46 [_custom_ops.py:16] vllm or lightllm_kernel is not installed, you can't use custom ops
INFO 02-12 16:56:47 [communication_op.py:41] vllm or lightllm_kernel is not installed, you can't use custom allreduce
INFO 02-12 16:56:47 [communication_op.py:48] lightllm_kernel is not installed, you can't use custom allgather
INFO 02-12 16:56:49 [prefill_infer_rpyc.py:47] put mem manager to mem_queue ok
INFO 02-12 16:56:49 [prefill_infer_rpyc.py:47] put mem manager to mem_queue ok
[W212 16:56:50.983097806 socket.cpp:759] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.1.1]:20000 (errno: 97 - Address family not supported by protocol).
INFO 02-12 16:56:56 [prefill_kv_move_manager.py:157] request_kv_trans_loop get task id: 0 in_len:4 v_len: 4 move_len: None dp_index:0 queue time 13.332210302352905 s 
INFO 02-12 16:56:56 [prefill_kv_move_manager.py:171] request_kv_trans_loop request_data_transfer ok, id: 0 in_len:4 v_len: 4 move_len: None dp_index:0 cost time: 0.007356405258178711 s
INFO 02-12 16:56:56 [prefill_kv_move_manager.py:232] kv_trans_handle_loop get task id: 0 in_len:4 v_len: 4 move_len: 4 dp_index:0 to start kv movequeue time 13.342423915863037 s 
INFO 02-12 16:56:56 [prefill_trans_process.py:58] trans start: id: 0 in_len:4 v_len: 4 move_len: 4 dp_index:0
INFO 02-12 16:56:57 [prefill_trans_process.py:64] trans finished: id: 0 in_len:4 v_len: 4 move_len: 4 dp_index:0 move len: 4
INFO 02-12 16:56:58 [prefill_trans_process.py:66] trans cost time: 2.027672529220581,move_total_kv_len: 4, id: 0 in_len:4 v_len: 4 move_len: 4 dp_index:0
INFO 02-12 16:56:58 [prefill_kv_move_manager.py:206] _transfer_kv data ok, req_id: 0 cost total time: 15.371265172958374 s
INFO 02-12 16:56:58 [prefill_infer_rpyc.py:36] unfrozen tokens for req id: 0
INFO 02-12 16:56:59 [statics_utils.py:24] mean first cost: 355.682373046875 ms
INFO 02-12 16:56:59 [statics_utils.py:24] mean per token cost: 0.025033950805664062 ms
INFO 02-12 16:57:29 [statics_utils.py:24] mean first cost: 355.682373046875 ms
INFO 02-12 16:57:29 [statics_utils.py:24] mean per token cost: 0.025033950805664062 ms
WARNING 02-12 16:57:43 [manager.py:418] aborted group_request_id not exist
INFO 02-12 16:58:29 [statics_utils.py:24] mean first cost: 355.682373046875 ms
INFO 02-12 16:58:29 [statics_utils.py:24] mean per token cost: 0.025033950805664062 ms
INFO 02-12 16:58:59 [statics_utils.py:24] mean first cost: 355.682373046875 ms

decode node log

INFO 02-12 16:56:53 [communication_op.py:48] lightllm_kernel is not installed, you can't use custom allgather
INFO 02-12 16:56:56 [decode_infer_rpyc.py:162] put mem manager to info_queues ok
INFO 02-12 16:56:56 [decode_infer_rpyc.py:162] put mem manager to info_queues ok
INFO 02-12 16:56:56 [decode_infer_rpyc.py:162] put mem manager to info_queues ok
INFO 02-12 16:56:56 [decode_infer_rpyc.py:162] put mem manager to info_queues ok
INFO 02-12 16:56:56 [decode_infer_rpyc.py:177] kv time out reqs: []
[W212 16:56:56.031492389 socket.cpp:759] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.1.1]:20000 (errno: 97 - Address family not supported by protocol).
INFO 02-12 16:56:56 [decode_kv_move_manager.py:425] exposed_request_data_transfer in id: 0 in_len:4 v_len: None move_len: None dp_index:None, type <class 'lightllm.server.pd_io_struct.KVMoveTask'>
INFO 02-12 16:56:56 [decode_kv_move_manager.py:129] kv_move_loop get task id: 0 in_len:4 v_len: 4 move_len: 4 dp_index:0
INFO 02-12 16:56:56 [decode_trans_process.py:56] trans start: id: 0 in_len:4 v_len: 4 move_len: 4 dp_index:0
INFO 02-12 16:56:58 [decode_trans_process.py:61] trans finished: id: 0 in_len:4 v_len: 4 move_len: 4 dp_index:0 move len: 4
INFO 02-12 16:56:59 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:03 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:06 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:10 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:13 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:15 [statics_utils.py:24] mean first cost: 0.0 ms
INFO 02-12 16:57:15 [statics_utils.py:24] mean per token cost: 0.0 ms
INFO 02-12 16:57:17 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:20 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:24 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:27 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:31 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:34 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:38 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:41 [decode_infer_rpyc.py:177] kv time out reqs: []
WARNING 02-12 16:57:43 [manager.py:418] aborted group_request_id not exist
INFO 02-12 16:57:45 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:48 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:52 [decode_infer_rpyc.py:177] kv time out reqs: []
INFO 02-12 16:57:55 [decode_infer_rpyc.py:177] kv time out reqs: []
ERROR 02-12 16:57:56 [decode_kv_move_manager.py:141] 
Traceback (most recent call last):
  File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/site-packages/lightllm-3.0.0-py3.10.egg/lightllm/server/router/model_infer/mode_backend/continues_batch/pd_mode/decode_node_impl/decode_kv_move_manager.py", line 138, in kv_move_loop
    self._transfer_kv(move_tasks)
  File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/site-packages/lightllm-3.0.0-py3.10.egg/lightllm/server/router/model_infer/mode_backend/continues_batch/pd_mode/decode_node_impl/decode_kv_move_manager.py", line 105, in _transfer_kv
    assert self.task_out_queue.get(timeout=60) == "ok"
  File "/opt/nas/p/conda/envs/zhangwei/lib/python3.10/multiprocessing/queues.py", line 114, in get
    raise Empty
_queue.Empty
ERROR 02-12 16:57:56 [decode_kv_move_manager.py:149] kv_move_loop prefill id 27543024419974231882313753204271458421 device_index 0 thread quit
ERROR 02-12 16:57:56 [decode_kv_move_manager.py:187] put_to_radix_loop, prefill id 27543024419974231882313753204271458421 device_index 0 thread quit
ERROR 02-12 16:57:56 [decode_kv_move_manager.py:227] trans obj del start, prefill node id 27543024419974231882313753204271458421 device_index 0
ERROR 02-12 16:57:56 [decode_kv_move_manager.py:239] trans obj deled, prefill node id 27543024419974231882313753204271458421 device_index 0
WARNING 02-12 16:57:56 [decode_kv_move_manager.py:243] trans kv process 2113474 is killed
DEBUG 02-12 16:57:56 [mem_manager.py:266] freed all gpu mem size 400000
DEBUG 02-12 16:57:56 [mem_manager.py:266] freed all gpu mem size 400000

For reference, my startup commands were as follows:

master

nvidia-cuda-mps-control -d
CUDA_VISIBLE_DEVICES=0 python -m lightllm.server.api_server --model_dir /opt/nas/p/zhangwei/model_hub/Qwen2.5-7B-Instruct --model_name qwen --run_mode "pd_master" --host 0.0.0.0 --port 60011

prefill

CUDA_VISIBLE_DEVICES=0,1 KV_TRANS_USE_P2P=1 LOADWORKER=1 python -m lightllm.server.api_server \
  --model_dir /opt/nas/p/zhangwei/model_hub/Qwen2.5-7B-Instruct \
  --run_mode "prefill" \
  --host 0.0.0.0 \
  --port 8017 \
  --tp 2 \
  --nccl_port 2732 \
  --max_total_token_num 400000 \
  --tokenizer_mode fast \
  --pd_master_ip 0.0.0.0 \
  --pd_master_port 60011 \
  --use_dynamic_prompt_cache \
  --max_req_total_len 16000 \
  --running_max_req_size 128 \
  --disable_cudagraph

decode

CUDA_VISIBLE_DEVICES=2,3,4,5 KV_TRANS_USE_P2P=1 LOADWORKER=10 python -m lightllm.server.api_server \
  --model_dir /opt/nas/p/zhangwei/model_hub/Qwen2.5-7B-Instruct \
  --run_mode "decode" \
  --host 0.0.0.0 \
  --port 8118 \
  --nccl_port 12322 \
  --tp 4 \
  --max_total_token_num 400000 \
  --graph_max_len_in_batch 2048 \
  --graph_max_batch_size 16 \
  --tokenizer_mode fast \
  --pd_master_ip 0.0.0.0 \
  --pd_master_port 60011 \
  --use_dynamic_prompt_cache

Have you encountered the above problems? Is there any solution?

Dimensionzw avatar Feb 12 '25 09:02 Dimensionzw

@Dimensionzw The prefill node and the decode node need to use the same --tp value; this constraint is currently required. (In your commands above, the prefill node uses --tp 2 while the decode node uses --tp 4, which violates it.)

hiworldwzj avatar Feb 19 '25 06:02 hiworldwzj

Is multi-node PD disaggregation currently supported?

sitabulaixizawaluduo avatar Feb 19 '25 11:02 sitabulaixizawaluduo

So does it support multiple P nodes and multiple D nodes? If so, please provide an example, e.g. 1P2D.

binbabou avatar Feb 24 '25 10:02 binbabou

@sitabulaixizawaluduo @DayDayupupupup Supported now: just start the P and D nodes, and the pd master will manage all the details.
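
For example, a 1P2D layout (a hedged sketch extrapolated from the demo args earlier in this thread, not an official recipe) just starts one prefill node and two decode nodes that all register with the same pd master; keep --tp identical across all P and D nodes:

# pd master
python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
  --run_mode "pd_master" --host $(hostname -i) --port 60011

# prefill node (the single P)
KV_TRANS_USE_P2P=1 LOADWORKER=1 python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
  --run_mode "prefill" --host $(hostname -i) --port 8017 --tp 4 --nccl_port 2732 \
  --max_total_token_num 400000 --pd_master_ip 10.121.4.14 --pd_master_port 60011

# decode node 1
CUDA_VISIBLE_DEVICES=0,1,2,3 KV_TRANS_USE_P2P=1 LOADWORKER=10 python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
  --run_mode "decode" --host $(hostname -i) --port 8118 --tp 4 --nccl_port 12322 \
  --max_total_token_num 400000 --pd_master_ip 10.121.4.14 --pd_master_port 60011

# decode node 2: identical args but its own --port and --nccl_port
# (and its own GPUs if it shares a machine with decode node 1)
CUDA_VISIBLE_DEVICES=4,5,6,7 KV_TRANS_USE_P2P=1 LOADWORKER=10 python -m lightllm.server.api_server --model_dir /dev/shm/llama2-7b \
  --run_mode "decode" --host $(hostname -i) --port 8119 --tp 4 --nccl_port 12323 \
  --max_total_token_num 400000 --pd_master_ip 10.121.4.14 --pd_master_port 60011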

hiworldwzj avatar Mar 05 '25 06:03 hiworldwzj

Now I am trying to use PD disaggregation with 2 nodes. Each node has 8 V100s with NVLink.

Commands:

master:

CUDA_VISIBLE_DEVICES=0 python -m lightllm.server.api_server \
  --model_dir /share/models/meta-llama/Llama-3.2-1B-Instruct \
  --run_mode "pd_master" \
  --host 10.0.0.103 \
  --port 8000 &

prefill:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 LOADWORKER=1 python -m lightllm.server.api_server \
  --model_dir /share/models/meta-llama/Llama-3.2-1B-Instruct \
  --run_mode "prefill" \
  --host 10.0.0.103 \
  --port 8017 \
  --tp 8 \
  --nccl_port 2732 \
  --max_total_token_num 200000 \
  --data_type float32 \
  --mem_fraction 0.8 \
  --tokenizer_mode fast \
  --pd_master_ip 10.0.0.103 \
  --pd_master_port 8000 \
  --max_req_total_len 16000 \
  --running_max_req_size 200 \
  --disable_chunked_prefill \
  --disable_custom_allreduce \
  --disable_custom_allgather \
  --disable_cudagraph &

decode:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 LOADWORKER=10 python -m lightllm.server.api_server \
  --model_dir /share/models/meta-llama/Llama-3.2-1B-Instruct \
  --run_mode "decode" \
  --host 10.0.0.101 \
  --port 8118 \
  --nccl_port 12322 \
  --tp 8 \
  --max_total_token_num 200000 \
  --data_type float32 \
  --mem_fraction 0.8 \
  --disable_cudagraph \
  --disable_chunked_prefill \
  --disable_custom_allreduce \
  --disable_custom_allgather \
  --tokenizer_mode fast \
  --pd_master_ip 10.0.0.103 \
  --pd_master_port 8000 &

When I send a request, the following errors occur.

prefill:

error while connect to decode node: PDTransJoinInfo(decode_id=49503501587641736076224139062164708495, decode_device_id=-1, prefill_id=220522734801353269567686227577997274913, prefill_device_id=0, pd_prefill_nccl_ip='127.0.1.1', pd_prefill_nccl_port=20000, connect_id='27d42696-a230-44f2-8a2d-25eb9eede104') connect time out node_info PDTransJoinInfo(decode_id=49503501587641736076224139062164708495, decode_device_id=-1, prefill_id=220522734801353269567686227577997274913, prefill_device_id=0, pd_prefill_nccl_ip='127.0.1.1', pd_prefill_nccl_port=20000, connect_id='27d42696-a230-44f2-8a2d-25eb9eede104')

decode:

TCP client failed to connect/validate to host 127.0.1.1:20000 - timed out (try=1, timeout=30000ms): The client socket has timed out after 30000ms while trying to connect to (127.0.1.1, 20000).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1025 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74307436b1b6 in /home/lq/anaconda3/envs/lightllm/lib/python3.9/site-packages/torch/lib/libc10.so)

and

Traceback (most recent call last):
ERROR 05-22 08:19:00 [api_http.py:179]   File "/home/lq/anaconda3/envs/lightllm/lib/python3.9/site-packages/lightllm-1.0.1-py3.9.egg/lightllm/server/api_http.py", line 177, in generate
ERROR 05-22 08:19:00 [api_http.py:179]     return await g_objs.g_generate_func(request, g_objs.httpserver_manager)
ERROR 05-22 08:19:00 [api_http.py:179]   File "/home/lq/anaconda3/envs/lightllm/lib/python3.9/site-packages/lightllm-1.0.1-py3.9.egg/lightllm/server/api_lightllm.py", line 55, in lightllm_generate
ERROR 05-22 08:19:00 [api_http.py:179]     async for sub_req_id, request_output, metadata, finish_status in results_generator:
ERROR 05-22 08:19:00 [api_http.py:179]   File "/home/lq/anaconda3/envs/lightllm/lib/python3.9/site-packages/lightllm-1.0.1-py3.9.egg/lightllm/server/httpserver_for_pd_master/manager.py", line 137, in generate
ERROR 05-22 08:19:00 [api_http.py:179]     raise e
ERROR 05-22 08:19:00 [api_http.py:179]   File "/home/lq/anaconda3/envs/lightllm/lib/python3.9/site-packages/lightllm-1.0.1-py3.9.egg/lightllm/server/httpserver_for_pd_master/manager.py", line 131, in generate
ERROR 05-22 08:19:00 [api_http.py:179]     async for sub_req_id, request_output, metadata, finish_status in results_generator:
ERROR 05-22 08:19:00 [api_http.py:179]   File "/home/lq/anaconda3/envs/lightllm/lib/python3.9/site-packages/lightllm-1.0.1-py3.9.egg/lightllm/server/httpserver_for_pd_master/manager.py", line 258, in _wait_to_token_package
ERROR 05-22 08:19:00 [api_http.py:179]     async for sub_req_id, out_str, metadata, finish_status in self.fetch_stream(
ERROR 05-22 08:19:00 [api_http.py:179]   File "/home/lq/anaconda3/envs/lightllm/lib/python3.9/site-packages/lightllm-1.0.1-py3.9.egg/lightllm/server/httpserver_for_pd_master/manager.py", line 223, in fetch_stream
ERROR 05-22 08:19:00 [api_http.py:179]     assert False, f"req_id {group_request_id} kv move time out, server is busy"
ERROR 05-22 08:19:00 [api_http.py:179] AssertionError: req_id 8 kv move time out, server is busy

The connection between the two nodes is OK.
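
One detail that may matter: as in the earlier report, the logs show pd_prefill_nccl_ip='127.0.1.1', i.e. the prefill node advertises a loopback alias as its KV-transfer address, which a remote decode node cannot reach. A quick way to check what the prefill host resolves to (a debugging sketch, not an official fix):

# On the prefill node: 127.0.1.1 typically comes from a Debian/Ubuntu-style
# /etc/hosts entry mapping the hostname to a loopback alias.
hostname -i
getent hosts $(hostname)
# If the hostname maps to 127.0.1.1, point it at the real NIC address instead.
grep 127.0.1.1 /etc/hosts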

67lc avatar May 22 '25 08:05 67lc