[Bug] Error when using two nodes
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
I get an error when using two nodes:
Reproduction
# node 1
docker run --gpus '"device=1,2,3,4"' \
--shm-size 32g \
--network=host \
-v /data/fffan/0_experiment/0_vllm/deepseek-ai:/root/deepseek-ai \
--name sglang_multinode1 \
-it \
--rm \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc=host \
lmsysorg/sglang:dev \
python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
# node 2
docker run --gpus '"device=1,2,3,4"' \
--shm-size 32g \
--network=host \
-v /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai:/root/deepseek-ai \
--name sglang_multinode2 \
-it \
--rm \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc=host \
lmsysorg/sglang:dev \
python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000
Environment
Docker image used: lmsysorg/sglang:dev
According to the log, the process-termination routine runs over and over. More specifically, sigquit_handler (in engine.py at line 333) repeatedly calls kill_process_tree(os.getpid()), and inside that function (in utils.py at line 507) the code keeps sending SIGQUIT to the current process.
A straightforward fix would be to add error handling to kill_process_tree() so that if it hits an error (for example, the process no longer exists, or it detects that it is trying to kill itself) it does not re-enter an infinite loop. A check for these cases before sending the signal, or catching and handling the exception, would be enough; a rough sketch is shown below.
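This is only an illustration of the guard described above, not the actual code in sglang; the real kill_process_tree() in utils.py may have a different signature, and psutil is assumed to be available in the environment:

import os
import signal

import psutil  # assumed available in the SGLang environment


def kill_process_tree_guarded(parent_pid: int, include_parent: bool = True):
    # Bail out cleanly instead of looping if the process is already gone.
    try:
        parent = psutil.Process(parent_pid)
    except psutil.NoSuchProcess:
        return

    # Kill children first; ignore races where a child exits on its own.
    for child in parent.children(recursive=True):
        try:
            child.kill()
        except psutil.NoSuchProcess:
            pass

    if include_parent:
        try:
            if parent_pid == os.getpid():
                # We are killing ourselves: avoid SIGQUIT so sigquit_handler
                # is not re-entered, which is what caused the infinite loop.
                os.kill(parent_pid, signal.SIGKILL)
            else:
                parent.kill()
        except (psutil.NoSuchProcess, ProcessLookupError):
            pass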
Thank you.
I also added NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME settings to my commands, like this:
# Same settings on all machines
export NCCL_SOCKET_IFNAME=ens8f0np0   # Force NCCL to use the Ethernet NIC
export GLOO_SOCKET_IFNAME=ens8f0np0   # Force Gloo to use the Ethernet NIC
# Enable NCCL debug output
export NCCL_DEBUG=INFO
# Enable Gloo debug output (PyTorch)
export GLOO_DEBUG=1
# node 1
docker run --gpus '"device=1,2,3,4"' \
--shm-size 32g \
--network=host \
-v /data/fffan/0_experiment/0_vllm/deepseek-ai:/root/deepseek-ai \
--name sglang_multinode1 \
-it \
--rm \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc=host \
lmsysorg/sglang:dev \
python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
# Same settings on all machines
export NCCL_SOCKET_IFNAME=ens8f0np0   # Force NCCL to use the Ethernet NIC
export GLOO_SOCKET_IFNAME=ens8f0np0   # Force Gloo to use the Ethernet NIC
# Enable NCCL debug output
export NCCL_DEBUG=INFO
# Enable Gloo debug output (PyTorch)
export GLOO_DEBUG=1
# node 2
docker run --gpus '"device=1,2,3,4"' \
--shm-size 32g \
--network=host \
-v /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai:/root/deepseek-ai \
--name sglang_multinode2 \
-it \
--rm \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc=host \
lmsysorg/sglang:dev \
python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000
And I get this error:
Node-1:
Node-2:
My ip addr output is:
(base) root@ubuntu:/data/fffan/0_experiment/3_SGLang/0_docker/1_two_nodes_allModel# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens8f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff permaddr 6c:92:cf:af:54:c0
altname enp151s0f0np0
3: ens8f1np1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff permaddr 6c:92:cf:af:54:c1
altname enp151s0f1np1
4: ens16f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 6c:fe:54:a1:40:30 brd ff:ff:ff:ff:ff:ff
altname enp50s0f0
5: ens16f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 6c:fe:54:a1:40:31 brd ff:ff:ff:ff:ff:ff
altname enp50s0f1
6: ens16f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 6c:fe:54:a1:40:32 brd ff:ff:ff:ff:ff:ff
altname enp50s0f2
7: ens16f3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 6c:fe:54:a1:40:33 brd ff:ff:ff:ff:ff:ff
altname enp50s0f3
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff
inet 10.68.27.14/24 brd 10.68.27.255 scope global bond0
valid_lft forever preferred_lft forever
inet6 fe80::8cce:cfff:fe7f:81d6/64 scope link
valid_lft forever preferred_lft forever
9: br-c2260f9ba85a: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:ba:e6:a7:f8 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global br-c2260f9ba85a
valid_lft forever preferred_lft forever
10: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:d6:7d:9e:dd brd ff:ff:ff:ff:ff:ff
inet 172.250.0.1/20 brd 172.250.15.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:d6ff:fe7d:9edd/64 scope link
valid_lft forever preferred_lft forever
Can you please take a look again?
@jhinpan
Can you try to set NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME to bond0?
Has your problem been resolved? Your exports seem to be in the wrong place: docker run won't inherit the host's exports, so you need to pass them into the container via the --env flag, for example:
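A sketch of what the node-1 command could look like with the settings passed through --env (bond0 per the earlier suggestion; paths and other flags copied from the commands above):

# node 1: pass the NCCL/Gloo settings into the container with --env,
# since host-side exports are not inherited by docker run
docker run --gpus '"device=1,2,3,4"' \
--shm-size 32g \
--network=host \
-v /data/fffan/0_experiment/0_vllm/deepseek-ai:/root/deepseek-ai \
--name sglang_multinode1 \
-it \
--rm \
--env "HF_TOKEN=$HF_TOKEN" \
--env "NCCL_SOCKET_IFNAME=bond0" \
--env "GLOO_SOCKET_IFNAME=bond0" \
--env "NCCL_DEBUG=INFO" \
--ipc=host \
lmsysorg/sglang:dev \
python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000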
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.