[Bug] Get error when use two nodes

Open Tian14267 opened this issue 10 months ago • 3 comments

Checklist

  • [ ] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [ ] 5. Please use English, otherwise it will be closed.

Describe the bug

I get an error when using two nodes:

log.txt

Reproduction

# node 1
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/0_vllm/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode1 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000


# node 2
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode2 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000

Environment

Docker image in use: lmsysorg/sglang:dev

Tian14267 avatar Feb 12 '25 03:02 Tian14267

According to the log, it appears that the process termination routine is happening over and over. More specifically, the sigquit_handler (in engine.py at line 333) is repeatedly calling kill_process_tree(os.getpid()), and inside that function (in utils.py at line 507) the code attempts to send the SIGQUIT signal to the current process over and over again.

A straightforward fix would be to add error handling to the kill_process_tree() function so that an error (for example, the process no longer exists, or the function detects that it is already trying to kill itself) does not send it back into an infinite loop. You could either check for these cases before sending the signal, or catch and handle the exception.

jhinpan avatar Feb 12 '25 03:02 jhinpan

Thank you. I added NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME to my launch commands, like this:

# Set identically on all machines
export NCCL_SOCKET_IFNAME=ens8f0np0         # Force NCCL to use the Ethernet NIC
export GLOO_SOCKET_IFNAME=ens8f0np0         # Force Gloo to use the Ethernet NIC
# Enable NCCL debug output
export NCCL_DEBUG=INFO
# Enable Gloo debug output (PyTorch)
export GLOO_DEBUG=1
# node 1
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/0_vllm/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode1 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
# Set identically on all machines
export NCCL_SOCKET_IFNAME=ens8f0np0         # Force NCCL to use the Ethernet NIC
export GLOO_SOCKET_IFNAME=ens8f0np0         # Force Gloo to use the Ethernet NIC
# Enable NCCL debug output
export NCCL_DEBUG=INFO
# Enable Gloo debug output (PyTorch)
export GLOO_DEBUG=1
# node 2
docker run --gpus '"device=1,2,3,4"' \
    --shm-size 32g \
    --network=host \
    -v /data/fffan/0_experiment/2_Vllm_test/6_Deepseek/deepseek-ai:/root/deepseek-ai \
    --name sglang_multinode2 \
    -it \
    --rm \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:dev \
    python3 -m sglang.launch_server --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tp 8 --dist-init-addr 10.68.27.13:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000

And I get this error:

Node-1:

log-node-1.txt

Node-2:

log-node-2.txt

The output of ip addr on my machine is:

(base) root@ubuntu:/data/fffan/0_experiment/3_SGLang/0_docker/1_two_nodes_allModel# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens8f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff permaddr 6c:92:cf:af:54:c0
    altname enp151s0f0np0
3: ens8f1np1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff permaddr 6c:92:cf:af:54:c1
    altname enp151s0f1np1
4: ens16f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:30 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f0
5: ens16f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:31 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f1
6: ens16f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:32 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f2
7: ens16f3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 6c:fe:54:a1:40:33 brd ff:ff:ff:ff:ff:ff
    altname enp50s0f3
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 8e:ce:cf:7f:81:d6 brd ff:ff:ff:ff:ff:ff
    inet 10.68.27.14/24 brd 10.68.27.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::8cce:cfff:fe7f:81d6/64 scope link 
       valid_lft forever preferred_lft forever
9: br-c2260f9ba85a: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:ba:e6:a7:f8 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global br-c2260f9ba85a
       valid_lft forever preferred_lft forever
10: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:d6:7d:9e:dd brd ff:ff:ff:ff:ff:ff
    inet 172.250.0.1/20 brd 172.250.15.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:d6ff:fe7d:9edd/64 scope link 
       valid_lft forever preferred_lft forever

Can you please take a look again?

@jhinpan

Tian14267 avatar Feb 12 '25 05:02 Tian14267

Can you try to set NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME to bond0?
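
In the ip addr output above, the 10.68.27.14 address sits on bond0, while ens8f0np0 is a slave of bond0 and has no inet address of its own. A small sketch for checking which interface owns a node's address is below. The helper name iface_for_ip is made up for illustration, and it assumes the one-line format printed by `ip -o -4 addr show` (interface name in the second field, address/prefix in the fourth).

```shell
# Hypothetical helper: print the interface that owns a given IPv4 address.
# Reads `ip -o -4 addr show` style lines from stdin.
iface_for_ip() {
    awk -v ip="$1" '{
        split($4, a, "/")          # $4 looks like 10.68.27.14/24
        if (a[1] == ip) print $2   # $2 is the interface name
    }'
}

# Sample line in the assumed one-line format, built from the output above.
printf '%s\n' '8: bond0    inet 10.68.27.14/24 brd 10.68.27.255 scope global bond0' \
    | iface_for_ip 10.68.27.14
# prints: bond0

# On a real node this would be:
#   ip -o -4 addr show | iface_for_ip 10.68.27.14
```

Whatever name this prints on each node is the value to try for NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME.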

FrankLeeeee avatar Feb 22 '25 06:02 FrankLeeeee

Has your problem been resolved? Your exports seem to be in the wrong place: docker run does not pass variables exported in the host shell into the container, so you need to set them with the --env flag.
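
A minimal sketch of what that could look like, reusing the flags from the commands earlier in the thread. The run_node wrapper and the DOCKER override are made up for illustration, and bond0 is an assumption based on the ip addr output and the earlier suggestion, so substitute whatever interface carries the 10.68.27.x address on each node.

```shell
# Assumption: bond0 is the interface that holds the node's 10.68.27.x address.
IFNAME=bond0

# Hypothetical wrapper: run_node <node-rank> <host-model-dir>
# Set DOCKER=echo to print the command instead of running it.
run_node() {
    node_rank="$1"
    model_dir="$2"
    ${DOCKER:-docker} run --gpus '"device=1,2,3,4"' \
        --shm-size 32g \
        --network=host \
        --ipc=host \
        -v "$model_dir:/root/deepseek-ai" \
        --rm \
        --env "HF_TOKEN=${HF_TOKEN:-}" \
        --env "NCCL_SOCKET_IFNAME=$IFNAME" \
        --env "GLOO_SOCKET_IFNAME=$IFNAME" \
        --env "NCCL_DEBUG=INFO" \
        lmsysorg/sglang:dev \
        python3 -m sglang.launch_server \
            --model-path /root/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
            --tp 8 --dist-init-addr 10.68.27.13:20000 \
            --nnodes 2 --node-rank "$node_rank" \
            --trust-remote-code --host 0.0.0.0 --port 40000
}

# Example (node 1, dry run):
#   DOCKER=echo; run_node 0 /data/fffan/0_experiment/0_vllm/deepseek-ai
```

The variables are passed on the docker run command line here, so they are set inside the container where NCCL and Gloo actually run, rather than only in the host shell.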

wiwiamlam avatar Feb 26 '25 20:02 wiwiamlam

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Apr 28 '25 00:04 github-actions[bot]