Running Llama 3.1 405B with sglang on multiple nodes fails with a TP server error [Bug]

Checklist

[ ] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I have two nodes, node1 and node2. On each node, eth0 is the controller network and eth1 to eth8 are the GPU IB network. I run Llama 3.1 405B with sglang across both nodes, and the TP server fails with an error.

Reproduction

sglang:latest

Environment

Two nodes, node1 and node2. On each node:
eth0 is the controller network
eth1 to eth8 are the GPU IB network

Aug 01 '24 13:08 kinglion811

ref https://github.com/sgl-project/sglang?tab=readme-ov-file#run-llama-31-405b

Aug 01 '24 14:08 zhyncs

I have seen this error (gloo mesh connection failed) with vllm too. I think it is related to your network setup. I wasn't able to find a solution other than using a different set of machines. For me, using H100s from AWS didn't work but using A100s from AWS did work (with the exact same OS software and vllm code).

You might want to set GLOO_SOCKET_IFNAME to your NIC interface, but it didn't help for me. Other than that, you might want to disable all but the network interface you are using.
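
Before pointing GLOO_SOCKET_IFNAME at an interface, it can help to confirm that the name actually exists on each node. A minimal sketch using only the Python standard library; the helper name `missing_interfaces` is made up for illustration, and "eth0" is just the interface name mentioned in this thread:

```python
import socket


def missing_interfaces(wanted):
    """Return the names in `wanted` that are not interfaces on this host."""
    # socket.if_nameindex() returns (index, name) pairs for local interfaces.
    present = {name for _index, name in socket.if_nameindex()}
    return [name for name in wanted if name not in present]


if __name__ == "__main__":
    # "eth0" is the controller-network interface described in this issue.
    print("missing interfaces:", missing_interfaces(["eth0"]))
```

Run it on every node; an empty list means the name is at least present, although it says nothing about whether the interface is routable to the other node.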

Aug 01 '24 17:08 min-xu-et

In your case, perhaps gloo only needs eth0 since my understanding is that gloo is only used for some low bandwidth coordination between the nodes using a CPU process group (at least for vllm).

Aug 01 '24 17:08 min-xu-et

In your case, perhaps gloo only needs eth0 since my understanding is that gloo is only used for some low bandwidth coordination between the nodes using a CPU process group (at least for vllm).

With GLOO_SOCKET_IFNAME set to eth0 (not the RoCE and IB network), there is no gloo error. But if --nccl-init-addr is set to eth1's IP, the error is

If --nccl-init-addr is set to eth0's IP, the error is

What exactly is --nccl-init-addr?
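
One way to narrow this down is to check from the second node whether the address given to --nccl-init-addr accepts TCP connections at all. The sketch below assumes the value is a host:port pair and that the first node is already running and listening on it; both are assumptions, and the helper names and the address 10.0.0.1:20000 are hypothetical:

```python
import socket


def split_host_port(addr):
    """Split 'host:port' into (host, port); raise ValueError if malformed."""
    host, _sep, port = addr.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {addr!r}")
    return host, int(port)


def can_connect(addr, timeout=3.0):
    """Return True if a TCP connection to 'host:port' succeeds within timeout."""
    host, port = split_host_port(addr)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


if __name__ == "__main__":
    # Hypothetical address; replace with the value passed to --nccl-init-addr.
    print(can_connect("10.0.0.1:20000"))
```

If this returns False for eth1's IP but True for eth0's IP, the problem is likely in routing or firewalling on that network rather than in sglang itself.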

Aug 02 '24 03:08 kinglion811

ref https://github.com/sgl-project/sglang?tab=readme-ov-file#run-llama-31-405b

That doesn't solve my problem.

Aug 02 '24 03:08 kinglion811

The README mentions GLOO, but you can also try setting NCCL_SOCKET_IFNAME=<your interface name>.
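
If the server is launched from a Python wrapper, both variables can be set in the child process environment in one place. A rough sketch; `with_socket_ifnames` is a hypothetical helper, "eth0" is the interface name from this thread, and which interface is right for NCCL on a given cluster is not something this snippet can decide:

```python
import os


def with_socket_ifnames(gloo_ifname, nccl_ifname=None, base_env=None):
    """Return a copy of the environment with the socket interface variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GLOO_SOCKET_IFNAME"] = gloo_ifname
    if nccl_ifname is not None:
        env["NCCL_SOCKET_IFNAME"] = nccl_ifname
    return env


if __name__ == "__main__":
    env = with_socket_ifnames("eth0", nccl_ifname="eth0")
    print(env["GLOO_SOCKET_IFNAME"], env["NCCL_SOCKET_IFNAME"])
```

The returned dict can be passed as `env=` to `subprocess.run` together with whatever launch command you already use on each node.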

Aug 02 '24 18:08 hrukalive

I solved it by running docker with the args "--shm-size=1g --ulimit memlock=-1" (ref: https://help.aliyun.com/zh/egs/support/faq-1). I guess it is not related to sglang; it is purely an NCCL issue.
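
To see whether those two docker settings took effect, the limits can be read from inside the container. A small sketch for Linux using only the standard library; the function names are made up for illustration:

```python
import os
import resource


def memlock_is_unlimited():
    """Return True if the soft locked-memory limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY


def shm_total_bytes(path="/dev/shm"):
    """Return the total size in bytes of the filesystem mounted at `path`."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks


if __name__ == "__main__":
    print("memlock unlimited:", memlock_is_unlimited())
    if os.path.exists("/dev/shm"):
        print("/dev/shm size (MiB):", shm_total_bytes() // (1024 * 1024))
```

With "--ulimit memlock=-1" the first check should report True, and with "--shm-size=1g" the second should report roughly 1024 MiB.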

Aug 08 '24 07:08 db24

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

Oct 08 '24 01:10 github-actions[bot]