[Bug] Running Llama 3.1 405B on multiple nodes fails with a TP server error
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I have two nodes, node1 and node2. Each node's network interfaces are set up as follows:
eth0 is the controller network
eth1 to eth8 are the GPU IB network
I run Llama 3.1 405B with sglang like this:
The error is:
Reproduction
sglang:latest
Environment
Two nodes, node1 and node2. Each node's network interfaces are set up as follows:
eth0 is the controller network
eth1 to eth8 are the GPU IB network
ref https://github.com/sgl-project/sglang?tab=readme-ov-file#run-llama-31-405b
I have seen this error (gloo mesh connection failed) with vllm too. I think it is related to your network setup. I wasn't able to find a solution other than using a different set of machines. For me, using H100s from AWS didn't work but using A100s from AWS did work (with the exact same OS, software, and vllm code).
You might want to set GLOO_SOCKET_IFNAME to your NIC interface, although that didn't help for me. Other than that, you might want to disable every network interface except the one you are using.
In your case, perhaps gloo only needs eth0, since my understanding is that gloo is only used for some low-bandwidth coordination between the nodes through a CPU process group (at least for vllm).
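As a rough illustration of the suggestion above, here is a minimal Python sketch that lists the interfaces visible on a node and builds an environment with GLOO_SOCKET_IFNAME pinned to one of them. The helper names `list_interfaces` and `gloo_env` are made up for this example, and "eth0" is only the control network described in this thread; substitute your own interface name.

```python
import os
import socket


def list_interfaces():
    """Return the names of the network interfaces visible to this process."""
    return [name for _, name in socket.if_nameindex()]


def gloo_env(ifname, base_env=None):
    """Return a copy of the environment with GLOO_SOCKET_IFNAME set to ifname.

    base_env defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GLOO_SOCKET_IFNAME"] = ifname
    return env


if __name__ == "__main__":
    print("interfaces:", list_interfaces())
    # "eth0" is an assumption taken from the setup described in this issue.
    env = gloo_env("eth0")
    print("GLOO_SOCKET_IFNAME =", env["GLOO_SOCKET_IFNAME"])
```

The returned dictionary can be passed as the `env` argument when starting the server process, so the setting applies only to that process and not to the whole shell session.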
With GLOO_SOCKET_IFNAME set to eth0 (not the RoCE/IB network), there is no gloo error any more.
But if --nccl-init-addr is set to eth1's IP, the error is:
If --nccl-init-addr is set to eth0's IP, the error is:
What exactly is --nccl-init-addr supposed to be?
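One way to narrow this down is to check whether the address is reachable over TCP from the other node. This is a minimal sketch that assumes --nccl-init-addr takes a `host:port` value, as in the README example linked above. The helpers `parse_addr` and `can_connect` are hypothetical names, not part of sglang. The check can only succeed while something is listening on that port, for example after the first node has started.

```python
import socket


def parse_addr(addr):
    """Split a 'host:port' string into (host, port) and validate the port."""
    host, _, port = addr.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {addr!r}")
    return host, int(port)


def can_connect(addr, timeout=3.0):
    """Return True if a TCP connection to addr succeeds within timeout seconds."""
    host, port = parse_addr(addr)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Running `can_connect` with eth0's IP and then with eth1's IP from the second node would show which of the two networks actually carries plain TCP traffic between the nodes.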
ref https://github.com/sgl-project/sglang?tab=readme-ov-file#run-llama-31-405b
That doesn't answer my question.
The README mentions GLOO, but you can also try setting NCCL_SOCKET_IFNAME=<your interface name>.
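To apply such a variable to a single launch rather than the whole shell, a small wrapper can start the command with extra environment variables. The sketch below is generic Python and does not use any sglang API. `run_with_env` is a hypothetical helper, "eth0" is an assumed interface name, and NCCL_DEBUG=INFO is NCCL's own variable for verbose logging, which should show which interface NCCL selects.

```python
import os
import subprocess
import sys


def run_with_env(cmd, overrides):
    """Run cmd with overrides added to the current environment.

    Returns the child's standard output as text and raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    env = dict(os.environ)
    env.update(overrides)
    result = subprocess.run(
        cmd, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in command that echoes the variable; replace it with the real
    # server launch command from the README.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['NCCL_SOCKET_IFNAME'])"]
    out = run_with_env(probe, {"NCCL_SOCKET_IFNAME": "eth0",
                               "NCCL_DEBUG": "INFO"})
    print(out.strip())
```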
I solved it by running docker with the args "--shm-size=1g --ulimit memlock=-1" (ref: https://help.aliyun.com/zh/egs/support/faq-1). I guess it is not related to sglang; it is purely an NCCL issue.
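To confirm from inside the container that these two settings took effect, something like the following sketch can help. It only reads the size of /dev/shm and the locked-memory limit using the Python standard library (Unix only). The function names are made up for this example. Docker gives containers a 64 MB /dev/shm unless --shm-size is passed.

```python
import os
import resource
import shutil


def shm_size_bytes(path="/dev/shm"):
    """Return the total size of the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


def memlock_limit():
    """Return the (soft, hard) limits on locked memory for this process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(limit):
    """Render a resource limit, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


if __name__ == "__main__":
    size = shm_size_bytes()
    print("/dev/shm:", "missing" if size is None else f"{size / 2**20:.0f} MiB")
    soft, hard = memlock_limit()
    print("memlock soft:", describe(soft), "hard:", describe(hard))
```

With the docker arguments above, the expectation would be a /dev/shm of about 1 GiB and an unlimited memlock limit.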
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.