Cannot do inference for any model on more than two nodes
Hi,
I'm running model inference across multiple nodes. It works fine with two nodes, but on more than two nodes it always fails with the following error:

NCCL WARN NET/IB : collective mismatch error local size 131072 remote 0 addr 7f083e630000 rkey 83000 seq 2/2
I tried both the megatron-345M and gpt-175B models; they have the same issue. I'm using nodes with 8x A100-40GB GPUs each.
The commands I used to run it are as follows:

Two nodes:
mpirun -np 16 -npernode 8 -hostfile /job/hostfile -mca btl_tcp_if_exclude lo,docker0 ./bin/gpt_sample

Three nodes:
mpirun -np 24 -npernode 8 -hostfile /job/hostfile -mca btl_tcp_if_exclude lo,docker0 ./bin/gpt_sample
My NCCL version is 2.8.0.
Has anyone managed to run these models on more than two nodes?
Thanks.
Du
Hi, @duli2012
Can you make sure you have the same environment settings on all nodes? You can set NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ENV to generate logs when you run the gpt example, and please share your GPT configuration (especially tensor_para_size and pipeline_para_size). In other words, it could be that the nodes have different topologies or environment variables.
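For example, a sketch of how you could collect those logs, assuming the same Open MPI launch as in your three-node command (the `-x` flag forwards environment variables to all ranks):

```shell
# Sketch: re-run the three-node launch with NCCL debug logging enabled.
# NCCL_DEBUG=INFO plus NCCL_DEBUG_SUBSYS=ENV makes each rank print the
# NCCL-related environment it sees at startup, so you can diff the output
# between nodes to spot mismatched settings.
mpirun -np 24 -npernode 8 -hostfile /job/hostfile \
    -mca btl_tcp_if_exclude lo,docker0 \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=ENV \
    ./bin/gpt_sample
```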
Also, try upgrading NCCL to the latest version (2.11 or 2.12), and you can simply run nccl-tests to make sure multi-node communication works, since FT uses NCCL internally for communication.
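A sketch of validating NCCL with nccl-tests on the same three nodes, independently of FasterTransformer (paths and build flags may need adjusting for your MPI installation):

```shell
# Sketch: build nccl-tests with MPI support and run an all-reduce benchmark
# across the same hosts used for gpt_sample. MPI_HOME may need to point at
# your Open MPI installation for the build to find mpi.h.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1

# 24 ranks across 3 nodes, 1 GPU per rank, message sizes from 8 B to 128 MB.
# If this also fails beyond two nodes, the problem is in the NCCL/IB setup,
# not in FasterTransformer.
mpirun -np 24 -npernode 8 -hostfile /job/hostfile \
    -mca btl_tcp_if_exclude lo,docker0 \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```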
Closing this issue because it is inactive. Feel free to re-open it if you still have any problems.