
Running resnet_split.py from the examples distributed across 2 servers waits indefinitely

Open alphabewitch opened this issue 2 years ago • 4 comments

Environment: a container created from the nvcr.io/nvidia/tensorflow:21.12-tf1-py3 image
Code: FastNN/resnet/resnet_split.py
Commands:
Server 1: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
Server 2: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh
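The two commands differ only in the task index, so the TF_CONFIG value can be generated instead of typed by hand. The following is a minimal sketch rather than anything shipped with EasyParallelLibrary; the helper name `make_tf_config` is hypothetical, and the worker addresses are the ones quoted in this issue.

```python
import json


def make_tf_config(workers, index):
    """Return the TF_CONFIG JSON string for the worker at position `index`.

    `make_tf_config` is a hypothetical helper, not part of EPL or TensorFlow.
    """
    if not 0 <= index < len(workers):
        raise ValueError("index must refer to an entry in workers")
    return json.dumps({
        # Every worker must see the same cluster list, in the same order.
        "cluster": {"worker": list(workers)},
        # Only the task index changes from one server to the next.
        "task": {"type": "worker", "index": index},
    })


# Addresses taken from the commands in this issue.
WORKERS = ["172.20.21.181:55375", "172.20.21.189:55376"]

for i in range(len(WORKERS)):
    print(f"server {i + 1}: TF_CONFIG='{make_tf_config(WORKERS, i)}'")
```

Generating both strings from one list avoids the two servers ending up with cluster definitions that differ in order or port.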

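Output on server 1: [image]
Output on server 2: [image]
As the output shows, server 1 printed "still waiting" only twice and then stopped, which suggests it had already received server 2's reply, yet it did not proceed any further.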

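Additional note: in the same environment I can run BERT in distributed mode, so the servers can connect to each other and run distributed training normally.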

Is this a problem with how I am running it, or does the code need to be modified?

alphabewitch avatar Aug 03 '23 11:08 alphabewitch

You could check the GPU utilization to see whether training has already started.
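One way to run that check from a script is to query `nvidia-smi`. This is a rough sketch that assumes `nvidia-smi` is on the PATH inside the container and reports a plain integer percentage per GPU; the function names are hypothetical.

```python
import subprocess


def parse_utilization(text):
    """Parse one integer utilization percentage per non-empty line."""
    return [int(line) for line in text.splitlines() if line.strip()]


def query_gpu_utilization():
    """Ask nvidia-smi for the current utilization of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_utilization(result.stdout)


def looks_idle(utilization):
    """True when at least one GPU was reported and all of them are at 0%."""
    return bool(utilization) and all(u == 0 for u in utilization)
```

Calling `looks_idle(query_gpu_utilization())` a few times over a minute gives a quick hint as to whether the job is training or still stuck in setup.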

adoda avatar Sep 04 '23 06:09 adoda

Did you manage to solve this? I ran into the same problem: it hangs, GPU utilization stays at 0, and training never starts.

gyr-kdgc avatar Oct 12 '23 02:10 gyr-kdgc


2023-10-12 07:03:18.989342: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:49295
INFO:tensorflow:BROADCAST_0_broadcast_pool tensors: 217 tensors (tf.int64_ref, tf.float64_ref): 358.61 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (1/15): 62 tensors (tf.float64_ref): 31.02 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (2/15): 22 tensors (tf.float64_ref): 32.04 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (3/15): 6 tensors (tf.float64_ref): 22.03 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (4/15): 4 tensors (tf.float64_ref): 26.02 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (5/15): 4 tensors (tf.float64_ref): 26.01 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (6/15): 4 tensors (tf.float64_ref): 16.02 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (7/15): 40 tensors (tf.float64_ref): 32.40 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (8/15): 32 tensors (tf.float64_ref): 31.30 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (9/15): 20 tensors (tf.float64_ref): 27.54 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (10/15): 4 tensors (tf.float64_ref): 20.02 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (11/15): 4 tensors (tf.float64_ref): 26.02 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (12/15): 4 tensors (tf.float64_ref): 26.01 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (13/15): 4 tensors (tf.float64_ref): 16.02 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (14/15): 6 tensors (tf.float64_ref): 26.18 MB and 0 dynamic-shaped tensors
INFO:tensorflow:BROADCAST_0_broadcast_pool group (15/15): 1 tensors (tf.int64_ref): 0.00 MB and 0 dynamic-shaped tensors
2023-10-12 07:09:48.676384: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-10-12 07:09:48.676388: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-10-12 07:09:48.676387: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-10-12 07:09:48.794784: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: unhandled system error [[{{node BROADCAST_0_broadcast_pool_group_0/0/EplNcclCommunicatorCreater}}]]

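After hanging for a while it fails with the error above. It looks like an NCCL problem; is the cluster not configured correctly?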

gyr-kdgc avatar Oct 12 '23 07:10 gyr-kdgc

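Problem solved; recording the fix here. I prepended NCCL_DEBUG=INFO to the command in the .sh script under scripts and found that the NCCL communication connection was timing out. The NIC name in the log was the same as the one shown by ifconfig. I then prepended NCCL_SOCKET_IFNAME=ens192 to the command in the .sh script to specify the NIC explicitly, and after that it ran normally.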
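The same two settings can also be applied from a small Python launcher instead of editing the shell script. This is only a sketch under stated assumptions: `nccl_env` and `launch` are hypothetical helpers, `ens192` is the interface name from this comment and will differ on other machines, and TF_CONFIG is assumed to be set in the calling environment already.

```python
import os
import subprocess


def nccl_env(ifname=None, debug="INFO", base=None):
    """Return a copy of the environment with the NCCL variables added.

    NCCL_DEBUG and NCCL_SOCKET_IFNAME are NCCL environment variables;
    this helper itself is hypothetical.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = debug                # verbose NCCL logging
    if ifname:
        env["NCCL_SOCKET_IFNAME"] = ifname   # pin NCCL to one network interface
    return env


def launch(script="scripts/train_split.sh", ifname=None):
    """Run the training script with the NCCL variables set.

    TF_CONFIG is inherited from the calling environment.
    """
    return subprocess.run(["bash", script], env=nccl_env(ifname), check=False)
```

For example, `launch(ifname="ens192")` on each server mirrors prefixing the command with `NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=ens192`. Running first with `ifname=None` keeps only the debug logging, which is how the timeout was diagnosed here.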

gyr-kdgc avatar Oct 13 '23 09:10 gyr-kdgc