EasyParallelLibrary
EasyParallelLibrary copied to clipboard
2机2卡实验NCCL报错
使用两个容器进行2机2卡实验,报错如下,希望可以帮忙解决一下
环境:
基于nvcr.io/nvidia/tensorflow:21.12-tf1-py3构建的容器
脚本:
FastNN的resnet脚本
启动命令
TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh
TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh
报错
2023-08-31 01:40:46.786721: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.397497: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.403631: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
2023-08-31 01:41:08.433142: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal: unhandled system error
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
[[{{node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater}}]]
Traceback (most recent call last):
File "resnet_dp.py", line 92, in <module>
run_model()
File "resnet_dp.py", line 67, in run_model
with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 581, in MonitoredTrainingSession
return MonitoredSession(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1010, in __init__
super(MonitoredSession, self).__init__(
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 319, in init
res = fn(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 639, in create_session
return self._get_session_manager().prepare_session(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 453, in run
assign_ops = _init_local_resources(self, fn)
File "/usr/local/lib/python3.8/dist-packages/epl/parallel/hooks.py", line 423, in _init_local_resources
fn(self, local_resources_init_op)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: From /job:worker/replica:0/task:1:
unhandled system error
[[node EPL_PARALLEL_STRATEGY/DATA_PARALLEL_GRADS_REDUCE_0_batch_allreduce_pool_group_0/3/EplNcclCommunicatorCreater (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
为什么两个worker的index都是0,2机应该一个是0,一个是1。