Distributed training benchmark with 16 workers hangs with error: Too many pings
I started a distributed training run with 16 workers (4 GPUs per worker), and worker0 appeared to hang after printing "Running warm up". I checked all the workers; worker 10 printed the following:
```
Running warm up
I0801 07:07:38.406057 140097272407808 tf_logging.py:116] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: OS Error
I0801 07:07:38.539512 140097272407808 tf_logging.py:116] Graph was finalized.
2018-08-01 07:07:39.364600: I tensorflow/core/distributed_runtime/master_session.cc:1024] Start master session 80690da20064f1bd with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 56 gpu_options { } allow_soft_placement: true
I0801 07:07:39.578200 140097272407808 tf_logging.py:116] Running local_init_op.
I0801 08:26:57.623552 140097272407808 tf_logging.py:116] An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. Error: Too many pings
	 [[Node: group_deps_3_S110 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=8280192682451314183, tensor_name="edge_108_group_deps_3", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
I0801 08:26:57.648112 140097272407808 tf_logging.py:116] Graph was finalized.
2018-08-01 08:26:57.743641: I tensorflow/core/distributed_runtime/master_session.cc:1024] Start master session 5e49a92b777818ac with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 56 gpu_options { } allow_soft_placement: true
I0801 08:26:57.963944 140097272407808 tf_logging.py:116] Running local_init_op.
```
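For context, here is a minimal sketch of the cluster topology described above, using the TF 1.x API. The launch script isn't included in this report, so hostnames, ports, and the PS count are assumptions; the log only shows that /job:ps has at least three tasks (it references task:2).

```python
# Minimal sketch of the ps/worker topology described above (TF 1.x API).
# Hostnames, ports, and the number of PS tasks are assumptions; the error
# log only implies that /job:ps has at least three tasks (it names task:2).
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "worker": ["worker%d.example:2222" % i for i in range(16)],  # 16 workers, 4 GPUs each
    "ps": ["ps%d.example:2222" % i for i in range(3)],
})

# Each task starts a server; e.g. worker 10, whose log is shown above:
server = tf.train.Server(cluster, job_name="worker", task_index=10)
```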