benchmarks
variable_update=parameter_server fails with XLA in distributed mode
Turning on XLA (--xla_compile=True) in distributed mode causes failure during initialization:
2019-01-08 01:10:13.755644: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:1 returned error: Unavailable: OS Error Additional GRPC error information: {"created":"@1546909813.755492338","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"OS Error","grpc_status":14}
I0108 01:10:13.867851 140680899847936 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error Additional GRPC error information: {"created":"@1546909813.755492338","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1036,"grpc_message":"OS Error","grpc_status":14}
Tested with both TF 1.12 and the 0102 nightly (and the matching versions of the benchmark script). Removing the XLA option allows the code to run. The distributed_replicated option also works.
Options used: --use_fp16 --xla_compile=True --data_format=NCHW --num_warmup_batches=0 --num_epochs=80 --data_dir=/scratch/imagenet --data_format=NCHW --print_training_accuracy=True --local_parameter_device=gpu --num_gpus=4 --batch_size=512 --model=resnet50 --variable_update=parameter_server --ps_hosts=172.21.1.13:50000,172.21.1.14:50000,172.21.1.130:50000,172.21.1.131:50000,172.21.1.128:50000,172.21.1.129:50000 --worker_hosts=172.21.1.13:50001,172.21.1.14:50001,172.21.1.130:50001,172.21.1.131:50001,172.21.1.128:50001,172.21.1.129:50001
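For readers less familiar with these flags: --ps_hosts and --worker_hosts together describe the cluster, with one parameter-server task and one worker task per listed address. A minimal sketch of how such comma-separated lists map onto a job-name-to-addresses dictionary (the shape that `tf.train.ClusterSpec` accepts) is shown here. `build_cluster_dict` is a hypothetical helper written for illustration, not a function from tf_cnn_benchmarks, and only the first two hosts from the report are used to keep it short.

```python
def build_cluster_dict(ps_hosts, worker_hosts):
    """Turn comma-separated host:port lists into a {job: [addresses]} dict.

    Hypothetical helper for illustration; not part of tf_cnn_benchmarks.
    """
    def split(hosts):
        # Drop empty entries so a trailing comma does not create a bogus task.
        return [h.strip() for h in hosts.split(",") if h.strip()]

    return {"ps": split(ps_hosts), "worker": split(worker_hosts)}


# First two hosts from the reported command line.
cluster = build_cluster_dict(
    "172.21.1.13:50000,172.21.1.14:50000",
    "172.21.1.13:50001,172.21.1.14:50001",
)
print(cluster)
```

The index of an address in each list is its task number, so `/job:ps/task:1` in the logs refers to the second entry of the `ps` list.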
When the data_dir argument is removed (i.e. when running with synthetic data), a different error appears:
I0109 13:36:52.264183 140677031384832 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Could not colocate node with its resource and reference inputs; devices /job:ps/task:4 and /job:ps/task:3 are not compatible. [[{{node tower_3/v/cluster}}]]
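The message above names two different parameter-server tasks for a single node (tower_3/v/cluster), i.e. the placer was asked to colocate that node with inputs living on different ps tasks. When comparing many such log lines, it can help to pull the device names out mechanically. The following is a small sketch using only the standard library; `conflicting_devices` is a hypothetical helper name, and the regular expression assumes the `/job:<name>/task:<n>` format seen in this log.

```python
import re

# Matches device fragments such as "/job:ps/task:4" (format taken from the log above).
DEVICE_RE = re.compile(r"/job:(\w+)/task:(\d+)")


def conflicting_devices(message):
    """Return the (job, task) pairs mentioned in a colocation error message."""
    return [(job, int(task)) for job, task in DEVICE_RE.findall(message)]


msg = (
    "Could not colocate node with its resource and reference inputs; "
    "devices /job:ps/task:4 and /job:ps/task:3 are not compatible. "
    "[[{{node tower_3/v/cluster}}]]"
)
print(conflicting_devices(msg))  # [('ps', 4), ('ps', 3)]
```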
Again, removing the XLA option makes the error go away.
/CC @gmagogsfm, any ideas what the issue could be?
--xla_compile is not tested with parameter_server, so I'm not surprised it is broken. If we cannot fix this, we should probably raise an error saying something like "--xla_compile is not compatible with --variable_update=parameter_server in distributed mode".
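A flag check of the kind suggested here could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual tf_cnn_benchmarks code: `check_xla_flags` and its parameters are hypothetical names, and how "distributed mode" is detected is left to the caller via the `is_distributed` argument.

```python
def check_xla_flags(xla_compile, variable_update, is_distributed):
    """Reject the flag combination reported as broken in this issue.

    Hypothetical validation helper; all names are assumptions for illustration.
    """
    if xla_compile and is_distributed and variable_update == "parameter_server":
        raise ValueError(
            "--xla_compile is not compatible with "
            "--variable_update=parameter_server in distributed mode"
        )


# Combinations reported as working pass silently.
check_xla_flags(True, "distributed_replicated", True)
check_xla_flags(False, "parameter_server", True)

# The failing combination is rejected up front instead of at session creation.
try:
    check_xla_flags(True, "parameter_server", True)
except ValueError as err:
    print(err)
```

Failing at flag-parsing time would replace the opaque gRPC and colocation errors with a message that points directly at the unsupported combination.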
Reed, thanks for bringing this to my attention.
@mrmhodak Regarding the gRPC error: can it be reproduced reliably? I am asking because XLA only really starts affecting your computation when you session.run() something. That said, it should not cause a gRPC error when creating sessions.
Could you also share a bit more about your setup? Is it a single host with 4 GPUs?
@gmagogsfm Sorry for the delay.
Retested with the latest script and TF 1.13.1, and the issue persists. My setup is 2 nodes with 4 V100 GPUs each. This is reproducible and happens every time. Replacing parameter_server with distributed_replicated makes everything work.
Errors are as follows:
Node 1:
2019-03-15 00:01:00.888408: E tensorflow/core/distributed_runtime/master.cc:315] CreateSession failed because worker /job:worker/replica:0/task:1 returned error: Unavailable: OS Error
I0315 00:01:01.000354 140209185412864 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
Initializing graph
Traceback (most recent call last):
  File "./git/benchmarks_03072019/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 72, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "./git/benchmarks_03072019/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 68, in main
    bench.run()
  File "/root/git/benchmarks_03072019/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1851, in run
    return self._benchmark_train()
  File "/root/git/benchmarks_03072019/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 2056, in _benchmark_train
    return self._benchmark_graph(result_to_benchmark, eval_build_results)
  File "/root/git/benchmarks_03072019/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 2256, in _benchmark_graph
    start_standard_services=start_standard_services) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1004, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 832, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 993, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 730, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 287, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
Node 2:
>, Could not colocate node with its resource and reference inputs; devices /job:ps/task:0 and /job:ps/task:1 are not compatible. [[{{node tower_3/v/cluster}}]]
Initializing graph
Traceback (most recent call last):
  File "./git/benchmarks_03072019/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 72, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "./git/benchmarks_03072019/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 68, in main
    bench.run()
  File "/root/git/benchmarks_03072019/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1851, in run
    return self._benchmark_train()
  File "/root/git/benchmarks_03072019/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 2056, in _benchmark_train
    return self._benchmark_graph(result_to_benchmark, eval_build_results)
  File "/root/git/benchmarks_03072019/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 2256, in _benchmark_graph
    start_standard_services=start_standard_services) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1004, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 832, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 993, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 738, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 408, in wait_for_session
    sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 489, in _try_run_local_init_op
    is_ready_for_local_init, msg = self._model_ready_for_local_init(sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 474, in _model_ready_for_local_init
    "Model not ready for local init")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 518, in _ready
    ready_value = sess.run(op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Could not colocate node with its resource and reference inputs; devices /job:ps/task:0 and /job:ps/task:1 are not compatible. [[{{node tower_3/v/cluster}}]]