
Error running in replicated mode

Agoniii opened this issue 8 years ago • 8 comments

System information:
OS Platform: Ubuntu 16.04
TensorFlow: installed from source
Python version: 2.7.5

  1. Run with the command: python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --local_parameter_device gpu

I got the following Error:

Step	Img/sec	loss
1	images/sec: 314.7 +/- 0.0 (jitter = 0.0)	9.717
2	images/sec: 314.9 +/- 0.2 (jitter = 0.3)	9.874
3	images/sec: 315.5 +/- 0.5 (jitter = 0.7)	9.294
4	images/sec: 315.2 +/- 0.5 (jitter = 0.7)	9.784
5	images/sec: 315.1 +/- 0.4 (jitter = 0.6)	8.846
6	images/sec: 315.0 +/- 0.3 (jitter = 0.5)	8.822
7	images/sec: 315.1 +/- 0.3 (jitter = 0.6)	8.449
8	images/sec: 315.1 +/- 0.3 (jitter = 0.5)	8.233
9	images/sec: 314.9 +/- 0.3 (jitter = 0.6)	8.213
10	images/sec: 314.9 +/- 0.2 (jitter = 0.5)	8.291
11	images/sec: 315.0 +/- 0.2 (jitter = 0.5)	8.054
12	images/sec: 315.1 +/- 0.2 (jitter = 0.8)	8.295
13	images/sec: 315.2 +/- 0.2 (jitter = 0.9)	8.510
14	images/sec: 315.2 +/- 0.2 (jitter = 0.9)	8.074
15	images/sec: 315.3 +/- 0.2 (jitter = 0.8)	8.225
16	images/sec: 315.4 +/- 0.2 (jitter = 0.9)	8.041
17	images/sec: 315.4 +/- 0.2 (jitter = 0.9)	8.122
18	images/sec: 315.2 +/- 0.2 (jitter = 0.9)	8.068
19	images/sec: 315.2 +/- 0.2 (jitter = 0.8)	8.036
20	images/sec: 315.2 +/- 0.2 (jitter = 0.9)	8.120
21	images/sec: 315.3 +/- 0.2 (jitter = 1.0)	8.074
22	images/sec: 315.3 +/- 0.2 (jitter = 1.2)	8.101
23	images/sec: 315.4 +/- 0.2 (jitter = 1.2)	8.182
24	images/sec: 315.4 +/- 0.2 (jitter = 1.2)	8.302
25	images/sec: 315.5 +/- 0.2 (jitter = 1.3)	7.991
26	images/sec: 315.5 +/- 0.2 (jitter = 1.2)	8.184
27	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	8.307
28	images/sec: 315.4 +/- 0.2 (jitter = 1.2)	8.022
29	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	8.061
30	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	7.962
31	images/sec: 315.4 +/- 0.2 (jitter = 1.2)	8.218
32	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	7.944
33	images/sec: 315.4 +/- 0.2 (jitter = 1.3)	8.070
34	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	7.977
35	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	7.940
36	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	7.910
37	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	6808459.000
38	images/sec: 315.3 +/- 0.2 (jitter = 1.3)	9828381.000
39	images/sec: 315.2 +/- 0.2 (jitter = 1.3)	9444037.000
40	images/sec: 315.2 +/- 0.2 (jitter = 1.4)	11600396.000
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 47, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "tf_cnn_benchmarks.py", line 43, in main
    bench.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1097, in run
    return self._benchmark_cnn()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1332, in _benchmark_cnn
    fetch_summary)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 584, in benchmark_one_step
    results = sess.run(fetches, options=run_options, run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Retval[0] does not have value
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 467, in run
    global_step_val, = self.sess.run([self.global_step_op])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1053, in _run
    raise RuntimeError('Attempted to use a closed Session.')
RuntimeError: Attempted to use a closed Session.
  2. Run with the command: python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --all_reduce_spec nccl --local_parameter_device gpu

I got the following Error:

Step	Img/sec	loss
1	images/sec: 763.6 +/- 0.0 (jitter = 0.0)	nan
2	images/sec: 761.4 +/- 1.6 (jitter = 3.3)	nan
3	images/sec: 757.3 +/- 3.5 (jitter = 6.6)	nan
4	images/sec: 755.0 +/- 3.3 (jitter = 8.1)	nan
5	images/sec: 756.0 +/- 2.8 (jitter = 6.6)	nan
6	images/sec: 756.6 +/- 2.4 (jitter = 3.5)	nan
7	images/sec: 755.3 +/- 2.4 (jitter = 6.6)	nan
8	images/sec: 756.8 +/- 2.5 (jitter = 9.1)	nan
9	images/sec: 756.6 +/- 2.2 (jitter = 6.6)	nan
10	images/sec: 756.5 +/- 2.0 (jitter = 6.6)	nan
11	images/sec: 757.3 +/- 2.0 (jitter = 6.6)	nan
12	images/sec: 757.8 +/- 1.9 (jitter = 6.2)	nan
2018-01-05 19:50:07.566284: E tensorflow/stream_executor/cuda/cuda_dnn.cc:2456] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
2018-01-05 19:50:07.566345: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
[the same "error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS" line repeats 16 more times]

Agoniii avatar Jan 05 '18 12:01 Agoniii

  3. Run with the command: python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --all_reduce_spec nccl --local_parameter_device cpu

It seems that only one GPU is being used for computation, and CPU usage is only 31.2%.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   27C    P0    30W / 250W |  15621MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |  15621MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   33C    P0    31W / 250W |  15621MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  On   | 00000000:00:0C.0 Off |                    0 |
| N/A   34C    P0    48W / 250W |  15621MiB / 16276MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-PCIE...  On   | 00000000:00:0D.0 Off |                    0 |
| N/A   29C    P0    31W / 250W |  15621MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-PCIE...  On   | 00000000:00:0E.0 Off |                    0 |
| N/A   29C    P0    30W / 250W |  15621MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-PCIE...  On   | 00000000:00:0F.0 Off |                    0 |
| N/A   30C    P0    31W / 250W |  15621MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-PCIE...  On   | 00000000:00:10.0 Off |                    0 |
| N/A   27C    P0    30W / 250W |  15621MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     51146     C    python                                    15611MiB  |
|    1     51146     C    python                                    15611MiB  |
|    2     51146     C    python                                    15611MiB  |
|    3     51146     C    python                                    15611MiB  |
|    4     51146     C    python                                    15611MiB  |
|    5     51146     C    python                                    15611MiB  |
|    6     51146     C    python                                    15611MiB  |
|    7     51146     C    python                                    15611MiB  |
+-----------------------------------------------------------------------------+

Agoniii avatar Jan 05 '18 12:01 Agoniii

How do I use replicated to update variables?

Agoniii avatar Jan 05 '18 12:01 Agoniii

Note you misspelled the --num_gpus command line argument as --num_gus, which will cause only 1 GPU to be used. I'll submit a fix that causes invalid arguments to throw an error.
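For reference, here is the corrected invocation from the first report, with the flag spelled --num_gpus and everything else left unchanged:

```
python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gpus 8 \
    --model resnet50 --batch_size 64 --data_name imagenet \
    --data_dir /root/imagenet_data --xla True --variable_update replicated \
    --local_parameter_device gpu
```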

I cannot reproduce the other errors, either with --num_gus or --num_gpus. Can you try to find the minimal set of command line arguments needed to reproduce the errors, to see which arguments cause them?

reedwm avatar Jan 05 '18 18:01 reedwm

@reedwm Thank you for your help. That was a mistake when I wrote up the question; I used --num_gpus in my actual test. I have another question: can the distributed_replicated and distributed_all_reduce modes be used on a single machine? If so, how? And what should worker_hosts and ps_hosts be set to? Thanks very much!

Agoniii avatar Jan 06 '18 02:01 Agoniii

@reedwm I removed the --xla True command line argument, and it works now. Thanks!

Agoniii avatar Jan 06 '18 02:01 Agoniii

distributed_replicated and distributed_all_reduce should be used on multiple machines. They technically can be run on a single machine, by starting multiple processes and setting --gpu_memory_frac_for_testing, but the only reason to do so is for testing and debugging, and it will be slower than replicated. You can set --worker_hosts to something like localhost:1234,localhost:1235 and similarly for --ps_hosts, if running on a single machine.
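To make that concrete, here is a rough sketch of a minimal two-process test on one machine. The port numbers and the 0.4 memory fraction are arbitrary, and it assumes the script's standard --job_name/--task_index flags for distributed runs; run each command in its own shell:

```
# Parameter server (shell 1), hypothetical ports:
python tf_cnn_benchmarks.py --job_name=ps --task_index=0 \
    --ps_hosts=localhost:1234 --worker_hosts=localhost:2234 \
    --variable_update distributed_replicated

# Worker (shell 2), capped to a fraction of GPU memory so both
# processes can share the same GPUs:
python tf_cnn_benchmarks.py --job_name=worker --task_index=0 \
    --ps_hosts=localhost:1234 --worker_hosts=localhost:2234 \
    --variable_update distributed_replicated \
    --model resnet50 --batch_size 64 \
    --gpu_memory_frac_for_testing 0.4
```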

I'm not sure why setting --xla True causes an error. If you ever find any more information on the issue, please update this issue.

reedwm avatar Jan 09 '18 01:01 reedwm

Hi @reedwm I'd like to ask you a few questions, if you don't mind.

  1. From #65, I understand the recommended cluster specification is to run equal numbers of parameter servers and workers when using distributed_replicated mode on multiple machines, with each machine running one ps and one worker. If I have 4 machines, there will be 4 ps and 4 workers, so I need to run 8 commands (2 commands per host). Am I right?
  2. When using distributed_all_reduce mode, is it also better to run a single controller and a single worker per machine?
  3. If I run distributed TensorFlow on a large cluster of 32 machines or more, does that mean I need to run 64 commands? What should I do?

Agoniii avatar Jan 16 '18 05:01 Agoniii

  1. The guide here recommends doing so. I am not sure myself which is the most performant. @tfboyd, @zheng-xq, any thoughts?
  2. In distributed_all_reduce, there is only a single controller host.
  3. Yes, you would need 64 commands if each machine had a worker and a ps. @tfboyd, I believe there is a script on GitHub that runs tf_cnn_benchmarks on a cluster. Do you have a link?
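For illustration, here is one way those per-host launches could be scripted. The host names and ports are made up, and in practice the script mentioned above or your cluster manager would generate these:

```
# Hypothetical 4-machine example: each host runs one ps and one worker,
# so 4 machines need 8 commands (and 32 machines would need 64).
PS_HOSTS=host0:2222,host1:2222,host2:2222,host3:2222
WORKER_HOSTS=host0:2223,host1:2223,host2:2223,host3:2223

# On machine i (for i in 0..3), start the parameter server in the
# background, then the worker:
python tf_cnn_benchmarks.py --job_name=ps --task_index=$i \
    --ps_hosts=$PS_HOSTS --worker_hosts=$WORKER_HOSTS \
    --variable_update distributed_replicated &
python tf_cnn_benchmarks.py --job_name=worker --task_index=$i \
    --ps_hosts=$PS_HOSTS --worker_hosts=$WORKER_HOSTS \
    --num_gpus 8 --model resnet50 --batch_size 64 \
    --variable_update distributed_replicated
```

With 32 machines the pattern is the same, just with 32 entries in each host list and task indices 0..31.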

reedwm avatar Jan 16 '18 18:01 reedwm