Error running in replicated mode
System information:
- OS Platform: Ubuntu 16.04
- TensorFlow: installed from source
- Python version: 2.7.5
- Run with the command:
python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --local_parameter_device gpu
I got the following error:
1 images/sec: 314.7 +/- 0.0 (jitter = 0.0) 9.717
2 images/sec: 314.9 +/- 0.2 (jitter = 0.3) 9.874
3 images/sec: 315.5 +/- 0.5 (jitter = 0.7) 9.294
4 images/sec: 315.2 +/- 0.5 (jitter = 0.7) 9.784
5 images/sec: 315.1 +/- 0.4 (jitter = 0.6) 8.846
6 images/sec: 315.0 +/- 0.3 (jitter = 0.5) 8.822
7 images/sec: 315.1 +/- 0.3 (jitter = 0.6) 8.449
8 images/sec: 315.1 +/- 0.3 (jitter = 0.5) 8.233
9 images/sec: 314.9 +/- 0.3 (jitter = 0.6) 8.213
10 images/sec: 314.9 +/- 0.2 (jitter = 0.5) 8.291
11 images/sec: 315.0 +/- 0.2 (jitter = 0.5) 8.054
12 images/sec: 315.1 +/- 0.2 (jitter = 0.8) 8.295
13 images/sec: 315.2 +/- 0.2 (jitter = 0.9) 8.510
14 images/sec: 315.2 +/- 0.2 (jitter = 0.9) 8.074
15 images/sec: 315.3 +/- 0.2 (jitter = 0.8) 8.225
16 images/sec: 315.4 +/- 0.2 (jitter = 0.9) 8.041
17 images/sec: 315.4 +/- 0.2 (jitter = 0.9) 8.122
18 images/sec: 315.2 +/- 0.2 (jitter = 0.9) 8.068
19 images/sec: 315.2 +/- 0.2 (jitter = 0.8) 8.036
20 images/sec: 315.2 +/- 0.2 (jitter = 0.9) 8.120
21 images/sec: 315.3 +/- 0.2 (jitter = 1.0) 8.074
22 images/sec: 315.3 +/- 0.2 (jitter = 1.2) 8.101
23 images/sec: 315.4 +/- 0.2 (jitter = 1.2) 8.182
24 images/sec: 315.4 +/- 0.2 (jitter = 1.2) 8.302
25 images/sec: 315.5 +/- 0.2 (jitter = 1.3) 7.991
26 images/sec: 315.5 +/- 0.2 (jitter = 1.2) 8.184
27 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 8.307
28 images/sec: 315.4 +/- 0.2 (jitter = 1.2) 8.022
29 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 8.061
30 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 7.962
31 images/sec: 315.4 +/- 0.2 (jitter = 1.2) 8.218
32 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 7.944
33 images/sec: 315.4 +/- 0.2 (jitter = 1.3) 8.070
34 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 7.977
35 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 7.940
36 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 7.910
37 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 6808459.000
38 images/sec: 315.3 +/- 0.2 (jitter = 1.3) 9828381.000
39 images/sec: 315.2 +/- 0.2 (jitter = 1.3) 9444037.000
40 images/sec: 315.2 +/- 0.2 (jitter = 1.4) 11600396.000
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 47, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "tf_cnn_benchmarks.py", line 43, in main
bench.run()
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1097, in run
return self._benchmark_cnn()
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1332, in _benchmark_cnn
fetch_summary)
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 584, in benchmark_one_step
results = sess.run(fetches, options=run_options, run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Retval[0] does not have value
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 467, in run
global_step_val, = self.sess.run([self.global_step_op])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1053, in _run
raise RuntimeError('Attempted to use a closed Session.')
RuntimeError: Attempted to use a closed Session.
- Run with the command:
python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --all_reduce_spec nccl --local_parameter_device gpu
I got the following error:
Step Img/sec loss
1 images/sec: 763.6 +/- 0.0 (jitter = 0.0) nan
2 images/sec: 761.4 +/- 1.6 (jitter = 3.3) nan
3 images/sec: 757.3 +/- 3.5 (jitter = 6.6) nan
4 images/sec: 755.0 +/- 3.3 (jitter = 8.1) nan
5 images/sec: 756.0 +/- 2.8 (jitter = 6.6) nan
6 images/sec: 756.6 +/- 2.4 (jitter = 3.5) nan
7 images/sec: 755.3 +/- 2.4 (jitter = 6.6) nan
8 images/sec: 756.8 +/- 2.5 (jitter = 9.1) nan
9 images/sec: 756.6 +/- 2.2 (jitter = 6.6) nan
10 images/sec: 756.5 +/- 2.0 (jitter = 6.6) nan
11 images/sec: 757.3 +/- 2.0 (jitter = 6.6) nan
12 images/sec: 757.8 +/- 1.9 (jitter = 6.2) nan
2018-01-05 19:50:07.566284: E tensorflow/stream_executor/cuda/cuda_dnn.cc:2456] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
2018-01-05 19:50:07.566345: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566369: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566374: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566378: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566382: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566387: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566391: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566395: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566401: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566409: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566430: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566437: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566442: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566448: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566453: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566459: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
2018-01-05 19:50:07.566464: E tensorflow/stream_executor/event.cc:33] error destroying CUDA event in context 0xdab49e0: CUDA_ERROR_ILLEGAL_ADDRESS
- Run with the command:
python tf_cnn_benchmarks.py --num_batches 100 --display_every 1 --num_gus 8 --model resnet50 --batch_size 64 --data_name imagenet --data_dir /root/imagenet_data --xla True --variable_update replicated --all_reduce_spec nccl --local_parameter_device cpu
It seems that only one GPU is used for computation, and the CPU usage is only 31.2%.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   27C    P0    30W / 250W | 15621MiB / 16276MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   32C    P0    32W / 250W | 15621MiB / 16276MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   33C    P0    31W / 250W | 15621MiB / 16276MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  On   | 00000000:00:0C.0 Off |                    0 |
| N/A   34C    P0    48W / 250W | 15621MiB / 16276MiB  |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-PCIE...  On   | 00000000:00:0D.0 Off |                    0 |
| N/A   29C    P0    31W / 250W | 15621MiB / 16276MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-PCIE...  On   | 00000000:00:0E.0 Off |                    0 |
| N/A   29C    P0    30W / 250W | 15621MiB / 16276MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-PCIE...  On   | 00000000:00:0F.0 Off |                    0 |
| N/A   30C    P0    31W / 250W | 15621MiB / 16276MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-PCIE...  On   | 00000000:00:10.0 Off |                    0 |
| N/A   27C    P0    30W / 250W | 15621MiB / 16276MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     51146    C   python                                       15611MiB |
|    1     51146    C   python                                       15611MiB |
|    2     51146    C   python                                       15611MiB |
|    3     51146    C   python                                       15611MiB |
|    4     51146    C   python                                       15611MiB |
|    5     51146    C   python                                       15611MiB |
|    6     51146    C   python                                       15611MiB |
|    7     51146    C   python                                       15611MiB |
+-----------------------------------------------------------------------------+
How do I use replicated mode to update variables?
Note that you misspelled the --num_gpus command-line argument as --num_gus, which causes only 1 GPU to be used. I'll submit a fix that makes invalid arguments throw an error.
I cannot reproduce the other errors, either with --num_gus or --num_gpus. Can you try to find the minimal set of command-line arguments needed to reproduce the errors, and see which arguments cause them to occur?
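The silent-ignore behavior behind the --num_gus typo can be illustrated with Python's argparse (used here as a stand-in for TensorFlow's flag parsing, not the actual implementation): a lenient parser sets the misspelled flag aside and keeps the default, while a strict parser rejects it immediately.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--num_gpus', type=int, default=1)

# Lenient parsing: the misspelled --num_gus is set aside as "unknown",
# so num_gpus silently keeps its default of 1.
args, unknown = parser.parse_known_args(['--num_gus', '8'])
print(args.num_gpus)   # 1 -- the typo was ignored
print(unknown)         # ['--num_gus', '8']

# Strict parsing: the same typo is rejected with an error instead.
try:
    parser.parse_args(['--num_gus', '8'])
except SystemExit:
    print('unrecognized argument: --num_gus')
```

Erroring out on unknown flags, as the strict variant does, is what the proposed fix would provide.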
@reedwm Thank you for your help. That was a mistake when I wrote the question; I used --num_gpus in my test.
I have another question: can the distributed_replicated and distributed_all_reduce modes be used on a single machine? How do I use them? What are worker_hosts and ps_hosts?
Thanks very much!
@reedwm I removed the --xla True command-line argument, and it works now. Thanks!
distributed_replicated and distributed_all_reduce should be used on multiple machines. They technically can be run on a single machine, by starting multiple processes and setting --gpu_memory_frac_for_testing, but the only reason to do so is for testing and debugging, and it will be slower than replicated. You can set --worker_hosts to something like localhost:1234,localhost:1235 and similarly for --ps_hosts, if running on a single machine.
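Concretely, the single-machine setup described above could be scripted as follows. This is only a sketch: the --job_name and --task_index flag names are assumed from tf_cnn_benchmarks, and the ports (1234/1235) are the arbitrary examples from the comment above.

```python
# Sketch of a single-machine distributed_replicated run: one ps and one
# worker process, both on localhost. Flag names such as --job_name and
# --task_index are assumed from tf_cnn_benchmarks; ports are arbitrary.
ps_hosts = 'localhost:1234'
worker_hosts = 'localhost:1235'

base = ('python tf_cnn_benchmarks.py --variable_update distributed_replicated'
        ' --ps_hosts {} --worker_hosts {}'.format(ps_hosts, worker_hosts))

# Two commands, run in two separate shells on the same machine.
commands = [
    base + ' --job_name ps --task_index 0',
    base + ' --job_name worker --task_index 0',
]
for cmd in commands:
    print(cmd)
```

As noted above, such a single-machine run only makes sense for testing and debugging (with --gpu_memory_frac_for_testing set), and will be slower than plain replicated.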
I'm not sure why setting --xla True causes an error. If you ever find any more information on the issue, please update this issue.
Hi @reedwm I'd like to ask you a few questions, if you don't mind.
- From #65, I think the recommended cluster specification when using distributed_replicated mode on multiple machines is to run equal numbers of parameter servers and workers: each machine runs one ps and one worker. If I have 4 machines, there will be 4 ps and 4 workers, so I need to run 8 commands (2 commands per host). Am I right?
- If I use distributed_all_reduce mode, is it also best to run a single controller and a single worker per machine?
- If I run distributed TensorFlow on a large cluster of machines, such as 32 machines or more, does that mean I need to run 64 commands? What should I do?
- The guide here recommends doing so. I am not sure myself what is most performant. @tfboyd, @zheng-xq, any thoughts?
- In distributed_all_reduce, there is only a single controller host.
- Yes, you would need 64 commands if each machine had a worker and a ps. @tfboyd, I believe there is a script on GitHub that runs tf_cnn_benchmarks on a cluster. Do you have a link?
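The 64-command count for 32 machines (one ps and one worker per machine) can be sketched with a small launch-command generator. This is illustrative only: the hostnames and ports are made up, and the flag names are assumed from tf_cnn_benchmarks.

```python
def launch_commands(hosts, ps_port=2222, worker_port=2223):
    """Build one ps and one worker command per machine.

    A sketch only: flag names (--job_name, --task_index, etc.) are
    assumed from tf_cnn_benchmarks, and the ports are arbitrary.
    """
    ps_hosts = ','.join('{}:{}'.format(h, ps_port) for h in hosts)
    worker_hosts = ','.join('{}:{}'.format(h, worker_port) for h in hosts)
    base = ('python tf_cnn_benchmarks.py'
            ' --variable_update distributed_replicated'
            ' --ps_hosts {} --worker_hosts {}'.format(ps_hosts, worker_hosts))
    cmds = []
    for i, _ in enumerate(hosts):
        # One ps and one worker command per host, indexed by task.
        cmds.append('{} --job_name ps --task_index {}'.format(base, i))
        cmds.append('{} --job_name worker --task_index {}'.format(base, i))
    return cmds

hosts = ['machine{:02d}'.format(i) for i in range(32)]  # hypothetical names
cmds = launch_commands(hosts)
print(len(cmds))  # 64 commands for 32 machines
```

In practice, a cluster-launch script would run each generated command on its corresponding host rather than typing all 64 by hand.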