
Running distributed_all_reduce in only CPU mode

Open ghost opened this issue 8 years ago • 11 comments

I am running distributed TensorFlow with the gRPC protocol on CPUs only. I enabled the `distributed_all_reduce` variable-update mode with `all_reduce_spec = xring`.

I am wondering whether this mode is supposed to work for CPU-only distributed runs. If so, does it need a separate controller process in addition to the workers?

I am getting errors such as: `Unknown device: /job:worker/replica:0/task:2/device:CPU:0 all devices: CPU:0, /job:worker/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/device:CPU:0`
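For context, the kind of invocation that triggers this is sketched below. This is an illustrative reconstruction, not the exact command: flag spellings follow tf_cnn_benchmarks conventions, and the hostnames and ports are placeholders.

```shell
# Hypothetical CPU-only worker launch with xring all-reduce.
# Hostnames/ports are placeholders; one such process runs per task.
python tf_cnn_benchmarks.py \
  --variable_update=distributed_all_reduce \
  --all_reduce_spec=xring \
  --job_name=worker --task_index=0 \
  --worker_hosts=host0:5000,host1:5000,host2:5000 \
  --device=cpu --data_format=NHWC
```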

ghost avatar Oct 13 '17 00:10 ghost

I believe the tf_cnn_benchmarks suite in general requires GPUs; graph construction expects at least one GPU per worker.

poxvoculi avatar Nov 02 '17 18:11 poxvoculi

I believe the way to run on CPUs is to set `num_gpus=1` and set the running device to `cpu`; the `parameter_server` variable-update algorithm then works perfectly fine. I have run lots of tests on CPUs this way. It is the new `distributed_all_reduce` mode that fails at execution time.
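The working CPU configuration described above would look something like this. A sketch only: flag names are assumed from tf_cnn_benchmarks, and the ps/worker hosts are placeholders.

```shell
# Sketch of a CPU-only parameter_server run. num_gpus=1 satisfies the
# graph-construction check while --device=cpu keeps compute on CPU.
python tf_cnn_benchmarks.py \
  --variable_update=parameter_server \
  --num_gpus=1 --device=cpu --local_parameter_device=cpu \
  --data_format=NHWC \
  --job_name=worker --task_index=0 \
  --ps_hosts=ps0:5000 \
  --worker_hosts=wk0:5001,wk1:5001
```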

ghost avatar Nov 02 '17 18:11 ghost

It has not been tested running on CPU only. I think making it work may take significant effort, but if you want to try, look at tensorflow/contrib/all_reduce/python/all_reduce.py. The assumption that it runs on GPUs is somewhat baked in, but you may be able to make it work without much change.

poxvoculi avatar Nov 02 '17 19:11 poxvoculi

I will try it. Can you explain what "controller_host" is? Is it supposed to be a different node from the workers?

ghost avatar Nov 02 '17 19:11 ghost

See #64

poxvoculi avatar Nov 02 '17 19:11 poxvoculi

Okay. Thank you. I will have time to take a look at it again in a few days.

ghost avatar Nov 02 '17 19:11 ghost

@amathuri Were you able to get distributed TF working for CPU only? I'd love to get your insight.

Thanks. -Tony

mas-dse-greina avatar Nov 29 '17 21:11 mas-dse-greina

Yes, I have. I tried it with the `parameter_server` variable update and `num_gpus=1`. To get good performance on CPUs, TensorFlow needs to be built with MKL as the backend, and tuning of `num_intra_threads`, `num_inter_threads`, and the `OMP_NUM_THREADS` environment variable is also required. You may be able to install an MKL TensorFlow wheel from here: https://software.intel.com/en-us/articles/intel-optimized-tensorflow-wheel-now-available
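A sketch of the tuning described above. The thread counts and KMP settings are illustrative values only, not recommendations from this thread; they are typically matched to the machine's physical core count.

```shell
# Illustrative MKL/OpenMP tuning for a CPU run (values are examples;
# tune per machine, e.g. intra-op threads ~ physical cores per socket).
export OMP_NUM_THREADS=16
export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,compact,1,0

python tf_cnn_benchmarks.py \
  --device=cpu --num_gpus=1 --data_format=NHWC \
  --num_intra_threads=16 --num_inter_threads=2 \
  --variable_update=parameter_server
```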

What type of insights are you looking for?

ghost avatar Nov 29 '17 21:11 ghost

Excellent. We're running a 4-node CPU cluster and can't seem to get it to scale properly. Could you email me at [email protected]? Thanks.

mas-dse-greina avatar Nov 29 '17 22:11 mas-dse-greina

I did. Thank you.

ghost avatar Nov 29 '17 23:11 ghost

I found that running the script without any parameters also shows CPU performance.
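That is, a bare single-machine invocation (assuming a CPU-only TensorFlow build, so that the benchmark falls back to CPU and reports images/sec):

```shell
# Single-process run with all defaults; on a CPU-only build this
# reports CPU throughput without any distributed setup.
python tf_cnn_benchmarks.py
```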

Maschinist-LZY avatar Mar 26 '22 17:03 Maschinist-LZY