
Results 116 comments of Reed

In the above example, each worker is on a separate machine, since they have different IP addresses (10.0.0.1 and 10.0.0.2). So, they will each have their own set of 8...

Yep, that is correct. On each machine, the worker will have access to both GPUs, and the parameter server will not, since it is launched with `CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks ...`.
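The mechanism above can be sketched in Python: setting `CUDA_VISIBLE_DEVICES` to an empty string in a child process's environment hides all GPUs from it. The helper name below is hypothetical, not part of tf_cnn_benchmarks.

```python
import os
import subprocess

def launch_ps_without_gpus(cmd):
    """Launch a parameter-server process that sees no GPUs.

    An empty CUDA_VISIBLE_DEVICES hides every GPU from CUDA code
    started with that environment; workers launched without the
    override keep access to all GPUs.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
    return subprocess.Popen(cmd, env=env)

# The environment the child would receive (no process is launched here):
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
print(env["CUDA_VISIBLE_DEVICES"])  # empty string -> no GPUs visible
```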

You should be able to launch the processes in any order. @DjangoPeng what's the commands you use that sometimes cause an `uninitialized error`?

I'm a bit confused what you're asking. Can you clarify please?

/cc @gmagogsfm, any ideas what the issue could be? `--xla_compile` is not tested with `parameter_server`, so I'm not surprised it is broken. We should probably raise an error message saying...

tf_cnn_benchmarks is correct here. The effective batch size of a model is the batch size per GPU times the number of GPUs. @nealwu, that other model seems to have an...
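The arithmetic above is simply:

```python
def effective_batch_size(batch_size_per_gpu, num_gpus):
    """Effective (global) batch size: per-GPU batch size times GPU count."""
    return batch_size_per_gpu * num_gpus

# e.g. a per-GPU batch size of 64 on 8 GPUs:
print(effective_batch_size(64, 8))  # -> 512
```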

I'm not an ML expert so it's hard for me to say. For some models, to train with double the batch size, one should double the learning rate. In such...
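The heuristic mentioned above (often called the linear scaling rule) can be sketched as follows; as the comment says, it works for some models and not others, so treat it as a starting point rather than a rule.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: scale the learning rate with the batch size.

    Doubling the batch size doubles the learning rate. This is a
    heuristic only; it does not hold for every model.
    """
    return base_lr * (batch_size / base_batch_size)

# Doubling the batch size from 256 to 512 doubles the learning rate:
print(scaled_learning_rate(0.1, 256, 512))  # -> 0.2
```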

The benchmark is now unmaintained and untested. I do not recommend using it anymore. I think it is still functionally correct, and I doubt it will perform worse than it...

I miss you too @tfboyd! (this comment is unrelated to this issue BTW @kessel)