
Results 116 comments of Reed

In the above example, each worker is on a separate machine, since they have different IP addresses (10.0.0.1 and 10.0.0.2). So, they will each have their own set of 8...

Yep, that is correct. On each machine, the worker will have access to both GPUs, and the parameter server will not, since it is launched with `CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks ...`.
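The mechanism above can be sketched in Python: setting `CUDA_VISIBLE_DEVICES` to an empty string in a child process's environment hides all GPUs from it. The helper name below is hypothetical, not part of tf_cnn_benchmarks.

```python
import os
import subprocess

def launch_ps_without_gpus(cmd):
    """Launch a parameter-server process that sees no GPUs.

    An empty CUDA_VISIBLE_DEVICES hides every GPU from CUDA code
    started with that environment; workers launched without the
    override keep access to all GPUs.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
    return subprocess.Popen(cmd, env=env)

# The environment the child would receive (no process is launched here):
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
print(env["CUDA_VISIBLE_DEVICES"])  # empty string -> no GPUs visible
```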

You should be able to launch the processes in any order. @DjangoPeng what's the commands you use that sometimes cause an `uninitialized error`?

I'm a bit confused what you're asking. Can you clarify please?

/cc @gmagogsfm, any ideas what the issue could be? `--xla_compile` is not tested with `parameter_server`, so I'm not surprised it is broken. We should probably raise an error message saying...

tf_cnn_benchmarks is correct here. The effective batch size of a model is the batch size per GPU times the number of GPUs. @nealwu, that other model seems to have an...
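The arithmetic above is simply:

```python
def effective_batch_size(batch_size_per_gpu, num_gpus):
    """Effective (global) batch size: per-GPU batch size times GPU count."""
    return batch_size_per_gpu * num_gpus

# e.g. a per-GPU batch size of 64 on 8 GPUs:
print(effective_batch_size(64, 8))  # -> 512
```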

I'm not an ML expert so it's hard for me to say. For some models, to train with double the batch size, one should double the learning rate. In such...
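The heuristic mentioned above (often called the linear scaling rule) can be sketched as follows; as the comment says, it works for some models and not others, so treat it as a starting point rather than a rule.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: scale the learning rate with the batch size.

    Doubling the batch size doubles the learning rate. This is a
    heuristic only; it does not hold for every model.
    """
    return base_lr * (batch_size / base_batch_size)

# Doubling the batch size from 256 to 512 doubles the learning rate:
print(scaled_learning_rate(0.1, 256, 512))  # -> 0.2
```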

The benchmark is now unmaintained and untested. I do not recommend using it anymore. I think it is still functionally correct, and I doubt it will perform worse than it...

I miss you too @tfboyd! (this comment is unrelated to this issue BTW @kessel)