Reed
@zheng-xq what's the story on distributed performance across multiple workers in tf_cnn_benchmarks? As for tf.contrib.distribute, that is the recommended way of distributing across multiple GPUs or workers, as it is...
This is a bug in tf_cnn_benchmarks. When calling `tf.data.Dataset.map(...)` on a function, the function is wrapped in a Defun. Unfortunately, summaries created in Defuns are not part of the default...
/CC @rohan100jain can you implement shift_ratio with datasets?
I agree that in general we definitely want to reproduce SOTA training for several common architectures, especially resnet. Currently, we can train resnet50 v1 to about 74%, while the Slim models...
74% accuracy is good to hear. Feel free to send a pull request with the change. If you do, you might want to send the PR before testing it, so...
Can you clarify what this PR does? With --input_data_format, you move the transpose logic to another part of the code, but you still do the transpose.
This is somewhat expected, as we never optimized for Alexnet. Its per-step time is very small, so the overhead from all-reducing gradients takes up a greater percentage of the time (although I'm...
I'm a bit confused about what the issue is. Each worker applies its gradients to the parameter server's variables. Once each worker has done so, it reads back the updated parameter's...
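To make the flow described above concrete, here is a minimal plain-Python sketch (no TensorFlow; the `ParameterServer` class and the SGD learning rate are hypothetical, introduced only for illustration) of workers applying gradients to a parameter-server-owned variable and then reading the updated value back:

```python
# Hypothetical sketch of the parameter-server flow: each worker sends its
# gradient to the PS, the PS applies it to the shared variable, and the
# workers then read the updated value back.

class ParameterServer:
    def __init__(self, value, lr=0.1):
        self.value = value  # the PS-owned variable
        self.lr = lr        # assumed SGD learning rate

    def apply_gradient(self, grad):
        # Plain SGD update on the PS-owned variable.
        self.value -= self.lr * grad

    def read(self):
        # Workers fetch the current variable value from the PS.
        return self.value

ps = ParameterServer(value=1.0)
worker_grads = [0.5, 1.5]            # one gradient per worker
for g in worker_grads:
    ps.apply_gradient(g)             # each worker applies its gradient
updated = [ps.read() for _ in worker_grads]  # each worker reads back
print(updated)  # every worker sees the same updated value
```

Note this sketches a synchronous round: all gradients are applied before any worker reads back. With asynchronous updates, workers may read values that reflect only some of the other workers' gradients.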
`grad = tf.add_n(grads)` is a `sum` without an all-reduce. An all-reduce is simply an `add_n` except you get the output tensor on every device instead of just one device. Note...
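A minimal plain-Python sketch of that distinction (device buffers modeled as lists; the function names are hypothetical, not TensorFlow APIs): a reduce produces the sum on one device, while an all-reduce produces the same sum replicated onto every device.

```python
# Hypothetical sketch: reduce vs. all-reduce over per-device gradients.
# Each inner list is one device's gradient tensor.

def reduce_sum(per_device_grads):
    # add_n-style reduce: elementwise sum, producing ONE output tensor.
    return [sum(vals) for vals in zip(*per_device_grads)]

def all_reduce_sum(per_device_grads):
    # all-reduce: the same summed tensor, but one copy per device.
    total = reduce_sum(per_device_grads)
    return [list(total) for _ in per_device_grads]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 devices, 2-element grads
print(reduce_sum(grads))      # single summed tensor: [9.0, 12.0]
print(all_reduce_sum(grads))  # [9.0, 12.0] replicated on all 3 devices
```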
On a single device, `add_n` uses the least possible amount of memory, as it only allocates its output tensor. The reason `add_n` may use more memory than nccl all-reduce is that...