Reed
@zheng-xq what's the story on distributed performance across multiple workers in tf_cnn_benchmarks? As for tf.contrib.distribute, that is the recommended way of distributing across multiple GPUs or workers, as it is...
This is a bug in tf_cnn_benchmarks. When calling `tf.data.Dataset.map(...)` on a function, the function is wrapped in a Defun. Unfortunately, summaries created in Defuns are not part of the default...
/CC @rohan100jain can you implement shift_ratio with datasets?
I agree that in general we definitely want to reproduce SOTA training for several common architectures, especially resnet. Currently, we can train resnet50 v1 to about 74%, while the Slim models...
74% accuracy is good to hear. Feel free to send a pull request with the change. If you do, you might want to send the PR before testing it, so...
Can you clarify what this PR does? With --input_data_format, you move the transpose logic to another part of the code, but you still do the transpose.
This is somewhat expected, as we never optimized for Alexnet. Its per-step time is very small, so the overhead from all-reducing gradients takes up a greater percentage of the time (although I'm...
I'm a bit confused about what the issue is. Each worker applies its gradients to the parameter server's variables. Once each worker has done so, it reads back the updated parameter's...
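To make the flow described above concrete, here is a minimal plain-Python sketch (no TensorFlow; the `ParameterServer` class and the SGD learning rate are hypothetical, introduced only for illustration) of workers applying gradients to a parameter-server-owned variable and then reading the updated value back:

```python
# Hypothetical sketch of the parameter-server flow: each worker sends its
# gradient to the PS, the PS applies it to the shared variable, and the
# workers then read the updated value back.

class ParameterServer:
    def __init__(self, value, lr=0.1):
        self.value = value  # the PS-owned variable
        self.lr = lr        # assumed SGD learning rate

    def apply_gradient(self, grad):
        # Plain SGD update on the PS-owned variable.
        self.value -= self.lr * grad

    def read(self):
        # Workers fetch the current variable value from the PS.
        return self.value

ps = ParameterServer(value=1.0)
worker_grads = [0.5, 1.5]            # one gradient per worker
for g in worker_grads:
    ps.apply_gradient(g)             # each worker applies its gradient
updated = [ps.read() for _ in worker_grads]  # each worker reads back
print(updated)  # every worker sees the same updated value
```

Note this sketches a synchronous round: all gradients are applied before any worker reads back. With asynchronous updates, workers may read values that reflect only some of the other workers' gradients.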
`grad = tf.add_n(grads)` is a `sum` without an all-reduce. An all-reduce is simply an `add_n` except you get the output tensor on every device instead of just one device. Note...
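A minimal plain-Python sketch of that distinction (device buffers modeled as lists; the function names are hypothetical, not TensorFlow APIs): a reduce produces the sum on one device, while an all-reduce produces the same sum replicated onto every device.

```python
# Hypothetical sketch: reduce vs. all-reduce over per-device gradients.
# Each inner list is one device's gradient tensor.

def reduce_sum(per_device_grads):
    # add_n-style reduce: elementwise sum, producing ONE output tensor.
    return [sum(vals) for vals in zip(*per_device_grads)]

def all_reduce_sum(per_device_grads):
    # all-reduce: the same summed tensor, but one copy per device.
    total = reduce_sum(per_device_grads)
    return [list(total) for _ in per_device_grads]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 devices, 2-element grads
print(reduce_sum(grads))      # single summed tensor: [9.0, 12.0]
print(all_reduce_sum(grads))  # [9.0, 12.0] replicated on all 3 devices
```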
On a single device, `add_n` uses the least possible amount of memory, as it only allocates its output tensor. The reason `add_n` may use more memory than nccl all-reduce is that...