
Results: 116 comments of Reed

Not yet, but I hope to look at it soon. @eladweiss, thank you for your analysis! In benchmark_cnn.py, we set the env var TF_GPU_THREAD_MODE to gpu_private, which gives each GPU...
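As a minimal sketch of the mechanism mentioned above: `TF_GPU_THREAD_MODE` is a plain environment variable, so it has to be in the process environment before TensorFlow initializes its GPU devices. The snippet below only shows setting it from Python before the first TensorFlow import; in benchmark_cnn.py the variable is set programmatically in a similar way.

```python
import os

# TF_GPU_THREAD_MODE is read when TensorFlow initializes its GPU devices,
# so it must be set before the first `import tensorflow`. "gpu_private"
# gives each GPU its own dedicated thread pool for launching kernels.
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"

print(os.environ["TF_GPU_THREAD_MODE"])  # gpu_private
```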

This seems similar to #142. Did you try waiting at least a minute for the second worker? If so, can you paste the full output of the second worker after...

Hmmm, I cannot reproduce. @mrry, any ideas? I am not super familiar with the distributed runtime code so it's hard for me to debug this without being able to reproduce....

The "CreateSession still waiting for response from worker" messages are from @antoajayraj's logs. When I run, I get no issues at all. @antoajayraj, per @mrry's advice, can you double check you...

This is a good idea! I think the right approach would be to have one TensorFlow local variable per GPU per variable. Each step, each gradient on a GPU would be...
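A plain-Python sketch of the accumulation pattern described above (the real version would use one TensorFlow local variable per GPU per model variable; the function names here are illustrative, not from the codebase). Each step, each GPU adds its gradient into its own accumulator; periodically the accumulators are summed, applied, and reset.

```python
def make_accumulators(num_gpus, num_vars):
    # accumulators[gpu][var] holds the running gradient sum for that GPU/var
    # (standing in for one TF local variable per GPU per variable).
    return [[0.0] * num_vars for _ in range(num_gpus)]

def accumulate(accumulators, gpu, grads):
    # Each step, add this GPU's gradients into its own accumulators.
    for v, g in enumerate(grads):
        accumulators[gpu][v] += g

def apply_and_reset(accumulators, weights, lr):
    # Sum accumulators across GPUs, apply the averaged update, then zero them.
    num_gpus = len(accumulators)
    for v in range(len(weights)):
        total = sum(acc[v] for acc in accumulators)
        weights[v] -= lr * total / num_gpus
        for acc in accumulators:
            acc[v] = 0.0
    return weights

# Example: 2 GPUs, 1 variable; accumulate one step per GPU, then apply.
accs = make_accumulators(num_gpus=2, num_vars=1)
weights = [1.0]
accumulate(accs, gpu=0, grads=[0.5])
accumulate(accs, gpu=1, grads=[1.5])
weights = apply_and_reset(accs, weights, lr=0.1)
print(weights)  # [0.9]  (1.0 - 0.1 * (0.5 + 1.5) / 2)
```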

A `tf.cond` with `global_step % 10 == 0` as the condition would work. Alternatively, you could have two fetch ops, X and Y. Op X would apply the gradients from...
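The two-fetch-op alternative can be sketched as a plain-Python driver loop (the op names X and Y follow the comment above; the bodies are stand-ins for `session.run` calls on the corresponding ops): fetch X, which applies gradients, only when `global_step` is a multiple of 10, and fetch Y otherwise.

```python
def run_step(global_step):
    # Stand-in for the training loop's choice of which op to session.run.
    if global_step % 10 == 0:
        return "X"  # op that applies the accumulated gradients
    return "Y"      # op that skips the apply

fetched = [run_step(step) for step in range(20)]
print(fetched.count("X"), fetched.count("Y"))  # 2 18
```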

Currently TensorFlow 1.5 works, but in general tf_cnn_benchmarks is not guaranteed to work with the latest stable TensorFlow version, so the nightly builds should be used. If we do break...

That does sound like a better approach, but I didn't write the global_step code so I'm not sure. @zheng-xq is there a reason not to have only the chief update...

I have compiled with `--config=monolithic` before and it worked fine. What happens if you just run the line:

```
from tensorflow.contrib.data.python.ops import prefetching_ops
```

Also, try on TensorFlow 1.7.

What @ppwwyyxx said is correct. We do shuffle the data with a buffer size of 10,000, but it's likely that training is suboptimal because we ignore [shift_ratio](https://github.com/tensorflow/benchmarks/blob/aa947092eac44528b89ea3f4021f26c365e3128c/scripts/tf_cnn_benchmarks/preprocessing.py#L440). Unfortunately, we currently...
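For readers unfamiliar with what a shuffle buffer of 10,000 means: `tf.data`-style shuffling keeps a fixed-size buffer and emits a random element from it as each new input arrives, so the shuffle is only local when the dataset is much larger than the buffer. A plain-Python sketch of those semantics (the real implementation lives in TensorFlow's C++ dataset kernels):

```python
import random

def buffered_shuffle(iterable, buffer_size, rng):
    # Maintain a buffer of up to `buffer_size` elements; once it overflows,
    # emit a uniformly random buffered element. Each output therefore comes
    # from a sliding window of at most buffer_size + 1 consecutive inputs.
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) > buffer_size:
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    rng.shuffle(buf)  # drain whatever remains at end of input
    yield from buf

rng = random.Random(0)
out = list(buffered_shuffle(range(100), buffer_size=10, rng=rng))
print(sorted(out) == list(range(100)))  # True: a permutation of the input
```

With a 10,000-element buffer and a dataset like ImageNet (~1.28M images), each epoch's order is a permutation, but each example can only move a limited distance from its original position, which is one reason per-epoch shifting (`shift_ratio`) matters.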