
Results: 116 comments of Reed

Not yet, but I hope to look at it soon. @eladweiss, thank you for your analysis! In benchmark_cnn.py, we set the env var TF_GPU_THREAD_MODE to gpu_private, which gives each GPU...
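As a minimal sketch of the mechanism mentioned above: `TF_GPU_THREAD_MODE` is a plain environment variable, so it has to be in the process environment before TensorFlow initializes its GPU devices. The snippet below only shows setting it from Python before the first TensorFlow import; in benchmark_cnn.py the variable is set programmatically in a similar way.

```python
import os

# TF_GPU_THREAD_MODE is read when TensorFlow initializes its GPU devices,
# so it must be set before the first `import tensorflow`. "gpu_private"
# gives each GPU its own dedicated thread pool for launching kernels.
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"

print(os.environ["TF_GPU_THREAD_MODE"])  # gpu_private
```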

This seems similar to #142. Did you try waiting at least a minute for the second worker? If so, can you paste the full output of the second worker after...

Hmmm, I cannot reproduce. @mrry, any ideas? I am not super familiar with the distributed runtime code so it's hard for me to debug this without being able to reproduce....

The "CreateSession still waiting for response from worker" messages are from @antoajayraj's logs. When I run, I get no issues at all. @antoajayraj, per @mrry's advice, can you double check you...

This is a good idea! I think the right approach would be to have one TensorFlow local variable per GPU per variable. Each step, each gradient on a GPU would be...
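A plain-Python sketch of the accumulation pattern described above (the real version would use one TensorFlow local variable per GPU per model variable; the function names here are illustrative, not from the codebase). Each step, each GPU adds its gradient into its own accumulator; periodically the accumulators are summed, applied, and reset.

```python
def make_accumulators(num_gpus, num_vars):
    # accumulators[gpu][var] holds the running gradient sum for that GPU/var
    # (standing in for one TF local variable per GPU per variable).
    return [[0.0] * num_vars for _ in range(num_gpus)]

def accumulate(accumulators, gpu, grads):
    # Each step, add this GPU's gradients into its own accumulators.
    for v, g in enumerate(grads):
        accumulators[gpu][v] += g

def apply_and_reset(accumulators, weights, lr):
    # Sum accumulators across GPUs, apply the averaged update, then zero them.
    num_gpus = len(accumulators)
    for v in range(len(weights)):
        total = sum(acc[v] for acc in accumulators)
        weights[v] -= lr * total / num_gpus
        for acc in accumulators:
            acc[v] = 0.0
    return weights

# Example: 2 GPUs, 1 variable; accumulate one step per GPU, then apply.
accs = make_accumulators(num_gpus=2, num_vars=1)
weights = [1.0]
accumulate(accs, gpu=0, grads=[0.5])
accumulate(accs, gpu=1, grads=[1.5])
weights = apply_and_reset(accs, weights, lr=0.1)
print(weights)  # [0.9]  (1.0 - 0.1 * (0.5 + 1.5) / 2)
```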

A `tf.cond` with `global_step % 10 == 0` as the condition would work. Alternatively, you could have two fetch ops, X and Y. Op X would apply the gradients from...
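The two-fetch-op alternative can be sketched as a plain-Python driver loop (the op names X and Y follow the comment above; the bodies are stand-ins for `session.run` calls on the corresponding ops): fetch X, which applies gradients, only when `global_step` is a multiple of 10, and fetch Y otherwise.

```python
def run_step(global_step):
    # Stand-in for the training loop's choice of which op to session.run.
    if global_step % 10 == 0:
        return "X"  # op that applies the accumulated gradients
    return "Y"      # op that skips the apply

fetched = [run_step(step) for step in range(20)]
print(fetched.count("X"), fetched.count("Y"))  # 2 18
```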

Currently TensorFlow 1.5 works, but in general tf_cnn_benchmarks is not guaranteed to work with the latest stable TensorFlow version, so the nightly builds should be used. If we do break...

That does sound like a better approach, but I didn't write the global_step code so I'm not sure. @zheng-xq is there a reason not to have only the chief update...

I have compiled with `--config=monolithic` before and it worked fine. What happens if you just run the line:

```
from tensorflow.contrib.data.python.ops import prefetching_ops
```

Also, try on TensorFlow 1.7.

What @ppwwyyxx said is correct. We do shuffle the data with a buffer size of 10,000, but it's likely that training is suboptimal because we ignore [shift_ratio](https://github.com/tensorflow/benchmarks/blob/aa947092eac44528b89ea3f4021f26c365e3128c/scripts/tf_cnn_benchmarks/preprocessing.py#L440). Unfortunately, we currently...
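For readers unfamiliar with what a shuffle buffer of 10,000 means: `tf.data`-style shuffling keeps a fixed-size buffer and emits a random element from it as each new input arrives, so the shuffle is only local when the dataset is much larger than the buffer. A plain-Python sketch of those semantics (the real implementation lives in TensorFlow's C++ dataset kernels):

```python
import random

def buffered_shuffle(iterable, buffer_size, rng):
    # Maintain a buffer of up to `buffer_size` elements; once it overflows,
    # emit a uniformly random buffered element. Each output therefore comes
    # from a sliding window of at most buffer_size + 1 consecutive inputs.
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) > buffer_size:
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    rng.shuffle(buf)  # drain whatever remains at end of input
    yield from buf

rng = random.Random(0)
out = list(buffered_shuffle(range(100), buffer_size=10, rng=rng))
print(sorted(out) == list(range(100)))  # True: a permutation of the input
```

With a 10,000-element buffer and a dataset like ImageNet (~1.28M images), each epoch's order is a permutation, but each example can only move a limited distance from its original position, which is one reason per-epoch shifting (`shift_ratio`) matters.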