global_step should be protected by a lock?
I have run into a hang when running tf_cnn_benchmarks.py in distributed mode. I think global_step should be protected by a lock in this line.
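For concreteness, a minimal sketch, assuming a TF1-style graph, of the kind of locked increment being suggested; the names here are illustrative, not the benchmark's actual code:

```python
import tensorflow as tf

# Shared step counter that every worker's train op increments.
global_step = tf.train.get_or_create_global_step()

# use_locking=True serializes the read-modify-write, so concurrent
# workers cannot interleave their updates and lose increments.
inc_global_step = tf.assign_add(global_step, 1, use_locking=True)
```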
@suiyuan2009 I am asking the team about this problem.
Edit: It looks like Reed is looking at it and responded to the post in tensorflow/tensorflow. Keeping this one assigned to me so you can ping me if things get stale.
In my understanding, there is just one update op, placed on the ps device, and every worker calls it each step after its dependency ops finish.
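A minimal sketch of that setup, assuming TF1-style in-graph construction; the device string and variable name are illustrative:

```python
import tensorflow as tf

# One counter variable pinned to the ps device.
with tf.device("/job:ps/task:0/cpu:0"):
    global_step = tf.get_variable(
        "global_step", shape=[], dtype=tf.int64,
        initializer=tf.zeros_initializer(), trainable=False)

# Every worker builds this same op, so all replicas funnel their
# read-modify-write through the single variable on the ps.
inc_op = tf.assign_add(global_step, 1)
```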
I'm sorry, I clicked the close button because there was some water on my touchpad...
We should probably increment global_step only on the chief. That way there will be no performance impact from locking, and it makes more sense IMO for global_step to increment once per step; see the sketch below.
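A hedged sketch of that idea; `is_chief` and `apply_grads_op` are hypothetical stand-ins for pieces of the real setup, not names from tf_cnn_benchmarks:

```python
import tensorflow as tf

# Hypothetical stand-ins: task_index would come from the --task_index
# flag, and apply_grads_op from the optimizer's apply_gradients call.
task_index = 0
is_chief = (task_index == 0)
global_step = tf.train.get_or_create_global_step()
apply_grads_op = tf.no_op(name="apply_grads")

if is_chief:
    # Only the chief bumps the counter, so it advances exactly once
    # per step and no locking is needed on the increment.
    train_op = tf.group(apply_grads_op, tf.assign_add(global_step, 1))
else:
    train_op = apply_grads_op
```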
@zheng-xq what do you think? Also, can this problem cause deadlock, or would it just occasionally cause global_step to not be incremented as much as it should?
@suiyuan2009 can you give the exact command line arguments you used on each worker and PS to run tf_cnn_benchmarks?
I lost the job history; it happens when there are many workers. The command is like this.

ps:

```
python3 /home/dongziming/repos/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --model resnet101 --server_protocol grpc+verbs --num_batches 600 --num_gpus 4 \
  --variable_update distributed_replicated --local_parameter_device gpu \
  --ps_hosts 10.9.8.120:34213 --worker_hosts 10.9.8.119:33421 \
  --job_name ps --task_index 0
```
worker:

```
python3 /home/dongziming/repos/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --model resnet101 --server_protocol grpc+verbs --num_batches 600 --num_gpus 4 \
  --variable_update distributed_replicated --local_parameter_device gpu \
  --ps_hosts 10.9.8.120:34213 --worker_hosts 10.9.8.119:33421 \
  --job_name worker --task_index 0
```