global_step should be protected by a lock?

Open suiyuan2009 opened this issue 7 years ago • 4 comments

I ran into a hang when running tf_cnn_benchmarks.py in distributed mode. I think global_step should be protected by a lock in this line.

suiyuan2009 avatar Jul 25 '17 11:07 suiyuan2009

@suiyuan2009 I am asking the team about this problem.

Edit: It looks like Reed is looking at it and responded to the post in tensorflow/tensorflow. Keeping this one assigned to me so you can ping me if things get stale.

tfboyd avatar Jul 25 '17 14:07 tfboyd

As I understand it, there is just one update op placed on the ps device, and at every step each worker runs this op after its dependency ops have finished.
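To illustrate the locking idea, here is a minimal sketch in plain Python rather than TensorFlow. It assumes the shared update behaves like a read-modify-write on a single counter that several workers touch concurrently. `SharedStep` and `run_workers` are hypothetical names made up for this example and are not part of tf_cnn_benchmarks.

```python
import threading


class SharedStep:
    """Hypothetical stand-in for a global_step counter hosted on the ps."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Holding the lock makes the read-modify-write atomic across workers.
        with self._lock:
            self.value += 1


def run_workers(num_workers, steps_per_worker):
    """Every worker increments the shared counter once per step."""
    step = SharedStep()

    def worker():
        for _ in range(steps_per_worker):
            step.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return step.value
```

With the lock in place, the final value is always `num_workers * steps_per_worker`. Whether the real update op needs such a lock depends on how TensorFlow executes it, which this sketch does not model.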

I'm sorry, I clicked the close button because there was some water on my touchpad...

suiyuan2009 avatar Jul 25 '17 14:07 suiyuan2009

We should probably only increment global_step as chief. That way there will be no performance impact from locking, and it makes more sense IMO for the global_step to increment once per step.
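A rough sketch of the chief-only idea, in plain Python rather than TensorFlow: every worker runs the same number of steps, but only the worker with `task_index == 0` (assumed here to be the chief) touches the counter. `run_chief_only` is a hypothetical name used only for illustration.

```python
import threading


def run_chief_only(num_workers, num_steps):
    """Only the chief (task_index 0, by assumption) increments the counter."""
    state = {"global_step": 0}

    def worker(task_index):
        is_chief = task_index == 0
        for _ in range(num_steps):
            # The per-step training work would happen here.
            if is_chief:
                # Single writer, so no lock is needed.
                state["global_step"] += 1

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["global_step"]
```

Because there is exactly one writer, the counter ends at `num_steps` regardless of the number of workers, which matches the "increment once per step" behavior described above.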

@zheng-xq what do you think? Also, can this problem cause deadlock, or would it just occasionally cause global_step to not be incremented as much as it should?

@suiyuan2009 can you give the exact command line arguments you used on each worker and PS to run tf_cnn_benchmarks?

reedwm avatar Jul 25 '17 17:07 reedwm

I lost the job history. It happens when there are many workers. The commands look like this.

ps

python3 /home/dongziming/repos/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model resnet101 --server_protocol grpc+verbs --num_batches 600 --num_gpus 4 
--variable_update distributed_replicated --local_parameter_device gpu --ps_hosts 10.9.8.120:34213
--worker_hosts 10.9.8.119:33421 --job_name ps --task_index 0

worker

python3 /home/dongziming/repos/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
--model resnet101 --server_protocol grpc+verbs --num_batches 600 --num_gpus 4
--variable_update distributed_replicated --local_parameter_device gpu --ps_hosts 10.9.8.120:34213
--worker_hosts 10.9.8.119:33421 --job_name worker --task_index 0

suiyuan2009 avatar Jul 26 '17 02:07 suiyuan2009