
Fix the eval-function bug when variable_update=parameter_server|distributed_replicated

pan463194277 opened this issue 8 years ago · 1 comment

Firstly, the _eval function currently does not support 'variable_update=parameter_server' or 'variable_update=distributed_replicated', and errors occur when 'replicated' mode is used to restore parameters from a checkpoint created by training with 'variable_update=parameter_server|distributed_replicated'. I changed the 'target' to fix this.
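The "target" change described above can be sketched as a mode dispatch: distributed modes need the eval session to connect to a worker's gRPC server, while single-machine modes use an in-process session. This is an illustrative, TF-free sketch, not the actual tf_cnn_benchmarks code; the function name `choose_eval_target` and the gRPC address are hypothetical.

```python
# Hypothetical sketch of selecting the session target for eval based on the
# variable_update mode.  In the real script the worker address would come
# from the tf.train.Server instance; here it is just a placeholder string.

def choose_eval_target(variable_update, worker_grpc_target):
    """Return the session target the eval graph should connect to."""
    distributed_modes = ('parameter_server', 'distributed_replicated')
    if variable_update in distributed_modes:
        # Distributed modes connect to the worker's gRPC server so the
        # restored variables are placed on the right devices.
        return worker_grpc_target
    # Single-machine modes (e.g. 'replicated') use an in-process session,
    # signalled by an empty target string.
    return ''

print(choose_eval_target('parameter_server', 'grpc://worker0:2222'))
print(choose_eval_target('replicated', 'grpc://worker0:2222'))
```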

Secondly, when variable_update='distributed_replicated', the result of the eval function looks incorrect. I found that tf.global_variables() contained no parameters when restoring the checkpoint, and even during training tf.global_variables() held only 190+ parameters (copies of the trainable variables from local_variables), without 'batchnorm/gamma', 'batchnorm/moving_mean', and 'batchnorm/moving_variance'. So I changed the code to save/restore parameters from/to tf.local_variables(), and it worked.
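The second fix amounts to choosing which variable collection a Saver is built over: under 'distributed_replicated' the local-variable set is the one that actually holds the batch-norm statistics. A minimal, TF-free sketch of that selection logic follows; the function name `pick_restore_vars` and the variable names are hypothetical, and plain dicts stand in for TensorFlow variable collections.

```python
# Hypothetical sketch: decide which variable set a Saver should
# save/restore, depending on the variable_update mode.

def pick_restore_vars(variable_update, global_vars, local_vars):
    """Return the name->variable mapping to hand to a Saver."""
    if variable_update == 'distributed_replicated':
        # In this mode the global variables are only copies of the
        # trainables and lack the batch-norm moving statistics, so the
        # local-variable set must be used instead.
        return local_vars
    return global_vars

# Toy stand-ins for the two collections (names are illustrative only).
global_vars = {'conv1/weights': 1, 'batchnorm/gamma': 2}
local_vars = dict(global_vars, **{'batchnorm/moving_mean': 3,
                                  'batchnorm/moving_variance': 4})

chosen = pick_restore_vars('distributed_replicated', global_vars, local_vars)
print(sorted(chosen))
```

In TF 1.x terms, the chosen mapping would be passed as the `var_list` argument when constructing `tf.train.Saver`, so the checkpoint keys match the variables present in the eval graph.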

— pan463194277, Aug 10 '17

@reedwm Can you take a look? I think you are dealing with this internally. I will merge internal to external to get a better version out here and would like to clean up these PRs first. Thank you.

— tfboyd, Aug 18 '17