
Does TensorFlow 1.x support async-training?

Open jiahuiyang opened this issue 3 years ago • 2 comments

Dear All, does TensorFlow 1.x support async-training? I tried BytePS async-training with the TensorFlow MNIST example. After one batch update with the server, the weights become zeros on the worker.
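
For reference, the setup I ran is roughly along these lines (a minimal sketch, assuming the Horovod-style `byteps.tensorflow` API with `bps.init()`, `DistributedOptimizer`, and `BroadcastGlobalVariablesHook`, and that async mode is enabled via the `BYTEPS_ENABLE_ASYNC` environment variable at launch; the real MNIST example differs in the model details):

```python
import numpy as np
import tensorflow as tf
import byteps.tensorflow as bps

bps.init()

# Tiny stand-in for the MNIST model: one dense layer on 784-dim inputs.
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

opt = tf.train.GradientDescentOptimizer(0.01)
opt = bps.DistributedOptimizer(opt)            # the apply_gradients quoted below lives here

hooks = [bps.BroadcastGlobalVariablesHook(0)]  # broadcast initial weights from rank 0
train_op = opt.minimize(loss)

with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(100):
        xs = np.random.rand(32, 784).astype(np.float32)
        ys = np.random.randint(0, 10, size=32)
        sess.run(train_op, feed_dict={x: xs, y: ys})
```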

jiahuiyang avatar Jul 27 '21 01:07 jiahuiyang

@ymjiang

eric-haibin-lin avatar Jul 28 '21 21:07 eric-haibin-lin

@ymjiang Hi haibin and yimin, I have two problems with async-training. The first one is that the delta_w sent to the servers is all zeros. It seems `old_tensors` changes as `vars` changes in `tensorflow/__init__.py` (a possible fix is sketched after the code below):

```python
def apply_gradients(self, *args, **kwargs):
    """Calls this same method on the underlying optimizer."""
    if self._enable_async:  # async training
        grads_and_vars = args[0]
        _, vars = zip(*grads_and_vars)
        old_tensors = []
        for var in vars:
            old_tensors.append(tf.convert_to_tensor(var))
        apply_ops = self._optimizer.apply_gradients(*args, **kwargs)
        with tf.control_dependencies([apply_ops]):
            # get the delta
            for i, var in enumerate(vars):
                old_tensors[i] = tf.subtract(var, old_tensors[i])

            # reuse _push_pull_grads(), but it is transferring parameters
            updated_tensors = self._push_pull_grads(old_tensors)

            # copy the updated variables back
            assign_op_list = []
            for i, tensor in enumerate(updated_tensors):
                assign_op_list.append(tf.assign(vars[i], tensor))

        return control_flow_ops.group(*assign_op_list)
    else:
        return self._optimizer.apply_gradients(*args, **kwargs)
```

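In case it helps, here is a rough, untested sketch of the direction I think a fix could take, assuming the zero delta comes from the variable reads in `old_tensors` not being ordered before the weight update (the names follow the snippet above):

```python
def apply_gradients(self, *args, **kwargs):
    """Sketch only: pin the reads of the variables before the update runs,
    so old_tensors really hold the pre-update values instead of re-reading
    the variable after apply_gradients has already changed it."""
    if self._enable_async:  # async training
        grads_and_vars = args[0]
        _, vars = zip(*grads_and_vars)

        # Force an explicit snapshot of each variable now.
        old_tensors = [tf.identity(var.read_value()) for var in vars]

        # Make the weight update depend on those reads, so TF cannot
        # schedule the snapshot after the update.
        with tf.control_dependencies(old_tensors):
            apply_ops = self._optimizer.apply_gradients(*args, **kwargs)

        with tf.control_dependencies([apply_ops]):
            # delta = updated value - snapshot taken before the update
            deltas = [tf.subtract(var, old) for var, old in zip(vars, old_tensors)]
            updated_tensors = self._push_pull_grads(deltas)
            assign_op_list = [tf.assign(var, t) for var, t in zip(vars, updated_tensors)]
        return tf.group(*assign_op_list)
    return self._optimizer.apply_gradients(*args, **kwargs)
```
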
The second one is that the tensor's declared full name differs between the broadcast section and the training section. It seems the weight and delta_weight won't be summed on the server because they are declared under different keys. Please check `def _push_pull(tensor, scope='', name=None)` and `def broadcast(tensor, root_rank, scope='', name=None, is_variable=True)` in ops.py.
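
To illustrate what I mean (a hypothetical sketch, not the actual ops.py code; the prefixes and the `declared_key` helper are made up for illustration): if the two paths build the full tensor name with different prefixes, the same variable ends up under two distinct keys on the server, so the broadcast weight and the later delta_weight are never accumulated together.

```python
def declared_key(prefix, scope, name):
    # Hypothetical helper mimicking how a full tensor name might be built.
    return "{}{}{}".format(scope, prefix, name)

broadcast_key = declared_key("BytePSBroadcast_", "tower_0/", "dense/kernel")
pushpull_key  = declared_key("BytePSPushPull_",  "tower_0/", "dense/kernel")
print(broadcast_key == pushpull_key)  # False -> two different keys on the server
```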

If I misunderstood something, please shed light on it. Thanks!

jiahuiyang avatar Jul 29 '21 06:07 jiahuiyang