
Single-machine multi-GPU

Open yupeijei1997 opened this issue 3 years ago • 3 comments

Hello~

While using unif I ran into some confusion about the function below; please take a look when you have a free moment~

When the function below averages the gradients, if the grads are of type IndexedSlices it averages the values but takes the indices of the first grad. It seems to me that each grad's indices are different: with four GPUs, a batch is split into four shards with different data, so each shard should be looking up different rows of the embedding_table matrix.

If that's the case, taking only the first grad's indices seems to drop the gradients of some parameters in the embedding_table. And averaging the values directly means averaging gradient values that belong to different rows of the embedding_table across the different sub-batches, i.e. averaging gradients of different parameters, whereas intuitively only gradients of the same parameter should be averaged, which feels odd to me. Some single-machine multi-GPU gradient-averaging implementations I've seen online simply use tf.divide(tf.add_n(split_grads), len(split_grads)) regardless of whether the grad is an IndexedSlices; I'm not sure whether that would resolve my concern. https://github.com/geyingli/unif/blob/master/uf/utils.py#L748

def average_n_grads(split_grads):
    split_grads = [grad for grad in split_grads if grad is not None]

    # Dealing with IndexedSlices for a large-dimensional embedding
    # matrix. The gradient of an embedding matrix is not a tensor,
    # but a tuple-like object named `IndexedSlices`, which requires
    # special handling.
    if split_grads[0].__str__().startswith('IndexedSlices'):
        all_values = [grad.values for grad in split_grads]

        values = tf.divide(tf.add_n(all_values), len(split_grads))
        indices = split_grads[0].indices
        dense_shape = split_grads[0].dense_shape

        return tf.IndexedSlices(
            values=values,
            indices=indices,
            dense_shape=dense_shape)
    return tf.divide(tf.add_n(split_grads), len(split_grads))
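
To make the concern concrete, here is a minimal sketch I put together (toy shapes and values, not taken from unif) of what two replicas' IndexedSlices gradients for an embedding table might look like, and what keeping only the first replica's indices does to them:

# Toy sketch (made-up values): two replicas produce IndexedSlices gradients
# for a 6-row embedding table, but they look up different rows.
import tensorflow as tf

g0 = tf.IndexedSlices(values=tf.constant([[1., 1.], [2., 2.]]),
                      indices=tf.constant([0, 3]),   # replica 0 touched rows 0 and 3
                      dense_shape=tf.constant([6, 2], dtype=tf.int64))
g1 = tf.IndexedSlices(values=tf.constant([[4., 4.], [6., 6.]]),
                      indices=tf.constant([2, 5]),   # replica 1 touched rows 2 and 5
                      dense_shape=tf.constant([6, 2], dtype=tf.int64))

# The branch above averages the values but keeps only g0.indices, so replica 1's
# contribution to rows 2 and 5 is credited to rows 0 and 3 instead, and rows 2
# and 5 of the embedding table receive no gradient at all.
avg = tf.IndexedSlices(values=tf.divide(tf.add_n([g0.values, g1.values]), 2),
                       indices=g0.indices,
                       dense_shape=g0.dense_shape)

# Densify to see which rows actually get updated.
dense = tf.unsorted_segment_sum(avg.values, avg.indices, num_segments=6)
with tf.Session() as sess:
    print(sess.run(dense))   # only rows 0 and 3 are non-zero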

I also tried using tf.divide(tf.add_n(split_grads), len(split_grads)) directly, but then the following code in freelb raises an error saying grad.indices cannot be found. I found that if an IndexedSlices is not returned, grad comes back as a clip-op output (a dense tensor); if one is returned, grad is an IndexedSlices and grad.indices can then be found. https://github.com/geyingli/unif/blob/master/uf/processing.py#L483

r = tf.IndexedSlices(values=r,
                     indices=grad.indices,
                     dense_shape=grad.dense_shape)

Also, the init_r variable created by the code here has shape [batch_size * max_seq_length, embedding_dim]. With four GPUs on one machine, grad.indices should have shape [batch_size / 4 * max_seq_length], but the values passed in have shape [batch_size * max_seq_length], four times larger, which raises an error.

init_r = tf.get_variable(
    'init_r',
    shape=[module.batch_size * module.max_seq_length,
           param.shape.as_list()[-1]],
    initializer=tf.random_uniform_initializer(
        minval=-epsilon, maxval=epsilon),
    trainable=False)

r = tf.IndexedSlices(values=r,
                     indices=grad.indices,
                     dense_shape=grad.dense_shape)
InvalidArgumentError (see above for traceback): data.shape = [4096,768] does not start with segment_ids.shape = [1024]
	 [[node add_1/y (defined at /root/unif-tencent/uf/processing.py:590)  = UnsortedSegmentSum[T=DT_FLOAT, Tindices=DT_INT32, Tnumsegments=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](truediv_202, bert/embeddings/Reshape/_457, add_1/strided_slice)]]
	 [[{{node Assign/_476}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2147_Assign", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

For context, I ran into this issue after making some changes to the freelb implementation, shown below. The main change is that instead of using with_dependencies(), I generate a new attack_embedding_table after adding the perturbation to the embedding_table, and apply tf.stop_gradient to it so that it does not affect the gradient computation of the other parameters. During the forward pass, attack_embedding_table is passed as an argument to module._parallel_forward() to swap in the perturbed embedding_table dynamically.

    def _freelb(self, module, alpha=0.3, epsilon=0.3, n_loop=3, **kwargs):
        # FreeLB is similar to PGD, but uses the average of the gradients
        # collected over the loop, i.e.
        #   grad = (first_grad + ... + last_grad) / n_loop
        #
        # Also, it initializes the perturbation not from the usual forward
        # propagation, but from a uniform distribution within the epsilon
        # range, and it averages the collected gradients rather than using
        # a single actual gradient. The perturbation is iterated in the
        # same way as in PGD.
        # (epsilon: the norm of the perturbation, must be smaller than the
        # norm of the gradients)

        # initialize
        (d_grads, module._losses, module._probs, module._preds) = \
            module._parallel_forward(**self._kwargs)
        grad, param = utils.get_grad_and_param(
            module.trainable_variables, d_grads, 'word_embedding')
        init_r = tf.get_variable(
            'init_r',
            shape=[module.batch_size * module.max_seq_length,
                   param.shape.as_list()[-1]],
            initializer=tf.random_uniform_initializer(
                minval=-epsilon, maxval=epsilon),
            trainable=False)
        init_op = tf.variables_initializer([init_r])
        with tf.control_dependencies([init_op]):    # fix perturbation
            # Scale the randomly initialized perturbation, to make sure
            # the norm of `r` is smaller than epsilon.
            shape = tf.cast(np.prod(init_r.shape.as_list()), tf.float32)
            r = tf.divide(init_r, tf.sqrt(shape))
            r = tf.IndexedSlices(values=r,
                                 indices=grad.indices,
                                 dense_shape=grad.dense_shape)

        # with tf.control_dependencies([init_op]):    # fix perturbation
        #     # Scale the randomly initialized perturbation, to make sure
        #     # the norm of `r` is smaller than epsilon.
        #     r = tf.divide(init_r, tf.norm(init_r, np.inf))
        #     r = tf.IndexedSlices(values=r,
        #                          indices=grad.indices,
        #                          dense_shape=grad.dense_shape)
        #     attack_op = param.assign(param + r)

        # attack
        acc_r = r
        all_grads = []
        for k in range(n_loop):
            attack_param = param + acc_r  ###### modified
            attack_param = tf.stop_gradient(attack_param)  ###### modified
            module.attack_trainable_variables = [attack_param if v.name == 'bert/embeddings/word_embeddings:0' else v for v in
                                                 module.trainable_variables]  ###### modified
            (attack_grads, _, _, _) = \
                module._parallel_forward(attack_embeddings=attack_param, **self._kwargs)  ###### modified
            all_grads.append(attack_grads)
            grad, _ = utils.get_grad_and_param(
                module.attack_trainable_variables,
                attack_grads, attack_param.name)
            tmp_r = tf.multiply(alpha, grad / (tf.norm(grad) + 1e-9))

            # In order not to shuffle the distribution of gradient-
            # induced perturbation, we use norm to scale instead of
            # simply clip the values.
            norm = tf.norm(acc_r + tmp_r)
            cur_r = tf.cond(norm > epsilon,
                            lambda: (acc_r + tmp_r) * tf.divide(epsilon, norm),
                            lambda: (acc_r + tmp_r))
            acc_r = cur_r

        attack_param = param + acc_r  ###### modified
        attack_param = tf.stop_gradient(attack_param)  ###### modified
        module.attack_trainable_variables = [attack_param if v.name == 'bert/embeddings/word_embeddings:0' else v for v in
                                             module.trainable_variables]  ###### modified
        (attack_grads, _, _, _) = \
            module._parallel_forward(attack_embeddings=attack_param, **self._kwargs)  ###### modified
        all_grads.append(attack_grads)

        # sum up
        grads = [utils.average_n_grads(split_grad) for split_grad in zip(
            *all_grads)]
        update_params_op = utils.update_global_params(
            module.trainable_variables, module._global_step,
            module._optimizer, grads)
        update_step_op = module._global_step.assign(module._global_step + 1)
        module._train_op = tf.group([update_params_op, update_step_op])

This then produces the following error, which essentially says that 1024 values are expected but 4096 were given. I set batch_size to 128 with four GPUs on one machine, and max_seq_length to 32; 1024 is exactly 128 / 4 * 32 and 4096 is exactly 128 * 32, which is what led to my confusion above. When I set the shape of init_r to [batch_size * max_seq_length / n_device, embedding_dim], it runs fine.

Error location: https://github.com/geyingli/unif/blob/master/uf/modeling/bert.py#L174

InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 1024 values, but the requested shape has 4096
         [[node gradients_4/bert_4/embeddings/embedding_look_up_grad/Reshape_1 (defined at /jizhi/jizhi2/worker/trainer/uf/core.py:859)  = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert_16/embeddings/ExpandDims, gradients_16/bert_16/embeddings/embedding_look_up_grad/ExpandDims)]]
         [[{{node concat_2/_10269}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_124285_concat_2", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
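
For reference, the shape arithmetic behind both error messages, with my settings (batch_size = 128, 4 GPUs, max_seq_length = 32), works out as follows:

# Illustration only: where 4096 and 1024 come from under my settings.
batch_size, n_device, max_seq_length = 128, 4, 32

full_rows = batch_size * max_seq_length                     # 4096, the rows of init_r
per_device_rows = batch_size // n_device * max_seq_length   # 1024, the rows covered by one GPU's grad.indices
print(full_rows, per_device_rows)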

yupeijei1997 · Apr 29 '21 14:04

Hi peijie~ Taking the last bit of time before the holiday to answer your questions.

Regarding your first point, the multi-GPU gradient-averaging issue: I'm surprised such an obvious bug went unnoticed for so long (probably because I rarely use multiple GPUs). I think it was written that way as a temporary workaround for compatibility with lower TensorFlow versions, and after it ran successfully I forgot to replace it with the correct implementation. I have removed that if branch; you can see the change in the latest commit. Thank you for helping me find a critical bug.

As for the second question, here is the situation: by the time UNIF reaches the initialization of r, the multi-GPU data distribution and gradient collection have already been completed (inside the _parallel_forward method, see https://github.com/geyingli/unif/blob/master/uf/core.py#L758 ), so the batch_size there is correct, and the modification described above will naturally raise an error.

I hope that answers your questions~

geyingli · Apr 30 '21 07:04

I have updated average_n_grads() to the following code:

def average_n_grads(split_grads):
    split_grads = [grad for grad in split_grads if grad is not None]
    if len(split_grads) == 1:
        return split_grads[0]

    # Dealing with IndexedSlices for a large-dimensional embedding
    # matrix. The gradient of an embedding matrix is not a tensor,
    # but a tuple-like object named `IndexedSlices`, which requires
    # special handling.
    if split_grads[0].__str__().startswith('IndexedSlices'):

        values = tf.concat([grad.values for grad in split_grads], axis=0)
        indices = tf.concat([grad.indices for grad in split_grads], axis=0)
        dense_shape = split_grads[0].dense_shape
        
        return tf.IndexedSlices(
            values=values,
            indices=indices,
            dense_shape=dense_shape)

    return tf.divide(tf.add_n(split_grads), len(split_grads))

It passes on both single-GPU and multi-GPU setups, as well as across the various adversarial training modes. A numerical check of the values seems logically unnecessary to me, but if there is time we could still verify it more carefully.
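
If we do want that numerical check at some point, a rough sketch could be to densify the merged IndexedSlices and compare it row by row against a reference gradient computed on a single device over the same un-split batch. Whether the two should match exactly or only up to the per-replica loss scaling depends on how the losses are averaged inside _parallel_forward; the names merged, dense_ref and vocab_size below are hypothetical placeholders:

import tensorflow as tf

def densify(indexed_slices, num_rows):
    # Sum the sparse rows into a dense [num_rows, dim] tensor; duplicate
    # indices (the same token appearing on several GPUs) are added together.
    return tf.unsorted_segment_sum(indexed_slices.values,
                                   indexed_slices.indices,
                                   num_segments=num_rows)

# merged = average_n_grads(split_grads)   # IndexedSlices branch above
# dense_ref = ...                         # gradient of the same (un-split) batch on one device
# diff = tf.reduce_max(tf.abs(densify(merged, vocab_size) - dense_ref))
# with tf.Session() as sess:
#     print(sess.run(diff))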

geyingli · Apr 30 '21 08:04

That perfectly resolves my confusion!

After updating the code I was also able to run it successfully. The logic is very clear and looks correct to me. Thanks!

Have a great holiday~

yupeijei1997 · Apr 30 '21 15:04