unif
Single machine, multiple GPUs
Hello~
While using unif I ran into a question about the function below; please take a look when you have a moment~
When the function below averages gradients, if the grads are of type IndexedSlices it averages the values but takes the indices of the first grad only. However, each grad's indices should differ: with four GPUs a batch is split into four parts with different data, so each part should touch different rows of the embedding_table matrix.
If only the first grad's indices are used, the gradients of some rows of embedding_table seem to be dropped. And averaging the values directly means averaging gradient values that belong to different rows of embedding_table across the sub-batches, i.e. averaging gradients of different parameters, whereas intuitively only gradients of the same parameter should be averaged, which feels odd. Some single-machine multi-GPU implementations I found online simply use tf.divide(tf.add_n(split_grads), len(split_grads)) regardless of whether the grads are IndexedSlices; I am not sure whether that would resolve my concern. https://github.com/geyingli/unif/blob/master/uf/utils.py#L748
def average_n_grads(split_grads):
    split_grads = [grad for grad in split_grads if grad is not None]
    # Dealing with IndexedSlices for large-dimensional embedding
    # matrix. The gradient of an embedding matrix is not a tensor,
    # but a tuple-like object named `IndexedSlices`, for this one,
    # we need to take special processings.
    if split_grads[0].__str__().startswith('IndexedSlices'):
        all_values = [grad.values for grad in split_grads]
        values = tf.divide(tf.add_n(all_values), len(split_grads))
        indices = split_grads[0].indices
        dense_shape = split_grads[0].dense_shape
        return tf.IndexedSlices(
            values=values,
            indices=indices,
            dense_shape=dense_shape)
    return tf.divide(tf.add_n(split_grads), len(split_grads))
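To make my concern concrete, here is a toy numpy example (not the actual unif code; just two devices and a six-row embedding table made up for illustration):

import numpy as np

# Two fake per-device IndexedSlices-style gradients whose sub-batches
# touch different rows of a 6-row embedding table.
vocab_size, dim = 6, 2
grad0 = {"indices": np.array([0, 1]), "values": np.ones((2, dim))}      # device 0
grad1 = {"indices": np.array([2, 3]), "values": 2 * np.ones((2, dim))}  # device 1

# Current behaviour: average the values but keep only grad0's indices.
old = np.zeros((vocab_size, dim))
old[grad0["indices"]] += (grad0["values"] + grad1["values"]) / 2
# -> rows 2 and 3 receive no gradient at all, and rows 0 and 1 absorb
#    values that actually belong to rows 2 and 3.

# What I would intuitively expect: each device's values land on its own rows.
expected = np.zeros((vocab_size, dim))
for g in (grad0, grad1):
    expected[g["indices"]] += g["values"]

print(old)       # only rows 0-1 are non-zero
print(expected)  # rows 0-3 are non-zero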
I also tried using tf.divide(tf.add_n(split_grads), len(split_grads)) directly, but then the following code in freelb raised an error that grad.indices could not be found. I noticed that if the function does not return IndexedSlices, grad comes back as a clip-type variable (a plain tensor); if it does return IndexedSlices, grad.indices can be found. https://github.com/geyingli/unif/blob/master/uf/processing.py#L483
r = tf.IndexedSlices(values=r,
                     indices=grad.indices,
                     dense_shape=grad.dense_shape)
Also, the init_r variable created by the code here has shape [batch_size * max_seq_length, embedding_dim]. With four GPUs on one machine, grad.indices should have shape [batch_size / 4 * max_seq_length], but the values passed in have shape [batch_size * max_seq_length], four times larger, so an error is raised.
init_r = tf.get_variable(
    'init_r',
    shape=[module.batch_size * module.max_seq_length,
           param.shape.as_list()[-1]],
    initializer=tf.random_uniform_initializer(
        minval=-epsilon, maxval=epsilon),
    trainable=False)
r = tf.IndexedSlices(values=r,
                     indices=grad.indices,
                     dense_shape=grad.dense_shape)
InvalidArgumentError (see above for traceback): data.shape = [4096,768] does not start with segment_ids.shape = [1024]
[[node add_1/y (defined at /root/unif-tencent/uf/processing.py:590) = UnsortedSegmentSum[T=DT_FLOAT, Tindices=DT_INT32, Tnumsegments=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](truediv_202, bert/embeddings/Reshape/_457, add_1/strided_slice)]]
[[{{node Assign/_476}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2147_Assign", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
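To spell out where the two numbers in that error come from, here is plain arithmetic using the settings I describe further down (batch_size=128, four GPUs, max_seq_length=32, embedding_dim=768); this is just my own restatement of the shapes, not repository code:

# Plain arithmetic restating the shapes in the error above.
batch_size, n_device, max_seq_length, embedding_dim = 128, 4, 32, 768

# init_r (and therefore r's values) is sized for the full batch:
value_rows = batch_size * max_seq_length              # 4096
# while grad.indices, as I understand it, covers one device's shard:
index_rows = batch_size // n_device * max_seq_length  # 1024

print([value_rows, embedding_dim])  # data.shape = [4096, 768]
print([index_rows])                 # segment_ids.shape = [1024]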
For background, I hit this issue after making some changes to the freelb implementation, as shown below. The main change is that instead of using with_dependencies(), I generate a new attack_embedding_table after adding the perturbation to embedding_table, and apply tf.stop_gradient to it so it does not interfere with the gradient computation of other parameters. During the forward pass, attack_embedding_table is passed as an argument to module._parallel_forward() to swap in the perturbed embedding_table dynamically.
def _freelb(self, module, alpha=0.3, epsilon=0.3, n_loop=3, **kwargs):
    # FreeLB is similar to PGD, but uses average gradients from loop.
    # i.e. grad = (first_grad + ... + last_grad) / n_loop
    #
    # Also, it initializes the perturbation not from usual forward
    # propagation, but a collection of uniform distribution within
    # epsilon range. It does not use actual gradient to average
    # gradients. The perturbation is iterated, in the same way with
    # PGD.
    # (epsilon: the norm of perturbation, must be smaller than the
    # norm of gradients)

    # initialize
    (d_grads, module._losses, module._probs, module._preds) = \
        module._parallel_forward(**self._kwargs)
    grad, param = utils.get_grad_and_param(
        module.trainable_variables, d_grads, 'word_embedding')
    init_r = tf.get_variable(
        'init_r',
        shape=[module.batch_size * module.max_seq_length,
               param.shape.as_list()[-1]],
        initializer=tf.random_uniform_initializer(
            minval=-epsilon, maxval=epsilon),
        trainable=False)
    init_op = tf.variables_initializer([init_r])
    with tf.control_dependencies([init_op]):  # fix perturbation
        # Scale the randomly initialized perturbation, to make sure
        # the norm of `r` is smaller than epsilon.
        shape = tf.cast(np.prod(init_r.shape.as_list()), tf.float32)
        r = tf.divide(init_r, tf.sqrt(shape))
        r = tf.IndexedSlices(values=r,
                             indices=grad.indices,
                             dense_shape=grad.dense_shape)
    # with tf.control_dependencies([init_op]):  # fix perturbation
    #     # Scale the randomly initialized perturbation, to make sure
    #     # the norm of `r` is smaller than epsilon.
    #     r = tf.divide(init_r, tf.norm(init_r, np.inf))
    #     r = tf.IndexedSlices(values=r,
    #                          indices=grad.indices,
    #                          dense_shape=grad.dense_shape)
    #     attack_op = param.assign(param + r)

    # attack
    acc_r = r
    all_grads = []
    for k in range(n_loop):
        attack_param = param + acc_r  ###### modified part
        attack_param = tf.stop_gradient(attack_param)  ###### modified part
        module.attack_trainable_variables = [  ###### modified part
            attack_param if v.name == 'bert/embeddings/word_embeddings:0'
            else v for v in module.trainable_variables]
        (attack_grads, _, _, _) = module._parallel_forward(
            attack_embeddings=attack_param, **self._kwargs)  ###### modified part
        all_grads.append(attack_grads)
        grad, _ = utils.get_grad_and_param(
            module.attack_trainable_variables,
            attack_grads, attack_param.name)
        tmp_r = tf.multiply(alpha, grad / (tf.norm(grad) + 1e-9))

        # In order not to shuffle the distribution of gradient-
        # induced perturbation, we use norm to scale instead of
        # simply clipping the values.
        norm = tf.norm(acc_r + tmp_r)
        cur_r = tf.cond(norm > epsilon,
                        lambda: (acc_r + tmp_r) * tf.divide(epsilon, norm),
                        lambda: (acc_r + tmp_r))
        acc_r = cur_r

    attack_param = param + acc_r  ###### modified part
    attack_param = tf.stop_gradient(attack_param)  ###### modified part
    module.attack_trainable_variables = [  ###### modified part
        attack_param if v.name == 'bert/embeddings/word_embeddings:0'
        else v for v in module.trainable_variables]
    (attack_grads, _, _, _) = module._parallel_forward(
        attack_embeddings=attack_param, **self._kwargs)  ###### modified part
    all_grads.append(attack_grads)

    # sum up
    grads = [utils.average_n_grads(split_grad) for split_grad in zip(
        *all_grads)]
    update_params_op = utils.update_global_params(
        module.trainable_variables, module._global_step,
        module._optimizer, grads)
    update_step_op = module._global_step.assign(module._global_step + 1)
    module._train_op = tf.group([update_params_op, update_step_op])
This produces the error below, which essentially says that 1024 values are expected but 4096 are given. I set batch_size to 128, with four GPUs on one machine, and max_seq_length to 32; 1024 is exactly 128 / 4 * 32, and 4096 is exactly 128 * 32. That is where my doubts above came from. When I set init_r's shape to [batch_size * max_seq_length / n_device, embedding_dim], it runs normally.
Error location: https://github.com/geyingli/unif/blob/master/uf/modeling/bert.py#L174
InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 1024 values, but the requested shape has 4096
[[node gradients_4/bert_4/embeddings/embedding_look_up_grad/Reshape_1 (defined at /jizhi/jizhi2/worker/trainer/uf/core.py:859) = Reshape[T=DT_INT32, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert_16/embeddings/ExpandDims, gradients_16/bert_16/embeddings/embedding_look_up_grad/ExpandDims)]]
[[{{node concat_2/_10269}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_124285_concat_2", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hi peijie~ Let me use the last bit of time before the holiday to answer your questions.
On your first point, averaging gradients across multiple GPUs: I am surprised that such an obvious bug went unnoticed by me for so long (probably because I rarely use multiple GPUs). I think it was written that way as a temporary workaround, to stay compatible with older TensorFlow versions, and once it ran successfully I forgot to replace it with the correct implementation. I have removed that if branch; you can see the change in the latest commit. Thanks for helping me find a fatal bug.
As for the second question, here is what happens. By the time UNIF reaches the initialization of r, the data distribution across GPUs and the gradient collection have already been completed (inside the _parallel_forward method, see https://github.com/geyingli/unif/blob/master/uf/core.py#L758 ), so the batch_size there is correct, and the modified approach you describe will naturally raise an error.
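Roughly, the shapes line up like this (back-of-the-envelope arithmetic with your settings, not code from the repository):

# Illustrative arithmetic only, with batch_size=128, 4 GPUs, max_seq_length=32.
batch_size, n_device, max_seq_length = 128, 4, 32

# Each GPU's embedding-lookup gradient covers its own shard of the batch:
rows_per_device = batch_size // n_device * max_seq_length  # 1024

# After the per-device gradients are collected, the merged gradient covers
# the whole batch, which is exactly what init_r is sized for:
collected_rows = n_device * rows_per_device                # 4096
init_r_rows = batch_size * max_seq_length                  # 4096
assert collected_rows == init_r_rows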
I hope this answers your questions~
I have updated average_n_grads() to the following code:
def average_n_grads(split_grads):
    split_grads = [grad for grad in split_grads if grad is not None]
    if len(split_grads) == 1:
        return split_grads[0]
    # Dealing with IndexedSlices for large-dimensional embedding
    # matrix. The gradient of an embedding matrix is not a tensor,
    # but a tuple-like object named `IndexedSlices`, for this one,
    # we need to take special processings.
    if split_grads[0].__str__().startswith('IndexedSlices'):
        values = tf.concat([grad.values for grad in split_grads], axis=0)
        indices = tf.concat([grad.indices for grad in split_grads], axis=0)
        dense_shape = split_grads[0].dense_shape
        return tf.IndexedSlices(
            values=values,
            indices=indices,
            dense_shape=dense_shape)
    return tf.divide(tf.add_n(split_grads), len(split_grads))
It runs successfully on single-GPU and multi-GPU setups, as well as with all the adversarial training modes. A value-level numerical check is, I think, logically unnecessary, but if there is time we could still do it more carefully.
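If we do get around to it, a rough value-level check could look something like this (a toy sketch assuming TF 1.x, with two made-up per-device gradients on a five-row table; not code from the repository):

import tensorflow as tf

dense_shape = tf.constant([5, 3], dtype=tf.int64)
g0 = tf.IndexedSlices(values=tf.ones([2, 3]),
                      indices=tf.constant([0, 1]),
                      dense_shape=dense_shape)
g1 = tf.IndexedSlices(values=2 * tf.ones([2, 3]),
                      indices=tf.constant([3, 4]),
                      dense_shape=dense_shape)

# Merge the two "device" gradients the same way the new average_n_grads does.
merged = tf.IndexedSlices(
    values=tf.concat([g0.values, g1.values], axis=0),
    indices=tf.concat([g0.indices, g1.indices], axis=0),
    dense_shape=dense_shape)

# Densify both sides and compare: the merged slices should equal the sum of
# the per-device gradients scattered into the full table.
merged_dense = tf.convert_to_tensor(merged)
reference = tf.convert_to_tensor(g0) + tf.convert_to_tensor(g1)

with tf.Session() as sess:
    print(sess.run(tf.reduce_all(tf.equal(merged_dense, reference))))  # True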
That perfectly resolves my confusion!
After updating the code I also ran it successfully. The logic is very clear and looks correct to me. Thanks!
Have a great holiday~