Wrong Batch Normalization
In the function normalize():

```
with tf.variable_scope(scope, reuse=reuse):
    inputs_shape = inputs.get_shape()
    params_shape = inputs_shape[-1:]

    mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
    print('mean.get_shape()', mean.get_shape())
    beta = tf.Variable(tf.zeros(params_shape))
    gamma = tf.Variable(tf.ones(params_shape))
    normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
    outputs = gamma * normalized + beta
```

But I think the second argument of tf.nn.moments() should not be [-1], since we need to take the batch information into account. The modified code is shown below:
```
with tf.variable_scope(scope, reuse=reuse):
    inputs_shape = inputs.get_shape()
    params_shape = inputs_shape[-1:]

    axis = list(range(len(inputs_shape) - 1))
    mean, variance = tf.nn.moments(inputs, axis, keep_dims=True)
    print('mean.get_shape()', mean.get_shape())
    beta = tf.Variable(tf.zeros(params_shape))
    gamma = tf.Variable(tf.ones(params_shape))
    normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
    outputs = gamma * normalized + beta
```
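For a concrete comparison of the two axis choices, here is a minimal sketch (assuming TF 1.x and an input of shape [batch, seq_len, hidden]; the random tensor and session below are only for illustration):

```
import numpy as np
import tensorflow as tf

x = tf.constant(np.random.randn(2, 5, 8), dtype=tf.float32)   # [batch, seq_len, hidden]

# Original code: statistics per (example, position), over the hidden dim only
mean_last, _ = tf.nn.moments(x, [-1], keep_dims=True)

# Modified code above: statistics over batch and seq_len, per hidden unit
axis = list(range(len(x.get_shape()) - 1))                     # [0, 1]
mean_batch, _ = tf.nn.moments(x, axis, keep_dims=True)

with tf.Session() as sess:
    print(sess.run(tf.shape(mean_last)))    # [2 5 1]
    print(sess.run(tf.shape(mean_batch)))   # [1 1 8]
```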
The Transformer uses Layer Normalization rather than Batch Normalization. Layer Normalization does not need to consider the batch information; see Layer Normalization at the end of page 2 of the paper.
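To illustrate that point, a minimal sketch (assuming TF 1.x; the random arrays are only for illustration): with last-dim statistics, perturbing other examples in the batch leaves a given example's statistics unchanged.

```
import numpy as np
import tensorflow as tf

a = np.random.randn(2, 3, 4).astype(np.float32)
b = a.copy()
b[1] += 100.0   # perturb only the second example in the batch

# Layer-norm style: statistics are computed within each example (last dim only)
mean_a, _ = tf.nn.moments(tf.constant(a), [-1], keep_dims=True)
mean_b, _ = tf.nn.moments(tf.constant(b), [-1], keep_dims=True)

with tf.Session() as sess:
    ma, mb = sess.run([mean_a, mean_b])
    # The first example's statistics are untouched: no batch dependence
    print(np.allclose(ma[0], mb[0]))   # True
```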
However, I suspect that the implementation of layer_norm is still wrong. The paper does not describe the model clearly. I suggest referring to the implementation in layers.py and changing the code to `axis = list(range(1, len(inputs_shape)))`.
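A minimal sketch of the suggested change (assuming TF 1.x; the function name and the use of tf.get_variable instead of tf.Variable are my own choices, not the repository's code):

```
import tensorflow as tf

def layer_norm_all_but_batch(inputs, epsilon=1e-8, scope="ln", reuse=None):
    # Suggested variant: normalize over every axis except the batch axis,
    # i.e. over both seq_len and hidden for a [batch, seq_len, hidden] input.
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
        axis = list(range(1, len(inputs_shape)))                       # e.g. [1, 2]
        mean, variance = tf.nn.moments(inputs, axis, keep_dims=True)   # (batch, 1, 1)
        beta = tf.get_variable("beta", params_shape, initializer=tf.zeros_initializer())
        gamma = tf.get_variable("gamma", params_shape, initializer=tf.ones_initializer())
        normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
        return gamma * normalized + beta
```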
Agreed. In this repository, the author normalizes only over the last dimension, not along the sequence-length direction. But in the general layer-normalization case, the normalization should only be independent of the batch dimension.