Wrong Batch Normalization
In the function normalize():

```
with tf.variable_scope(scope, reuse=reuse):
    inputs_shape = inputs.get_shape()
    params_shape = inputs_shape[-1:]

    mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
    print('mean.get_shape()', mean.get_shape())
    beta = tf.Variable(tf.zeros(params_shape))
    gamma = tf.Variable(tf.ones(params_shape))
    normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
    outputs = gamma * normalized + beta
```

But I think the second argument of tf.nn.moments() should not be [-1], since we need to take the batch information into account. The modified code is shown below:
```
with tf.variable_scope(scope, reuse=reuse):
    inputs_shape = inputs.get_shape()
    params_shape = inputs_shape[-1:]

    axis = list(range(len(inputs_shape) - 1))
    mean, variance = tf.nn.moments(inputs, axis, keep_dims=True)
    print('mean.get_shape()', mean.get_shape())
    beta = tf.Variable(tf.zeros(params_shape))
    gamma = tf.Variable(tf.ones(params_shape))
    normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
    outputs = gamma * normalized + beta
```
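For a concrete comparison of the two axis choices, here is a minimal sketch (assuming TF 1.x and an input of shape [batch, seq_len, hidden]; the random tensor and session below are only for illustration):

```
import numpy as np
import tensorflow as tf

x = tf.constant(np.random.randn(2, 5, 8), dtype=tf.float32)   # [batch, seq_len, hidden]

# Original code: statistics per (example, position), over the hidden dim only
mean_last, _ = tf.nn.moments(x, [-1], keep_dims=True)

# Modified code above: statistics over batch and seq_len, per hidden unit
axis = list(range(len(x.get_shape()) - 1))                     # [0, 1]
mean_batch, _ = tf.nn.moments(x, axis, keep_dims=True)

with tf.Session() as sess:
    print(sess.run(tf.shape(mean_last)))    # [2 5 1]
    print(sess.run(tf.shape(mean_batch)))   # [1 1 8]
```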
The Transformer uses Layer Normalization rather than Batch Normalization. Layer Normalization does not need to consider the batch information; see Layer Normalization at the end of page 2 of the paper.
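To illustrate that point, a minimal sketch (assuming TF 1.x; the random arrays are only for illustration): with last-dim statistics, perturbing other examples in the batch leaves a given example's statistics unchanged.

```
import numpy as np
import tensorflow as tf

a = np.random.randn(2, 3, 4).astype(np.float32)
b = a.copy()
b[1] += 100.0   # perturb only the second example in the batch

# Layer-norm style: statistics are computed within each example (last dim only)
mean_a, _ = tf.nn.moments(tf.constant(a), [-1], keep_dims=True)
mean_b, _ = tf.nn.moments(tf.constant(b), [-1], keep_dims=True)

with tf.Session() as sess:
    ma, mb = sess.run([mean_a, mean_b])
    # The first example's statistics are untouched: no batch dependence
    print(np.allclose(ma[0], mb[0]))   # True
```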
However, I suspect that the implementation of layer_norm is still wrong. The paper does not describe the model clearly. I suggest referring to the implementation in layers.py and changing the code to `axis = list(range(1, len(inputs_shape)))`.
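A minimal sketch of the suggested change (assuming TF 1.x; the function name and the use of tf.get_variable instead of tf.Variable are my own choices, not the repository's code):

```
import tensorflow as tf

def layer_norm_all_but_batch(inputs, epsilon=1e-8, scope="ln", reuse=None):
    # Suggested variant: normalize over every axis except the batch axis,
    # i.e. over both seq_len and hidden for a [batch, seq_len, hidden] input.
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
        axis = list(range(1, len(inputs_shape)))                       # e.g. [1, 2]
        mean, variance = tf.nn.moments(inputs, axis, keep_dims=True)   # (batch, 1, 1)
        beta = tf.get_variable("beta", params_shape, initializer=tf.zeros_initializer())
        gamma = tf.get_variable("gamma", params_shape, initializer=tf.ones_initializer())
        normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
        return gamma * normalized + beta
```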
Agreed. In this repository, the author normalizes only over the last dimension, not along the sequence-length direction. But in the general layer-normalization case, the normalization should only be independent of the batch dimension.