
Wrong Batch Normalization

Open bryant03 opened this issue 6 years ago • 3 comments

In the function `normalize()`:

```python
with tf.variable_scope(scope, reuse=reuse):
    inputs_shape = inputs.get_shape()
    params_shape = inputs_shape[-1:]
    # Mean and variance are computed over the last axis only.
    mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
    print('mean.get_shape()', mean.get_shape())
    beta = tf.Variable(tf.zeros(params_shape))
    gamma = tf.Variable(tf.ones(params_shape))
    normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
    outputs = gamma * normalized + beta
```

but I think the second argument of `tf.nn.moments()` should not be `[-1]`, since we need to take the batch information into account. After the modification, the code looks like this:

```python
with tf.variable_scope(scope, reuse=reuse):
    inputs_shape = inputs.get_shape()
    params_shape = inputs_shape[-1:]
    # Reduce over every axis except the last one, i.e. including the batch axis.
    axis = list(range(len(inputs_shape) - 1))
    mean, variance = tf.nn.moments(inputs, axis, keep_dims=True)
    print('mean.get_shape()', mean.get_shape())
    beta = tf.Variable(tf.zeros(params_shape))
    gamma = tf.Variable(tf.ones(params_shape))
    normalized = (inputs - mean) / ((variance + epsilon) ** 0.5)
    outputs = gamma * normalized + beta
```
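For a concrete sense of the difference, here is a minimal NumPy sketch of my own (not from the repo), assuming an input of shape `(N, T, C)`: the original call produces one mean/variance per position, while the proposed axis list produces one mean/variance per channel, shared across the whole batch.

```python
import numpy as np

x = np.random.randn(4, 5, 8)  # assumed shape: (batch N=4, time T=5, channels C=8)

# Original code: tf.nn.moments(inputs, [-1]) -> statistics per (example, position)
mean_last = x.mean(axis=-1, keepdims=True)
print(mean_last.shape)  # (4, 5, 1)

# Proposed change: axis = list(range(len(inputs_shape) - 1)) == [0, 1]
# -> statistics shared across the batch, one per channel (batch-norm style)
axis = tuple(range(x.ndim - 1))
mean_batch = x.mean(axis=axis, keepdims=True)
print(mean_batch.shape)  # (1, 1, 8)
```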

bryant03 · Jul 11 '18 10:07

The Transformer uses layer normalization rather than batch normalization, and layer normalization does not need to consider the batch information. See the Layer Normalization paper, at the end of page 2.
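To illustrate the point about batch information, here is a hypothetical NumPy check of my own (shapes assumed to be `(N, T, C)`): with layer-norm-style statistics each example is normalized using only its own activations, whereas batch-norm-style statistics couple the examples in a batch.

```python
import numpy as np

def norm(x, axis, eps=1e-8):
    # Normalize x with mean/variance computed over the given axes.
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(4, 5, 8)   # (batch N, time T, channels C)
y = x.copy()
y[0] += 100.0                  # perturb only the first example

# Layer-norm style (no batch axis in the reduction): the other examples are unchanged.
print(np.allclose(norm(x, -1)[1:], norm(y, -1)[1:]))          # True

# Batch-norm style (batch axis included): the other examples change as well.
print(np.allclose(norm(x, (0, 1))[1:], norm(y, (0, 1))[1:]))  # False
```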

RayXu14 · Sep 19 '18 03:09

> The Transformer uses layer normalization rather than batch normalization, and layer normalization does not need to consider the batch information. See the Layer Normalization paper, at the end of page 2.

However, I suspect the layer_norm implementation is still wrong; the paper does not describe the model clearly. I suggest referring to the implementation in layers.py and changing the code to `axis = list(range(1, len(inputs_shape)))`.
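A small NumPy sketch of my own (again assuming an `(N, T, C)` input) of what that change computes: `axis = list(range(1, len(inputs_shape)))` reduces over every axis except the batch axis, so each example gets a single mean and variance over all of its positions and channels.

```python
import numpy as np

x = np.random.randn(4, 5, 8)              # (batch N, time T, channels C)
inputs_shape = x.shape

axis = list(range(1, len(inputs_shape)))  # [1, 2]: everything except the batch axis
mean = x.mean(axis=tuple(axis), keepdims=True)
var = x.var(axis=tuple(axis), keepdims=True)
print(mean.shape)  # (4, 1, 1): one mean/variance per example

normalized = (x - mean) / np.sqrt(var + 1e-8)
```

For comparison, the repo's `axis=[-1]` gives per-position statistics of shape `(4, 5, 1)`. Both choices are independent of the batch, which is the layer-norm property; the disagreement here is only over whether the time axis is included in the statistics.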

RoyJoyRo · Mar 25 '19 13:03

> However, I suspect the layer_norm implementation is still wrong; the paper does not describe the model clearly. I suggest referring to the implementation in layers.py and changing the code to `axis = list(range(1, len(inputs_shape)))`.

Agreed. In this repository, the author normalizes only over the last dimension, not over the sequence-length direction. But for layer normalization, generally speaking, it should just be irrelevant to the batch dimension.

RayXu14 · Mar 28 '19 12:03