cuDNN batchnorm behaviour is inconsistent and can output nan
This script creates a batchnorm and runs it 3 times:
1. A first test-mode evaluation
2. A dummy training-mode evaluation
3. A second test-mode evaluation

The outputs of (1) and (3) are compared under various circumstances: CPU vs GPU, cudnn batchnorm ON vs OFF, evaluation (2) with vs without a backward pass.
```python
import mxnet as mx
import numpy as np
from mxnet import autograd

def testStateChange(backward, device, cudnn):
    print()
    print('backward: ' + str(backward) + ', device: ' + str(device) + ', cudnn: ' + str(cudnn))
    sym = mx.symbol.BatchNorm(
        *[mx.symbol.Variable(name) for name in shapes.keys()],
        eps=0.001,
        fix_gamma=False,
        use_global_stats=False,
        axis=1,
        cudnn_off=not cudnn
    )
    op = mx.ndarray.CachedOp(sym)
    if device == mx.cpu():
        arguments = args_cpu
    else:
        arguments = args_gpu
    # (1) First evaluation in test mode
    out1 = op(*arguments, default_ctx=device)
    # (2) Dummy evaluation in training mode, with or without backward
    if backward:
        with autograd.record(train_mode=True):
            [arg.attach_grad() for arg in arguments]
            dummy = op(*arguments, default_ctx=device)
        autograd.backward(dummy, head_grads=mx.np.ones(shapes['input'], ctx=device))
    else:
        with autograd.train_mode():
            op(*arguments, default_ctx=device)
    # (3) Second evaluation in test mode
    out2 = op(*arguments, default_ctx=device)
    if np.isnan(np.sum(out1.asnumpy())):
        print('out1 has nans!')
    if np.isnan(np.sum(out2.asnumpy())):
        print('out2 has nans!')
    # Check whether the dummy evaluation in training mode has changed the state
    # of the batchnorm: if out1 and out2 differ, the state was changed
    print(mx.np.max(mx.np.abs(out1 - out2)))

print("**** cudnn batchnorm inconsistency")
shapes = {'input': [1, 2, 3], 'gamma': [2], 'beta': [2], 'mean': [2], 'var': [2]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]
testStateChange(False, mx.cpu(), False)
testStateChange(True, mx.cpu(), False)
testStateChange(False, mx.gpu(), False)
testStateChange(True, mx.gpu(), False)
testStateChange(False, mx.gpu(), True)
testStateChange(True, mx.gpu(), True)

print("\n\n**** cudnn batchnorm nan")
shapes = {'input': [1, 6], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]
testStateChange(False, mx.gpu(), True)
```
I get this output from the above script:
```
**** cudnn batchnorm inconsistency

backward: False, device: cpu(0), cudnn: False
0.0

backward: True, device: cpu(0), cudnn: False
0.045242727

backward: False, device: gpu(0), cudnn: False
0.0

backward: True, device: gpu(0), cudnn: False
0.045242667

backward: False, device: gpu(0), cudnn: True
0.044606388

backward: True, device: gpu(0), cudnn: True
0.043622255


**** cudnn batchnorm nan

backward: False, device: gpu(0), cudnn: True
out2 has nans!
nan
```
This shows 2 problems:
- The dummy training-mode evaluation can change the values of the moving mean and variance, thus sometimes making out1 and out2 differ, but it is inconsistent in doing so. The "cudnn batchnorm inconsistency" output shows that the moving arrays are normally changed only if a BACKWARD pass in training mode is performed, but on GPU + cudnn they are changed by the FORWARD alone (case `backward: False, device: gpu(0), cudnn: True`).
- The "cudnn batchnorm nan" output shows that the cudnn batchnorm can also output nan when alternating training-mode and test-mode evaluations with certain input shapes.
Let me suggest a few things that may be involved in these results:
- The BatchNorm implementations may not update the moving mean and variance at the same time. Some might do it during the training forward pass, others during the training backward pass. This is OK in my mind and shouldn't affect the defined use case, where a training backward always follows a training forward.
- The beta and gamma are learned parameters, right? So they will change with a training iteration and affect subsequent inference outputs.
- Regarding the nan test: I wasn't aware that a 2D input [1, 6] was even supported. But if it is indeed supported, is it equivalent to [1, 6, 1]? A batchnorm performed over 1 element might be problematic: the cudnn moving variance is unbiased, which means an m / (m - 1) factor has been applied to the population variance (see the sketch after this list). For example, see: https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_numpy_op.py#L1877-L1880
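To make that factor concrete, here is a minimal numpy sketch (plain numpy, illustrative only, not the MXNet or cuDNN code) of the biased vs unbiased relation and the degenerate single-element case:

```python
import numpy as np

x = np.random.uniform(size=5)
m = x.size
biased = ((x - x.mean()) ** 2).mean()  # population variance, divides by m
unbiased = biased * m / (m - 1)        # sample variance, divides by m - 1
assert np.isclose(unbiased, x.var(ddof=1))

# With m = 1 the correction factor is 1 / (1 - 1), and the biased variance
# of a single element is 0, so the product is 0 * inf = nan
with np.errstate(divide='ignore', invalid='ignore'):
    print(np.float64(0.0) * (np.float64(1.0) / np.float64(0.0)))  # nan
```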
@DickJC123
- Training would obviously be the same, but there is a corner case where one might want to perform several forward passes in training mode without doing a backward pass. In this case the cudnn implementation would behave differently from both the default GPU one and the CPU one.
- Yes, beta and gamma are changed by the optimizer, not by the code in these examples. There is nothing wrong with them.
- Yes, it's supposed to be equivalent to [1, 6, 1]. So the fact that the cudnn variance is unbiased seems to explain the numerical error. I will make a few tests of this (a quick check of that equivalence is sketched below).
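One quick way to check that equivalence, reusing the CachedOp pattern from the script above (a CPU-only sketch; the names `shapes2d`, `op2d`, `op3d` are mine, and I haven't verified this is how the 2D case is dispatched internally):

```python
import mxnet as mx

# In test mode, a [1, 6] input should match its [1, 6, 1] reshape
# if the 2D case is really treated as spatial size 1
shapes2d = {'input': [1, 6], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}
sym = mx.symbol.BatchNorm(
    *[mx.symbol.Variable(name) for name in shapes2d.keys()],
    eps=0.001, fix_gamma=False, use_global_stats=False, axis=1)
# Separate cached ops so the two calls cannot share any shape state
op2d = mx.ndarray.CachedOp(sym)
op3d = mx.ndarray.CachedOp(sym)
args = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes2d.values()]
out2d = op2d(*args, default_ctx=mx.cpu())
out3d = op3d(args[0].reshape(1, 6, 1), *args[1:], default_ctx=mx.cpu())
# Should print a value close to 0 if the two are equivalent
print(mx.np.max(mx.np.abs(out2d - out3d.reshape(1, 6))))
```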
@DickJC123 you were in fact right about the biased vs unbiased variance computation. This script tests that claim by letting a non-cudnn batchnorm and a cudnn batchnorm update their moving variance, then checking that the updates differ and that they respectively correspond to the biased (non-cudnn) and the unbiased (cudnn) computations:
```python
import mxnet as mx
import numpy as np
from mxnet import autograd

print("**** cudnn batchnorm variance")
shapes = {'input': [1, 6, 5], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}

# Define batchnorms with identical specs except cudnn_off
# Note that momentum is 0, so the moving arrays are replaced every time
# with the latest batch statistics
sym1 = mx.symbol.BatchNorm(
    *[mx.symbol.Variable(name) for name in shapes.keys()],
    eps=0.001,
    momentum=0,
    fix_gamma=False,
    use_global_stats=False,
    axis=1,
    cudnn_off=True
)
sym2 = mx.symbol.BatchNorm(
    *[mx.symbol.Variable(name) for name in shapes.keys()],
    eps=0.001,
    momentum=0,
    fix_gamma=False,
    use_global_stats=False,
    axis=1,
    cudnn_off=False
)
op1 = mx.ndarray.CachedOp(sym1)
op2 = mx.ndarray.CachedOp(sym2)

# Define arrays for op1 and op2
# They are identical now, but they will be changed differently by the ops
args1 = [mx.np.random.uniform(size=shape, ctx=mx.gpu()) for shape in shapes.values()]
args2 = [mx.np.array(array, ctx=mx.gpu()) for array in args1]
data, gamma, beta, mean, var = args1

# Evaluation in training mode with backward, which rewrites the moving mean and var
with autograd.record(train_mode=True):
    [arg.attach_grad() for arg in args1]
    [arg.attach_grad() for arg in args2]
    dummy1 = op1(*args1, default_ctx=mx.gpu())
    dummy2 = op2(*args2, default_ctx=mx.gpu())
autograd.backward(dummy1, head_grads=mx.np.ones(shapes['input'], ctx=mx.gpu()))
autograd.backward(dummy2, head_grads=mx.np.ones(shapes['input'], ctx=mx.gpu()))

# Check that the training-mode outputs are the same
print()
print("difference between training mode outputs")
print(mx.np.max(mx.np.abs(dummy1 - dummy2)))

# Check the updated moving vars and observe that they are different
print()
print("variance updated by the non-cudnn batchnorm")
print(args1[-1])
print("variance updated by the cudnn batchnorm")
print(args2[-1])

# Manually compute the biased and unbiased variance
data_mean = mx.np.mean(data, axis=-1)
data_zeromean = data - data_mean[:, :, np.newaxis]
var1 = mx.np.mean(data_zeromean * data_zeromean, axis=-1)
var2 = var1 * shapes['input'][-1] / (shapes['input'][-1] - 1)
print()
print("manual biased variance")
print(var1)
print("manual unbiased variance")
print(var2)
```
The output is:
```
**** cudnn batchnorm variance

difference between training mode outputs
2.3841858e-07

variance updated by the non-cudnn batchnorm
[0.12171984 0.03338415 0.03920404 0.04988261 0.02153183 0.02420242] @gpu(0)
variance updated by the cudnn batchnorm
[0.15214981 0.04173018 0.04900505 0.06235326 0.02691478 0.03025302] @gpu(0)

manual biased variance
[[0.12171984 0.03338414 0.03920404 0.04988261 0.02153182 0.02420242]] @gpu(0)
manual unbiased variance
[[0.1521498 0.04173018 0.04900505 0.06235326 0.02691478 0.03025302]] @gpu(0)
```
So this shows that:
- The training-mode output is the same between the non-cudnn and cudnn implementations ("difference between training mode outputs"), so they compute the data variance in the same way at this step. It can be checked manually that their result corresponds to using the biased variance.
- However, the way they end up changing their moving variance is different: the non-cudnn case uses the biased variance as before, but the cudnn case uses the unbiased variance this time (indeed, 0.12171984 * 5 / 4 = 0.1521498, which matches the cudnn value; the two update rules are sketched after this list). Note that the momentum is set to 0 for both ops, which means the moving arrays are simply replaced with the latest values; that makes the results easy to check.
- This explains the numerical error found in my original report. For a spatial size of 1, the biased variance (which is 0) gets multiplied by the factor m / (m - 1) = 1 / 0, giving 0 / 0 = nan, which makes a subsequent evaluation fail in the cudnn case.
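For clarity, this is a minimal sketch of the two update rules these results imply (plain numpy with made-up names, not the actual MXNet or cuDNN internals):

```python
import numpy as np

def update_moving_var_noncudnn(moving_var, batch_var_biased, momentum):
    # non-cudnn path: blends in the *biased* batch variance
    return momentum * moving_var + (1 - momentum) * batch_var_biased

def update_moving_var_cudnn(moving_var, batch_var_biased, momentum, m):
    # cudnn path: first rescales the biased batch variance to the
    # *unbiased* estimate; for m == 1 this is 0 / 0 = nan on float arrays
    batch_var_unbiased = batch_var_biased * m / (m - 1)
    return momentum * moving_var + (1 - momentum) * batch_var_unbiased

# With momentum = 0 (as in the script above), the moving variance is simply
# replaced: biased in the first case, unbiased (nan when m == 1) in the second
v = np.array([0.5])
print(update_moving_var_noncudnn(v, np.array([0.0]), 0.0))      # [0.]
with np.errstate(divide='ignore', invalid='ignore'):
    print(update_moving_var_cudnn(v, np.array([0.0]), 0.0, 1))  # [nan]
```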
So, to summarize the issues I found with the cuDNN implementation:
- The moving arrays are normally updated only if a BACKWARD pass in training mode is performed, but on GPU + cudnn they are changed by the FORWARD alone.
- In training mode, all implementations compute the biased data variance during the forward pass, but the cuDNN implementation uses the unbiased data variance to update the moving variance.

So the cuDNN implementation updates the moving variance using a different value (the unbiased one) and also at a different time (during the forward).