
cuDNN batchnorm behaviour is not consistent and it can output nan

Open matteosal opened this issue 3 years ago • 4 comments

This script creates a batchnorm and runs it 3 times:

  1. A first test mode evaluation
  2. A dummy training mode evaluation
  3. A second test mode evaluation

The outputs of (1) and (3) are compared under various circumstances: CPU vs GPU, cudnn batchnorm ON vs OFF, evaluation (2) with vs without backward pass.

import mxnet as mx
import numpy as np
from mxnet import autograd

def testStateChange(backward, device, cudnn):
	print()
	print('backward: ' + str(backward) + ', device: ' + str(device) + ', cudnn: ' + str(cudnn))
	sym = mx.symbol.BatchNorm(
		*[mx.symbol.Variable(name) for name in shapes.keys()],
		eps=0.001,
		fix_gamma=False,
		use_global_stats=False,
		axis=1,
		cudnn_off=not(cudnn)
	)
	op = mx.ndarray.CachedOp(sym)

	if(device == mx.cpu()):
		arguments = args_cpu
	else:
		arguments = args_gpu

	# First evaluation in test mode
	out1 = op(*arguments, default_ctx=device)

	# Dummy evaluation in training mode, with or without backward
	if(backward):
		with autograd.record(train_mode=True):
			[arg.attach_grad() for arg in arguments]
			dummy = op(*arguments, default_ctx=device)
		autograd.backward(dummy, head_grads=mx.np.ones(shapes['input'], ctx=device))
	else:
		with autograd.train_mode():
			op(*arguments, default_ctx=device)

	# Second evaluation in test mode
	out2 = op(*arguments, default_ctx=device)

	if(np.isnan(np.sum(out1.asnumpy()))):
		print('out1 has nans!')
	if(np.isnan(np.sum(out2.asnumpy()))):
		print('out2 has nans!')

	# Check if the dummy evaluation in training mode has changed the state of the 
	# batchnorm. If out1 and out2 are different, the state was changed
	print(mx.np.max(mx.np.abs(out1 - out2)))	

print("**** cudnn batchnorm inconsistency")

shapes = {'input': [1, 2, 3], 'gamma': [2], 'beta': [2], 'mean': [2], 'var': [2]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]

testStateChange(False, mx.cpu(), False)
testStateChange(True, mx.cpu(), False)

testStateChange(False, mx.gpu(), False)
testStateChange(True, mx.gpu(), False)

testStateChange(False, mx.gpu(), True)
testStateChange(True, mx.gpu(), True)

print("\n\n**** cudnn batchnorm nan")

shapes = {'input': [1, 6], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}
args_cpu = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
args_gpu = [mx.np.array(array, ctx=mx.gpu()) for array in args_cpu]

testStateChange(False, mx.gpu(), True)

I get this output from the above script:

**** cudnn batchnorm inconsistency

backward: False, device: cpu(0), cudnn: False
0.0

backward: True, device: cpu(0), cudnn: False
0.045242727

backward: False, device: gpu(0), cudnn: False
0.0

backward: True, device: gpu(0), cudnn: False
0.045242667

backward: False, device: gpu(0), cudnn: True
0.044606388

backward: True, device: gpu(0), cudnn: True
0.043622255


**** cudnn batchnorm nan

backward: False, device: gpu(0), cudnn: True
out2 has nans!
nan

This shows two problems:

  1. The dummy training mode evaluation can change the values of the moving mean and variance, making out1 and out2 differ, but it does so inconsistently. The "cudnn batchnorm inconsistency" output shows that the moving arrays are normally changed only when a BACKWARD pass in training mode is performed, but with GPU + cudnn they are already changed by the FORWARD pass (case backward: False, device: gpu(0), cudnn: True). A sketch of the expected moving-statistics update is shown after this list.
  2. The "cudnn batchnorm nan" output shows that the cudnn batchnorm can also output nan when alternating training and test mode evaluations with certain input shapes.

matteosal · Aug 01 '22 14:08

Let me suggest a few things that may be involved in these results:

  • The BatchNorm implementations may not update the moving mean and variance at the same time. Some might do it during the training forward pass, while others do it during the backward pass. This is OK in my mind and shouldn't affect the usual use case, where a training backward always follows a training forward.
  • The beta and gamma are learned parameters, right? So they will change with a training iteration and affect subsequent inference outputs.
  • Regarding the nan test: I wasn't aware that a 2D input [1, 6] was even supported. But if it is indeed supported, is it equivalent to [1, 6, 1]? A batchnorm performed over a single element might be problematic. The cudnn moving variance is unbiased, which means an m / (m-1) factor has been applied to the population variance (a small numerical check of this factor is sketched after this list). For example, see: https://github.com/apache/incubator-mxnet/blob/master/tests/python/unittest/test_numpy_op.py#L1877-L1880
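
For a concrete illustration of that m / (m-1) factor, here is a small plain-NumPy check (not MXNet code; x and m are just illustrative names):

import numpy as np

x = np.random.uniform(size=5).astype(np.float32)
m = x.size
biased = np.var(x)            # population variance (ddof=0)
unbiased = np.var(x, ddof=1)  # sample variance (ddof=1)
print(biased * m / (m - 1), unbiased)  # the two values coincide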

DickJC123 · Aug 03 '22 17:08

@DickJC123

  1. Training itself would obviously be unaffected, but there is a corner case where one might want to perform several forward passes in training mode without running any backward pass. In that case the cudnn implementation behaves differently from both the default GPU implementation and the CPU one.
  2. Yes, beta and gamma are changed by the optimizer, not by the code in these examples. There is nothing wrong with them.
  3. Yes, it's supposed to be equivalent to [1, 6, 1] (a way to check this is sketched below). So the fact that the cudnn variance is unbiased seems to explain the numerical error. I will run a few tests on this.
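
A minimal sketch of how the [1, 6] vs [1, 6, 1] equivalence could be checked on CPU in test mode, reusing the symbol/CachedOp pattern from the script above (the shapes and variable names are just for illustration):

import mxnet as mx

shapes = {'input': [1, 6], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}
sym = mx.symbol.BatchNorm(
	*[mx.symbol.Variable(name) for name in shapes.keys()],
	eps=0.001, fix_gamma=False, use_global_stats=False, axis=1, cudnn_off=True
)
op = mx.ndarray.CachedOp(sym)

args_2d = [mx.np.random.uniform(size=shape, ctx=mx.cpu()) for shape in shapes.values()]
# Same parameters, with the input reshaped to [1, 6, 1]
args_3d = [args_2d[0].reshape(1, 6, 1)] + args_2d[1:]

out_2d = op(*args_2d, default_ctx=mx.cpu())
out_3d = op(*args_3d, default_ctx=mx.cpu())
print(mx.np.max(mx.np.abs(out_2d - out_3d.reshape(1, 6))))  # expected to be 0 if the two are equivalent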

matteosal · Aug 05 '22 12:08

@DickJC123 you were in fact right about biased vs unbiased variance computation. This script tests that claim by letting a non-cudnn batchnorm and a cudnn batchnorm update their moving variance, then checking that the two updates differ and that they correspond, respectively, to the biased (non-cudnn) and the unbiased (cudnn) computation:

import mxnet as mx
import numpy as np
from mxnet import autograd

print("**** cudnn batchnorm variance")

shapes = {'input': [1, 6, 5], 'gamma': [6], 'beta': [6], 'mean': [6], 'var': [6]}

# Define batchnorms with identical specs except cudnn_off
# Note that momentum is 0, so the moving arrays are replaced every time with the latest batch statistics
sym1 = mx.symbol.BatchNorm(
	*[mx.symbol.Variable(name) for name in shapes.keys()],
	eps=0.001,
	momentum=0,
	fix_gamma=False,
	use_global_stats=False,
	axis=1,
	cudnn_off=True
)
sym2 = mx.symbol.BatchNorm(
	*[mx.symbol.Variable(name) for name in shapes.keys()],
	eps=0.001,
	momentum=0,
	fix_gamma=False,
	use_global_stats=False,
	axis=1,
	cudnn_off=False
)
op1 = mx.ndarray.CachedOp(sym1)
op2 = mx.ndarray.CachedOp(sym2)

# Define arrays for op1 and op2
# They are identical now, but they will be changed differently by the ops
args1 = [mx.np.random.uniform(size=shape, ctx=mx.gpu()) for shape in shapes.values()]
args2 = [mx.np.array(array, ctx=mx.gpu()) for array in args1]

data, gamma, beta, mean, var = args1

# Evaluation in training mode with backward that rewrites moving mean and var
with autograd.record(train_mode=True):
	[arg.attach_grad() for arg in args1]
	[arg.attach_grad() for arg in args2]
	dummy1 = op1(*args1, default_ctx=mx.gpu())
	dummy2 = op2(*args2, default_ctx=mx.gpu())
autograd.backward(dummy1, head_grads=mx.np.ones(shapes['input'], ctx=mx.gpu()))
autograd.backward(dummy2, head_grads=mx.np.ones(shapes['input'], ctx=mx.gpu()))

# Check that outputs are the same
print()
print("difference between training mode outputs")
print(mx.np.max(mx.np.abs(dummy1 - dummy2)))	

# Check updated moving vars and observe they are different
print()
print("variance updated by the non-cudnn batchnorm")
print(args1[-1])
print("variance updated by the cudnn batchnorm")
print(args2[-1])

# Manually compute biased and unbiased variance
data_mean = mx.np.mean(data, axis=(-1))
data_zeromean = data - data_mean[:, :, np.newaxis]
var1 = mx.np.mean((data_zeromean * data_zeromean), axis=(-1))
var2 = var1 * shapes['input'][-1] / (shapes['input'][-1] - 1)

print()
print("manual biased variance")
print(var1)
print("manual unbiased variance")
print(var2)

The output is:

**** cudnn batchnorm variance

difference between training mode outputs
2.3841858e-07

variance updated by the non-cudnn batchnorm
[0.12171984 0.03338415 0.03920404 0.04988261 0.02153183 0.02420242] @gpu(0)
variance updated by the cudnn batchnorm
[0.15214981 0.04173018 0.04900505 0.06235326 0.02691478 0.03025302] @gpu(0)

manual biased variance
[[0.12171984 0.03338414 0.03920404 0.04988261 0.02153182 0.02420242]] @gpu(0)
manual unbiased variance
[[0.1521498  0.04173018 0.04900505 0.06235326 0.02691478 0.03025302]] @gpu(0)

So this shows that:

  1. The training mode output is the same for the non-cudnn and cudnn implementations ("difference between training mode outputs"), so they compute the data variance in the same way at this step. It can be checked manually that their result corresponds to using the biased variance.
  2. However, the way they end up updating their moving variance is different: the non-cudnn case uses the biased variance as before, but the cudnn case uses the unbiased variance this time. Note that momentum is set to 0 for both ops, which means the moving arrays are simply replaced with the latest batch statistics; that makes the results easy to check.
  3. This explains the numerical error found in my original report. For a spatial size of 1 the biased variance is 0 and the unbiased correction factor is m / (m - 1) = 1 / 0, so the moving variance becomes nan and a subsequent evaluation fails in the cudnn case, as the plain-NumPy sketch below shows.
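
The divide-by-zero is easy to reproduce with plain NumPy (illustrative only, not the actual cuDNN code path):

import numpy as np

x = np.random.uniform(size=(1, 6, 1)).astype(np.float32)
biased = x.var(axis=(0, 2))            # one sample per channel -> all zeros
unbiased = x.var(axis=(0, 2), ddof=1)  # 0 / 0 per channel -> nan (NumPy also emits a warning)
print(biased)    # [0. 0. 0. 0. 0. 0.]
print(unbiased)  # [nan nan nan nan nan nan]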

matteosal · Sep 02 '22 16:09

So, to summarize the issues I found with the cuDNN implementation:

  1. The moving arrays are normally updated only when a BACKWARD pass in training mode is performed, but on GPU + cudnn they are updated by the FORWARD pass.
  2. In training mode, all implementations compute the biased data variance during the forward pass, but the cuDNN implementation uses the unbiased data variance to update the moving variance.

So the cuDNN implementation updates the moving variance using a different value (the unbiased one) and also at a different time (during the forward pass).
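
In sketch form (illustrative Python, not the actual kernel code; biased_var stands for the batch variance with ddof=0 and m for the per-channel sample count), the two observed update rules are roughly:

def update_moving_var_non_cudnn(moving_var, biased_var, momentum):
	# CPU and non-cudnn GPU paths: the moving variance is updated during the BACKWARD pass
	return momentum * moving_var + (1 - momentum) * biased_var

def update_moving_var_cudnn(moving_var, biased_var, momentum, m):
	# cuDNN path: the moving variance is updated during the FORWARD pass, using the
	# unbiased estimate; for m == 1 the correction m / (m - 1) is a division by zero,
	# which in float arithmetic leaves nan in the stored moving variance
	return momentum * moving_var + (1 - momentum) * biased_var * m / (m - 1)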

matteosal · Sep 02 '22 16:09