Fully Connected Net gradient issue
When a fully connected net has more than 3 layers, the backprop gradient and the numerical gradient show a significant difference. This issue can be reproduced in Dropout.ipynb (in the cell "Fully-connected nets with Dropout"):
import numpy as np
# Helpers from the assignment code (import paths assumed from assignment 2):
from cs231n.classifiers.fc_net import FullyConnectedNet
from cs231n.gradient_check import eval_numerical_gradient
# rel_error is defined in the Dropout.ipynb notebook itself.

N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))
s = np.random.randint(1)  # randint(1) always returns 0, so the seed is 0

for dropout in [0, 0.25, 1.0]:
    print 'Running check with dropout = ', dropout
    # seven hidden layers in total -> eight weight matrices W1..W8
    model = FullyConnectedNet([H1, 10, 10, 10, 10, 10, H2],
                              input_dim=D, num_classes=C,
                              weight_scale=5e-2, dtype=np.float64,
                              dropout=dropout, seed=s)
    loss, grads = model.loss(X, y)
    print 'Initial loss: ', loss
    for name in sorted(grads):
        f = lambda _: model.loss(X, y)[0]
        grad_num = eval_numerical_gradient(f, model.params[name],
                                           verbose=False, h=1e-5)
        print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))
    print
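For context, eval_numerical_gradient perturbs one parameter entry at a time and takes a centered difference, so every entry of every W and b triggers two full forward passes. A minimal sketch of that check (my paraphrase of the cs231n/gradient_check.py helper, not a copy of it):

import numpy as np

def numeric_gradient(f, x, h=1e-5):
    # Centered-difference estimate of df/dx for a scalar function f.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)              # f(x + h)
        x[ix] = old - h
        fxmh = f(x)              # f(x - h)
        x[ix] = old              # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

Because each estimate differentiates across a kink whenever a ReLU input sits within h of zero, some disagreement with backprop is expected; the question is why it is this large, and only for the biases.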
The output of the check above is:
Running check with dropout = 0
Initial loss: 2.30258505897
W1 relative error: 2.41e-03
W2 relative error: 1.21e-03
W3 relative error: 1.60e-03
W4 relative error: 2.15e-03
W5 relative error: 1.75e-03
W6 relative error: 2.10e-03
W7 relative error: 1.89e-03
W8 relative error: 1.37e-03
b1 relative error: 1.76e-03
b2 relative error: 1.69e-02
b3 relative error: 6.03e-01
b4 relative error: 1.00e+00
b5 relative error: 1.00e+00
b6 relative error: 1.00e+00
b7 relative error: 1.00e+00
b8 relative error: 7.83e-11
Running check with dropout = 0.25
We use dropout with p =0.250000
Initial loss: 2.30258509299
W1 relative error: 0.00e+00
W2 relative error: 0.00e+00
W3 relative error: 0.00e+00
W4 relative error: 0.00e+00
W5 relative error: 0.00e+00
W6 relative error: 0.00e+00
W7 relative error: 0.00e+00
W8 relative error: 0.00e+00
b1 relative error: 0.00e+00
b2 relative error: 0.00e+00
b3 relative error: 0.00e+00
b4 relative error: 0.00e+00
b5 relative error: 1.00e+00
b6 relative error: 1.00e+00
b7 relative error: 1.00e+00
b8 relative error: 6.99e-11
Running check with dropout = 1.0
We use dropout with p =1.000000
Initial loss: 2.30258510213
W1 relative error: 3.55e-03
W2 relative error: 2.40e-03
W3 relative error: 2.44e-03
W4 relative error: 1.94e-03
W5 relative error: 1.98e-03
W6 relative error: 1.89e-03
W7 relative error: 2.13e-03
W8 relative error: 2.68e-03
b1 relative error: 2.36e-03
b2 relative error: 6.30e-04
b3 relative error: 7.33e-02
b4 relative error: 2.98e-01
b5 relative error: 1.00e+00
b6 relative error: 1.00e+00
b7 relative error: 1.00e+00
b8 relative error: 1.44e-10
I have tried several random seeds on this, and the bias gradient of the last layer is always correct, while the bias errors become extremely large from the last hidden layer backwards. The errors on the W gradients, however, seem fine all the time. I first noticed this weird behavior in my own implementation, and the same thing occurs in your implementation. Any ideas?
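For what it's worth, an error of exactly 1.00e+00 is the saturation value of the notebook's rel_error metric: whenever one gradient is exactly zero and the other is merely small (above the 1e-8 floor), the metric returns exactly 1.0. A minimal sketch, assuming rel_error is defined as in the course notebooks:

import numpy as np

def rel_error(x, y):
    # Relative error metric as defined in the cs231n notebooks (assumed).
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))

# If the analytic gradient is exactly zero while the numerical estimate
# is tiny but nonzero (or vice versa), the metric saturates at 1.0:
analytic = np.zeros(3)
numeric = np.array([1e-6, -2e-6, 5e-7])
print rel_error(analytic, numeric)   # prints 1.0

So the rows showing exactly 1.00e+00 might mean that one of the two bias gradients is identically zero in those layers, not that the two values are far apart in absolute terms.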