Grokking-Deep-Learning
Ch11 Predicting Movie Reviews - error in back propagation code
There seems to be a small mistake in the Predicting Movie Reviews code. Here is the code:
x,y = (input_dataset[i],target_dataset[i])
layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0)) #embed + sigmoid
layer_2 = sigmoid(np.dot(layer_1,weights_1_2)) # linear + softmax
layer_2_delta = layer_2 - y # compare pred with truth
layer_1_delta = layer_2_delta.dot(weights_1_2.T) #backprop
weights_0_1[x] -= layer_1_delta * alpha
weights_1_2 -= np.outer(layer_1,layer_2_delta) * alpha
Error: In the forward pass, the code applies the sigmoid activation function. Therefore, when we calculate layer_1_delta, should we not multiply by the derivative of the sigmoid? My understanding is that either we should not apply the sigmoid function to layer_1, or, if we do apply it, then in backprop we should multiply by its derivative.
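For reference, here is a minimal sketch of what the mathematically complete update would look like, with the sigmoid derivative included for layer_1. The shapes and data below are toy stand-ins, not the book's actual dataset. Note that layer_2_delta = layer_2 - y is already exact if the loss is binary cross-entropy, since the output sigmoid's derivative cancels there; only the hidden-layer term is missing:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Toy stand-ins for the book's variables (shapes are assumptions, not the real dataset)
np.random.seed(1)
hidden_size = 4
weights_0_1 = np.random.randn(10, hidden_size) * 0.1  # word-embedding rows
weights_1_2 = np.random.randn(hidden_size, 1) * 0.1
alpha = 0.01
x = [1, 3, 5]   # indices of words present in one review
y = 1           # target label

layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0))   # embed + sigmoid
layer_2 = sigmoid(np.dot(layer_1, weights_1_2))     # predict

layer_2_delta = layer_2 - y  # exact for binary cross-entropy loss
# Mathematically complete backprop: multiply by sigmoid'(layer_1) = layer_1 * (1 - layer_1)
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * layer_1 * (1 - layer_1)

weights_0_1[x] -= layer_1_delta * alpha
weights_1_2 -= np.outer(layer_1, layer_2_delta) * alpha
```

This only changes the layer_1_delta line relative to the book's code; everything else is the same update rule.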
I have noticed this too and attempted this code for the weight updates:
def sigmoidd(x):
    s = sigmoid(x)
    return s * (1 - s)
dw12 = alpha * np.outer(layer_2_delta, sigmoidd(np.dot(layer_1, weights_1_2)))
weights_1_2 -= dw12
weights_0_1[x] -= np.dot(dw12, weights_1_2.T) * sigmoidd(np.sum(weights_0_1[x], axis=0)) # weight updates
This converges much slower, but the similarity comparisons seem to be a better fit.
EDIT: I think I know what has been done: sigmoid(x) = 1/2 + x/4 - x^3/48 +- ..., so sigmoid'(x) = 1/4 - 3x^2/48 +- ... = 1/4 - x^2/16 +- ... Drop the second and later terms of the Taylor series, which leads to the given weight updates (the constants 1/4 and 1/16 are dropped since they do not matter; the x in the W_0_1 update is also dropped because it is already accounted for by the [x] one-hot indexing).
W_1_2 -= alpha * L2delta * sigmoidd(L1 * W_1_2) * L1  ->  alpha * L2delta * L1
W_0_1 -= alpha * L1delta * sigmoidd(L1 * W_1_2) * sigmoidd(W_0_1 * x) * x  ->  alpha * L1delta  # x dropped since W_0_1[x]
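A quick numerical check of this reading (a toy sketch, not from the book): near x = 0 the sigmoid derivative s(x)(1 - s(x)) stays close to the leading Taylor constant 1/4, so dropping the higher-order terms and folding the constant into alpha is a reasonable approximation while activations are small:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

# Compare sigmoid'(x) with its leading Taylor term 1/4 near the origin
xs = np.linspace(-0.5, 0.5, 11)
max_err = np.max(np.abs(sigmoid_deriv(xs) - 0.25))
# max_err stays small on [-0.5, 0.5]: sigmoid'(x) = 1/4 - x^2/16 + ...
```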
I have the same question. I am really a newbie in deep learning, but here is my thought: the most important thing in the backprop algorithm is giving the previous layer up or down pressure based on the delta. The algorithm can still work without the derivative term; I am just not sure how big the impact is.
My guess is that the author tried the correct version with the derivative, but soon realized this example worked better without it. The same explanation applies to chapter 9: the derivative of softmax is not 1/(batch_size * layer_2.shape[0]).
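On the chapter 9 point: for softmax combined with a mean cross-entropy loss, the exact gradient with respect to the logits works out to (prediction - target) / batch_size, with no extra layer_2.shape[0] factor; that extra factor only rescales the effective learning rate. A toy finite-difference check (hypothetical shapes, not the book's MNIST code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

np.random.seed(0)
batch_size, n_classes = 8, 10
logits = np.random.randn(batch_size, n_classes)
labels = np.eye(n_classes)[np.random.randint(0, n_classes, batch_size)]  # one-hot targets

pred = softmax(logits)
grad = (pred - labels) / batch_size  # analytic gradient of mean cross-entropy

def loss(lg):
    return -np.sum(labels * np.log(softmax(lg))) / batch_size

# Finite-difference check of a single logit entry
eps = 1e-5
bumped = logits.copy()
bumped[0, 0] += eps
numeric = (loss(bumped) - loss(logits)) / eps
```

The numeric estimate matches grad[0, 0], which supports the view that the book's derivative of softmax cancels down to (pred - target) before the batch averaging.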
Do we really need to consider the Taylor series? Wouldn't it make things more complicated?
For me it just explained how the author came up with the approximation used. The full Taylor series is complicated, true, but it is a common approach to use just the first terms and go from there. Works if the functions are reasonably well-behaved.
Nice insight! May I ask, have you encountered any problems in chapter 9? How did the author come up with the 1/(batch_size * layer_2.shape[0]) term?
I am not sure about that either. The derivative is stated on page 173, and given that, the code makes sense. However, I was not able to reproduce the derivative for the vectors filled with 0s and 1s.
How could temp = (output - true) and output = temp/len(true) become 1/(batch_size * layer_2.shape[0])? BTW, what is the term true?
The book is great besides those confusing code fragments. I am mainly an infrastructure engineer, so I just skip anything I cannot understand and focus on what I can figure out. It's kind of nice to talk to someone else who also got stuck on parts of the book.
I like it too, overall; despite the bothersome issues, I gave it a good Amazon review. true is the truth, I assume, i.e. the value we are training against. A poor name choice, agreed, since it is so close to the reserved word True in Python. Well, this is what I don't get either: I can see the len(true) factor (this comes from the exponentials in the softmax raised to 0 giving 1, but the exponents for the 1s would give e, so I am really not sure there). layer_2.shape[0] also gives the length along axis 0, so that part is fine. Honestly I did not go into all the details in the MNIST example; I had solved that before. I have largely mastered DL, and for issues I can't solve in reasonable time I find my own solution (it's been a while since I finished this book; now I am brushing up on functional thinking).