
Ch11 Predicting Movie Reviews - error in back propagation code

Open harpreetmann24 opened this issue 4 years ago • 9 comments

There seems to be a small mistake in the Predicting Movie Reviews code. Here is the code:

        x,y = (input_dataset[i],target_dataset[i])
        layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0)) #embed + sigmoid
        layer_2 = sigmoid(np.dot(layer_1,weights_1_2)) # linear + softmax
          
        layer_2_delta = layer_2 - y # compare pred with truth
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) #backprop

        weights_0_1[x] -= layer_1_delta * alpha
        weights_1_2 -= np.outer(layer_1,layer_2_delta) * alpha

Error: In the forward pass, the code applies the sigmoid activation function. Therefore, when we calculate layer_1_delta, should we not multiply by the derivative of sigmoid? My understanding is that either we should not apply the sigmoid function to layer_1, or, if we do apply it, then in backprop we should multiply by its derivative.
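
For comparison, here is a minimal sketch (mine, not the book's) of the same inner-loop update with the sigmoid derivative included for the hidden layer, using the identity sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) so it can be written directly in terms of layer_1:

        layer_2_delta = layer_2 - y                           # output error (unchanged)
        layer_1_delta = (layer_2_delta.dot(weights_1_2.T)
                         * layer_1 * (1 - layer_1))           # chain rule: include sigmoid'(hidden)

        weights_0_1[x] -= layer_1_delta * alpha
        weights_1_2 -= np.outer(layer_1, layer_2_delta) * alpha

(If the loss is binary cross-entropy, layer_2 - y is already the exact gradient with respect to the output pre-activation, so only the hidden-layer delta needs the extra factor.)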

harpreetmann24 avatar Sep 20 '20 17:09 harpreetmann24

I have noticed this too and attempted this code for the weight updates:

        def sigmoidd(x):
            s = sigmoid(x)
            return s * (1 - s)

        dw12 = alpha * np.outer(layer_2_delta, sigmoidd(np.dot(layer_1, weights_1_2)))
        weights_1_2 -= dw12
        weights_0_1[x] -= np.dot(dw12, weights_1_2.T) * sigmoidd(np.sum(weights_0_1[x], axis=0))  # weight updates

This converges much slower, but the similarity comparisons seem to be a better fit.

EDIT: I think I know what has been done: sigmoid(x) = 1/2 + x/4 - x^3/48 + ..., so sigmoid'(x) = 1/4 - x^2/16 + ... Dropping the second and higher terms of the Taylor series leads to the given weight updates (the constants 1/4 and 1/16 are dropped since they do not matter, and the x in the W_0_1 update is also dropped because it is already accounted for by the one-hot indexing in [x]).

W_1_2 -= alpha * L2delta * sigmoidd(L1 * W_1_2) * L1 -> alpha * L2delta * L1

W_0_1 -= alpha * L1delta * sigmoidd(L1 * W_1_2) * sigmoidd(W_0_1 * x) * x -> alpha * L1delta  # x dropped since W_0_1[x]
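
A quick numeric check of that first-order view (my own sketch, not the book's code): near x = 0 the first Taylor terms track sigmoid and its derivative closely, so the dropped derivative factor is roughly the constant 1/4, which alpha can absorb.

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def sigmoidd(x):                    # derivative of sigmoid
            s = sigmoid(x)
            return s * (1 - s)

        xs = np.linspace(-0.5, 0.5, 101)
        print(np.abs(sigmoid(xs) - (0.5 + xs / 4)).max())   # ~0.003: first-order fit is close
        print(np.abs(sigmoidd(xs) - 0.25).max())            # ~0.015: the derivative is roughly constant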

Unilmalas avatar Mar 21 '21 17:03 Unilmalas

I have the same question. I am really a newbie at deep learning, but here is my thought: the most important thing in the backprop algorithm is giving the previous layer up or down pressure based on the delta. The algorithm can still work without the derivative term; I am just not sure how big the impact is.
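
One way to see why it can still work (a small sketch of my own, not from the book): the sigmoid derivative is always positive, so leaving it out never flips the direction of the update, only its size.

        import numpy as np

        z = np.linspace(-6, 6, 121)
        s = 1.0 / (1.0 + np.exp(-z))
        deriv = s * (1 - s)        # sigmoid'(z), peaks at 0.25 when z = 0
        print((deriv > 0).all())   # True: omitting it never changes the sign of the delta
        print(deriv.max())         # 0.25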

My guess is that the author tried the correct version with the derivative, but soon realized this example worked better without it. The same explanation applies to chapter 9: the derivative of softmax is not 1/(batch_size * layer_2.shape[0]).

JimChengLin avatar Sep 07 '21 07:09 JimChengLin

Do we really need to consider Taylor series? Wouldn't it make things more complicated?

JimChengLin avatar Sep 07 '21 07:09 JimChengLin

For me it just explained how the author came up with the approximation used. The full Taylor series is complicated, true, but it is a common approach to use just the first few terms and go from there. That works if the functions are reasonably well behaved.

Unilmalas avatar Sep 07 '21 08:09 Unilmalas

Nice insight! May I ask whether you ever ran into any problems in chapter 9? How did the author come up with the 1/(batch_size * layer_2.shape[0]) term?

JimChengLin avatar Sep 07 '21 10:09 JimChengLin

I am not sure about that either. The derivative is stated on page 173, and given that, the code makes sense. I was not able to reproduce the derivative myself, however, for the vectors filled with 0s and 1s.

Unilmalas avatar Sep 07 '21 12:09 Unilmalas

How could temp = (output - true) and output = temp/len(true) become 1/(batch_size * layer_2.shape[0])? BTW, what is the term true?

JimChengLin avatar Sep 07 '21 13:09 JimChengLin

The book is great besides those confusing code fragments. I am mainly an infrastructure engineer, so I just skip anything I cannot understand and focus on what I can figure out. It is kind of nice to talk to someone else who also got stuck on parts of the book.

JimChengLin avatar Sep 07 '21 13:09 JimChengLin

I like it too, overall; despite the bothersome issues I gave it a good Amazon review. true is the truth, I assume, i.e. the value we are training against. A poor choice of name, agreed; it is practically a reserved word in Python (True is a keyword). Well, this is the part I do not get either: I can sort of see the length term (it seems to come from the exponentials in the softmax, where anything raised to 0 gives 1, but the exponents for the 1s would give e, so I am really not sure there), and layer_2.shape[0] also gives the length along axis 0, so that part is fine. Honestly, I did not go into all the details of the MNIST example; I had solved that before. I have largely mastered DL, and for issues I cannot solve in reasonable time I find my own solution (it has been a while since I finished this book; I am now brushing up on functional thinking).
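
For what it is worth, here is a small numeric check (my own sketch, not the book's derivation) of the standard result this seems to approximate: for softmax plus cross-entropy averaged over a batch, the gradient with respect to the logits is (prediction - true) / batch_size, and any further constant divisor only rescales the step, which alpha absorbs.

        import numpy as np

        def softmax(z):
            e = np.exp(z - z.max(axis=1, keepdims=True))
            return e / e.sum(axis=1, keepdims=True)

        def avg_xent(z, true):                        # true = one-hot labels, shape (batch, classes)
            return -np.mean(np.sum(true * np.log(softmax(z)), axis=1))

        rng = np.random.default_rng(0)
        z = rng.normal(size=(4, 3))                   # batch_size = 4, three classes
        true = np.eye(3)[[0, 2, 1, 0]]

        analytic = (softmax(z) - true) / z.shape[0]   # (prediction - true) / batch_size
        numeric = np.zeros_like(z)
        eps = 1e-6
        for i in range(z.shape[0]):
            for j in range(z.shape[1]):
                zp, zm = z.copy(), z.copy()
                zp[i, j] += eps
                zm[i, j] -= eps
                numeric[i, j] = (avg_xent(zp, true) - avg_xent(zm, true)) / (2 * eps)

        print(np.max(np.abs(analytic - numeric)))     # tiny: the two gradients match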

Unilmalas avatar Sep 07 '21 13:09 Unilmalas