
Potential bug in NegativeLogLikelihood impl

Open hweom opened this issue 3 years ago • 9 comments

Just started to explore deep learning and chose Juice as the starting framework, since I want to stick with Rust. Since I'm pretty new to the domain, it might be just my mistake.

I was looking at NegativeLogLikelihood::compute_output() and I think there is a bug. Instead of

        for &label_value in native_labels {
            let probability_value = native_probabilities[label_value as usize];
            writable_loss.push(-probability_value);
        }

it should be

        let mut offset = 0;
        for &label_value in native_labels {
            let probability_value = native_probabilities[offset + label_value as usize];
            writable_loss.push(-probability_value);
            offset += self.num_classes;
        }

Otherwise we're comparing all labels in the batch with the first output from the batch?

Interestingly, I tried changing it and there was no noticeable effect on the MNIST example...
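
To make the indexing concrete, here is a standalone sketch in plain Rust (no Juice types; the flattened [batch_size * num_classes] layout of native_probabilities is my assumption about the blob shape, not something I verified):

    fn per_example_losses(probabilities: &[f32], labels: &[usize], num_classes: usize) -> Vec<f32> {
        let mut losses = Vec::with_capacity(labels.len());
        let mut offset = 0;
        for &label in labels {
            // Pick the probability of the true class for *this* batch item,
            // instead of always indexing into the first item's row.
            losses.push(-probabilities[offset + label]);
            offset += num_classes;
        }
        losses
    }

    fn main() {
        let probs: Vec<f32> = vec![
            0.1, 0.7, 0.2, // item 0
            0.6, 0.3, 0.1, // item 1
        ];
        let labels = vec![1, 0];
        // With the offset the result is [-0.7, -0.6]; without it,
        // item 1 would also read from item 0's row.
        assert_eq!(per_example_losses(&probs, &labels, 3), vec![-0.7, -0.6]);
    }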

hweom avatar Oct 25 '21 01:10 hweom

@hweom sorry for the long delay. I see your point and would appreciate a PR for this issue with a test case. Happy to review. In practice this is not destructive: all values are summed in the following few lines, so some values just get more weight and others are omitted.

drahnr avatar Oct 29 '21 12:10 drahnr

Thanks for the answer. I'm still trying to wrap my head around the code. I think I understand why changing this part had no effect on the MNIST example -- since the NLL layer is the last one, its output is not used anywhere.

Now I have a different question :) The definition of NLL that I found gives the formula:

y = -ln (x)

(here y is the output of the NLL layer and x is input for the correct class).

However, NegativeLogLikelihood::compute_output() does this:

y = -x

And similarly, the input gradient in NegativeLogLikelihood::compute_input_gradient() is computed like this:

for (batch_n, &label_value) in native_labels.iter().enumerate() {
    let index = (num_classes * batch_n) + label_value as usize;
    writable_gradient[index] = -1f32;
}

which is essentially ∂y/∂x = -1. This matches the way the output is computed, but doesn't quite match the definition of NLL.
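
Spelling it out per batch item (my own notation: p_{b,c} is the input probability for class c of batch item b, and t_b is its label), the textbook definition vs. what the code computes:

    NLL definition:    y_b = -ln(p_{b,t_b}),    ∂y_b/∂p_{b,c} = -1/p_{b,t_b}   if c = t_b, else 0
    as implemented:    y_b = -p_{b,t_b},        ∂y_b/∂p_{b,c} = -1             if c = t_b, else 0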

FWIW, I tried changing compute_input_gradient() to compute the real NLL gradient with

writable_gradient[index] = -1.0 / native_probabilities[index].max(0.001);

(I didn't change compute_output() since it's not used, as mentioned above), but it actually caused training to stop converging.

Should the name of the layer be changed?

hweom avatar Oct 30 '21 23:10 hweom

That is correct. The implementation is pretty old and was done by the original authors of the predecessor of Juice.

The only explanation I have comes from looking at the expansion of ln(x), which is

ln(x) ~= \sum_{n=1}^{\infty} ((-1)^n / n) x^n = -x + (1/2)x^2 - (1/3)x^3 + (1/4)x^4 + ...

Now if you terminate after the first term, you get the above. This saves a bunch of computation, since ln is still rather slow. Keeping two or three terms wouldn't hurt, but that would come at a quite significant additional cost: three more multiplications, a lock call, and an array lookup in the gradient computation.

I added a commit in #151 that improves the approximation up to the x^3 term.

drahnr avatar Oct 31 '21 07:10 drahnr

Is this a Taylor expansion of ln(x)? Around which point? It doesn't look like one: example.

I'm probably missing something obvious...

hweom avatar Oct 31 '21 20:10 hweom

https://math.stackexchange.com/questions/585154/taylor-series-for-logx says almost the above; an offset of -1 or +1 respectively is missing, so I guess that's another issue.

drahnr avatar Oct 31 '21 21:10 drahnr

You mean the most upvoted answer (https://math.stackexchange.com/a/585158)? Note that it has a formula for -ln(1-x), not ln(x).

hweom avatar Oct 31 '21 21:10 hweom

That's what I meant in my previous comment

drahnr avatar Oct 31 '21 21:10 drahnr

Well, if we take the Taylor expansion suggested by Wolfram,

    ln(x) ~= (x-1) - (1/2)(x-1)^2 + (1/3)(x-1)^3 - ...

and simplify it, we'll get this: https://www.wolframalpha.com/input/?i=expand+-%28%28x-1%29+-+1%2F2%28x-1%29%5E2+%2B+1%2F3%28x-1%29%5E3%29

Or we can just add the offset and keep the math cleaner.
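
Spelled out, keeping terms up to the cubic, that expansion gives:

    -ln(x) ~= -[(x-1) - (1/2)(x-1)^2 + (1/3)(x-1)^3] = -(1/3)x^3 + (3/2)x^2 - 3x + 11/6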

hweom avatar Oct 31 '21 21:10 hweom

I think the offset is not really relevant for learning.
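
The constant term only shifts the reported loss value; it drops out of the gradient, which is what actually drives learning:

    d/dx [ -(1/3)x^3 + (3/2)x^2 - 3x + 11/6 ] = -x^2 + 3x - 3 = d/dx [ -(1/3)x^3 + (3/2)x^2 - 3x ]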

drahnr avatar Nov 29 '21 16:11 drahnr