Neural-Network-Experiments
Clarify that the `Softmax` derivative is good-enough
First, thank you so much for this amazing resource and video series! 🙇 Your videos are the gold standard for understanding these concepts, as well as polished end products to impress everyone with 🌠
While following along and writing my own implementation in Zig, I added some gradient check tests to ensure my backpropagation code/math was correct, and saw that they were failing whenever I used `Softmax`. I banged my head against this for a long while and even compared my network's outputs to this implementation, only to find they were exactly the same.
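For context, this is roughly the kind of gradient check I mean, sketched here in Python/NumPy rather than my actual Zig code (`loss_fn`, `params`, and the tolerances are just placeholder choices): compare the analytic gradient from backprop against a central-difference estimate of the loss gradient.

```python
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-5):
    """Central-difference estimate of d(loss)/d(params).

    loss_fn is a closure that recomputes the loss using the current
    (mutated) params array.
    """
    grad = np.zeros_like(params, dtype=float)
    for i in range(params.size):
        original = params.flat[i]
        params.flat[i] = original + eps
        loss_plus = loss_fn()
        params.flat[i] = original - eps
        loss_minus = loss_fn()
        params.flat[i] = original  # restore the parameter
        grad.flat[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

def gradient_check(analytic, numeric, tol=1e-6):
    """Relative error between the backprop gradient and the numerical one."""
    rel_error = np.abs(analytic - numeric) / np.maximum(
        1e-12, np.abs(analytic) + np.abs(numeric))
    return rel_error.max() < tol
```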
Finally, after some external help, I realized the difference between single-input activation functions like `Sigmoid`, `TanH`, and `ReLU`, and multi-input activation functions like `Softmax`, which require more work to find the full derivative. I wrote some notes on the difference, or perhaps the source code I ended up with is easier to understand.
Just wanted to add a note to the code here so others don't hit the same pitfall as hard.
It's really interesting how the "good-enough" derivative of `Softmax`, which uses only the diagonal elements of the Jacobian matrix, empirically still works well enough for the neural network to converge. The best way I was able to understand this and relate it to a concept with more research/documentation is stochastic gradient descent, which trains on mini-batches to take quick, imperfect, but good-enough steps down the cost gradient.
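To spell out what I mean by "only the diagonal elements": with $s = \mathrm{softmax}(z)$, the full Jacobian entry and the diagonal-only version are

$$
\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)
\qquad\text{vs. diagonal only:}\qquad
\frac{\partial s_i}{\partial z_i} = s_i\,(1 - s_i),
$$

so the element-wise version simply drops the off-diagonal $-s_i s_j$ cross terms, giving an imperfect but apparently still good-enough descent direction.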