Neural-Network-Experiments
Clarify that the `Softmax` derivative is good-enough
First, thank you so much for this amazing resource and video series! 🙇 Your videos are the gold standard for understanding these concepts, as well as polished end products to impress everyone with 🌠
While following along and writing my own implementation in Zig, I added some gradient check tests to ensure my backpropagation code/math was correct, and saw that they were failing whenever I used `Softmax`. I banged my head against this for a long while and even compared my network's outputs to this implementation, only to find they were exactly the same.
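For context, this is roughly the kind of gradient check I mean, sketched here in Python/NumPy rather than my actual Zig code (`loss_fn`, `params`, and the tolerances are just placeholder choices): compare the analytic gradient from backprop against a central-difference estimate of the loss gradient.

```python
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-5):
    """Central-difference estimate of d(loss)/d(params).

    loss_fn is a closure that recomputes the loss using the current
    (mutated) params array.
    """
    grad = np.zeros_like(params, dtype=float)
    for i in range(params.size):
        original = params.flat[i]
        params.flat[i] = original + eps
        loss_plus = loss_fn()
        params.flat[i] = original - eps
        loss_minus = loss_fn()
        params.flat[i] = original  # restore the parameter
        grad.flat[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

def gradient_check(analytic, numeric, tol=1e-6):
    """Relative error between the backprop gradient and the numerical one."""
    rel_error = np.abs(analytic - numeric) / np.maximum(
        1e-12, np.abs(analytic) + np.abs(numeric))
    return rel_error.max() < tol
```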
Finally, after some external help, I realized the difference between single-input activation functions like `Sigmoid`, `TanH`, and `ReLU`, and multi-input activation functions like `Softmax`, which require more work to find the full derivative. I wrote some notes on the difference, or perhaps the source code I ended up with is easier to understand.
Just wanted to add a note to the code here so others don't hit the same pitfall as hard.
It's really interesting how the "good-enough" derivative of `Softmax`, which uses only the diagonal elements of the Jacobian matrix, empirically still works well enough for the neural network to converge. The best way I was able to understand this and relate it to a concept with more research/documentation is stochastic gradient descent, which trains on mini-batches to take quick, imperfect, but good-enough steps down the cost gradient.
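To spell out what I mean by "only the diagonal elements": with $s = \mathrm{softmax}(z)$, the full Jacobian entry and the diagonal-only version are

$$
\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j)
\qquad\text{vs. diagonal only:}\qquad
\frac{\partial s_i}{\partial z_i} = s_i\,(1 - s_i),
$$

so the element-wise version simply drops the off-diagonal $-s_i s_j$ cross terms, giving an imperfect but apparently still good-enough descent direction.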