Claim about the activation functions (footnote 37)
In the French edition, "Deep Learning avec Keras et TensorFlow - 2e éd. - Mise en oeuvre et cas concrets" (page 61, footnote 37 at the bottom of the page),
you say that biological neurons seem to implement a sigmoid-like activation function, which led researchers to persist in using sigmoids, and that this is therefore an example of a case where the analogy with nature may have been misleading.
However, this statement seems inaccurate to me: based on the Wikipedia page https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#cite_note-Hahnloser2000-1 and the abstract of the original publication https://www.nature.com/articles/35016072, ReLU also seems to have "strong biological motivations".
I don't know enough neuroscience to tell whether this is a point on which there is not yet a consensus, but I wanted to raise the question anyway.
Thanks for your feedback @atonkamanda, that's a very interesting remark.
I think the timeline looks like this:
- 1940s: researchers create the first artificial neurons and use a step function, because biological neurons had been observed to exhibit this step-like behavior. Since so many biological systems saturate, this seemed quite reasonable (in fact all biological systems saturate at some point, but some have a fairly linear region they can operate in).
- 1960s: among many explorations, Fukushima et al. propose using the rectifier function max(0, x), now known as ReLU, but apparently people kept using step functions. Perhaps the benefits of ReLU were not apparent yet because there was no effective way of training neural nets back then. And perhaps the fact that it does not seem biologically plausible (due to the lack of saturation) may have played a role. It's hard to say.
- 1980s: when backprop is introduced, the sigmoid activation function is chosen because it looks very similar to a step function but has the nice property of being differentiable everywhere, with a non-zero derivative everywhere, so it plays nicely with gradient descent (see the small sketch after this list). The fact that biological saturation is often modelled with the sigmoid function may also have played a role. I'm not sure the ReLU activation function was even considered back then. We'd have to ask old timers like LeCun or Hinton. Note: I'm not entirely sure, but I think it was Hinton who said that people stuck with sigmoid because it seemed biologically plausible. It may have been a self-critique.
- 2000: a paper (the Hahnloser et al. Nature paper you linked) suggests that the rectifier may, after all, be biologically plausible. It got plenty of citations, but many of them are actually quite recent. It was the ice age of neural nets anyway, so most researchers were busy with other topics.
- 2010s: ReLUs are popularized by Glorot, Bordes and Bengio. I suspect the 2000 paper got more traction then, as a way to explain why ReLU is so successful.
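
To make the 1980s point concrete, here's a tiny sketch (not from the book, just an illustration) of the three activations discussed above and their derivatives, showing why the sigmoid was a convenient fit for gradient descent while the step function was not:

```python
# Quick sketch: step, sigmoid and ReLU, plus their derivatives.
import numpy as np

def step(x):
    return (x >= 0).astype(float)      # derivative is 0 almost everywhere: useless for backprop

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # strictly positive for every x (but vanishes for large |x|)

def relu(x):
    return np.maximum(0.0, x)          # Fukushima's rectifier, max(0, x)

def relu_prime(x):
    return (x > 0).astype(float)       # 0 for x < 0, constant 1 for x > 0 (no saturation)

x = np.linspace(-5, 5, 11)
print(sigmoid_prime(x))                # non-zero everywhere
print(relu_prime(x))                   # exactly 0 or 1
```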
Does this sound reasonable?