Chapter 2 Exercise #2 - Why use the exponential distribution?
In chapter 2, the solution to exercise #2 uses an exponential distribution for the randomized search of gamma. It explains that we should use it if we have an idea of what the range of the value should be. I'm still confused about the benefit of using an exponential distribution rather than, let's say, a uniform or normal distribution in this case, though. Can anyone help explain?
Hi Sarun,
Thanks for your question.
Suppose you sample a hyperparameter K from a uniform distribution between 0 and 100. Only 1% of the samples will be lower than 1. And only 0.01% will be lower than 0.01. So basically you will not be exploring these tiny values very much. That's fine if you expect the optimal value of K to be somewhere between 1 and 100, but if it happens to be smaller than 0.1, then you're out of luck, you probably won't find it.
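To make these percentages concrete, here is a minimal sketch using only Python's standard library. The helper name `fraction_below`, the seed and the sample count are illustrative choices, not something from the book's solution.

```python
import random

def fraction_below(samples, threshold):
    """Return the share of samples strictly below the threshold."""
    return sum(s < threshold for s in samples) / len(samples)

rng = random.Random(42)  # fixed seed so the run is repeatable (arbitrary choice)

# Draw K many times from a uniform distribution between 0 and 100.
uniform_samples = [rng.uniform(0, 100) for _ in range(200_000)]

frac_below_1 = fraction_below(uniform_samples, 1)       # theory: 1%
frac_below_001 = fraction_below(uniform_samples, 0.01)  # theory: 0.01%

print(f"K < 1:    {frac_below_1:.3%}")
print(f"K < 0.01: {frac_below_001:.3%}")
```

The measured fractions should land close to the theoretical 1% and 0.01%, which shows how rarely the tiny values get visited.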
Using the Normal distribution instead of the uniform distribution would not help. On the contrary, it would "focus" the search around a particular value. So it only makes sense if you really expect the optimal value to be very near this central value.
However, if you sample a number E from a uniform distribution between -5 and +5, and then use K=10**E as the hyperparameter, you will explore small values of K just as much as large values. Indeed, E=-3 is just as likely as E=+3, and these two outcomes result in K=0.001 and K=1000 respectively. This is what the reciprocal (log-uniform) distribution does. It's useful when you really have no idea about the scale of a hyperparameter.
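Here is a small sketch of that K=10**E trick in plain Python. The function name `sample_log_uniform`, the seed and the sample count are assumptions made for illustration. It counts how many samples fall in each decade, which should come out at roughly 10% each.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable (arbitrary choice)

def sample_log_uniform(rng, low_exp=-5, high_exp=5):
    """Draw E uniformly in [low_exp, high_exp] and return K = 10**E."""
    exponent = rng.uniform(low_exp, high_exp)
    return 10 ** exponent

k_samples = [sample_log_uniform(rng) for _ in range(100_000)]

# Share of samples in each decade [10**e, 10**(e+1)), for e = -5 .. 4.
decade_share = {}
for e in range(-5, 5):
    in_decade = sum(10 ** e <= k < 10 ** (e + 1) for k in k_samples)
    decade_share[e] = in_decade / len(k_samples)

for e, share in decade_share.items():
    print(f"10**{e:+d} <= K < 10**{e + 1:+d}: {share:.1%}")
```

Every decade gets about the same share of the samples, so values near 0.001 are explored as often as values near 1000.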
Somewhere in between these two extreme options (uniform or reciprocal) lies the exponential distribution. It also explores various scales, but it is a bit narrower than the reciprocal distribution.
So in short, use:
- The Normal distribution with mean µ and standard deviation σ if you expect the optimal value of K to be very near µ. The more confident you are, the smaller σ can be. But frankly when you're doing hyperparameter search, it's usually because you have no idea what the optimal value should be. So in general don't use this option.
- The Uniform distribution between a and b if you expect K to be somewhere in between, keeping in mind that the search mostly happens at the scale of the larger of the two bounds (in absolute value). For example, uniform between -10,000 and +10 is not much different from searching between -10,000 and -1,000, since only ~10% of the time will the search look at a value between -1,000 and +10. This option is fine when the range is fairly narrow, like searching for the number of layers in a neural net, when you expect the optimal value to be, say, between 1 and 10.
- The exponential distribution when you're not sure about the scale of the hyperparameter, but you have a rough idea. Like a learning rate that you expect to be optimal between 10^-4 and 10^-2.
- The reciprocal distribution when you have no idea about the scale of the hyperparameter, say a regularization hyperparameter.
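As a rough way to compare the four options, the sketch below measures how many decades (powers of ten) separate the 1st and 99th percentiles of each distribution. It uses only the standard library. The parameters (mean 1 and standard deviation 0.1 for the Normal, 0 to 10 for the Uniform, scale 1 for the exponential, 10**-5 to 10**5 for the reciprocal) and the helper `decades_spanned` are arbitrary choices for illustration. With SciPy, `scipy.stats.expon` and `scipy.stats.reciprocal` would be the likely equivalents of the last two.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is repeatable (arbitrary choice)
N = 100_000

def decades_spanned(samples):
    """log10 of the ratio between the 99th and 1st percentile of the samples."""
    ordered = sorted(samples)
    low = ordered[int(0.01 * len(ordered))]
    high = ordered[int(0.99 * len(ordered))]
    return math.log10(high / low)

# Illustrative parameters only; pick your own for a real search.
candidates = {
    "normal(mu=1, sigma=0.1)": [rng.gauss(1.0, 0.1) for _ in range(N)],
    "uniform(0, 10)": [rng.uniform(0.0, 10.0) for _ in range(N)],
    "exponential(scale=1)": [rng.expovariate(1.0) for _ in range(N)],
    "reciprocal(1e-5, 1e5)": [10 ** rng.uniform(-5.0, 5.0) for _ in range(N)],
}

spans = {name: decades_spanned(values) for name, values in candidates.items()}
for name, span in spans.items():
    print(f"{name:>24}: ~{span:.1f} decades between the 1st and 99th percentile")
```

In theory these spans are about 0.2, 2.0, 2.7 and 9.8 decades respectively, so under these assumptions the exponential covers a bit more scale than the uniform and far less than the reciprocal.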
Hope this helps.