
Preventing overfitting when evaluating many hyperparameters

dhimmel opened this issue 9 years ago · 2 comments

In #18 I propose using a grid search to fit the classifier hyperparameters (notebook). We end up with average performance across cross-validation folds for many hyperparameter combinations. Here's the performance visualization from the notebook:

[Figure: cross-validated performance grid]

So the question is: given a performance grid, how do we pick the optimal hyperparameter combination? Picking the single highest performer can be a recipe for overfitting.

Here's a sklearn guide that doesn't answer my question but is still helpful. See also https://github.com/cognoma/machine-learning/issues/19#issuecomment-235927462, where overfitting has been mentioned. I'm paging @antoine-lizee, who has dealt with this issue in the past and who can hopefully provide solutions from afar, as he lives in the Hexagon.

dhimmel avatar Jul 28 '16 16:07 dhimmel

The sklearn documentation isn't great at describing how optimal parameters are defined; it also seems to muddle the usage of training/testing/holdout sets! See discussion about this here.

For this type of data, I think the best way to define "optimal" is based on average test-set cross-validation performance. Looking at the source code, the closest thing to this is setting `iid=True`. It's the default setting, so I don't think we should worry too much about overfitting if we take the max here.

gwaybio avatar Jul 29 '16 20:07 gwaybio

> It's the default setting so I don't think we should worry too much about overfitting if we take the max here.

I guess we should just see whether overfitting becomes an issue with the max-cross-validated-performance criterion. I'm worried that it will, especially if we want to evaluate a large number of hyperparameter settings. The example figure above required 63 combinations. So the cruel reality of max grid search is that the more extensively you evaluate the possibilities, the more overfitting you will endure.
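A toy simulation illustrates this selection effect: even when every hyperparameter setting has identical true performance, the maximum observed CV score drifts upward as more settings are evaluated, purely from cross-validation noise. The numbers below (true AUROC, noise level) are made up for illustration:

```python
# Selection bias from taking the max: all settings share the same true score,
# but the max over more settings is systematically inflated by noise.
import numpy as np

rng = np.random.default_rng(0)
true_auc, noise_sd, n_trials = 0.80, 0.02, 1000

results = {}
for n_settings in [1, 63, 1000]:
    # observed CV score for each setting = true score + CV noise
    observed = true_auc + noise_sd * rng.standard_normal((n_trials, n_settings))
    # average (over simulated trials) of the best observed score
    results[n_settings] = observed.max(axis=1).mean()
    print(n_settings, round(results[n_settings], 3))
```

With one setting the expected max equals the true score, but it climbs with 63 settings (the size of the grid above) and climbs further with 1000, which is exactly the "more evaluation, more overfitting" pattern.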

For example, in the R glmnet package there are two built-in options for choosing the regularization strength from cross-validation:

> `lambda.min` is the value of λ that gives minimum mean cross-validated error. The other λ saved is `lambda.1se`, which gives the most regularized model such that error is within one standard error of the minimum. To use that, we only need to replace `lambda.min` with `lambda.1se` above.

In my personal experience, `lambda.1se` produces better models. Unfortunately, for our general grid search, many settings don't have a natural directionality that would let us apply the glmnet approach directly.
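For a single regularization parameter, though, a glmnet-style one-standard-error rule can be sketched on top of `GridSearchCV` results (data, parameter values, and the use of `std_test_score / sqrt(n_folds)` as the standard error are illustrative assumptions, not an established sklearn feature):

```python
# Sketch: glmnet-style "1 SE" rule along a 1-D regularization path.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=30, random_state=0)

# C is inverse regularization strength: smaller C = more regularized.
Cs = np.logspace(-3, 2, 11)
grid = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear'),
    {'C': Cs}, scoring='roc_auc', cv=5,
)
grid.fit(X, y)

mean = grid.cv_results_['mean_test_score']
# Standard error of the mean score across folds (an approximation).
se = grid.cv_results_['std_test_score'] / np.sqrt(grid.n_splits_)

best = mean.argmax()
threshold = mean[best] - se[best]
# Most regularized (smallest C) setting scoring within 1 SE of the best.
one_se_index = np.flatnonzero(mean >= threshold).min()
print('C_max =', Cs[best], 'C_1se =', Cs[one_se_index])
```

The catch, as noted above, is the last step: picking the "most regularized" qualifying setting requires an ordering of the parameter values, which most hyperparameters in a general grid don't have.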

dhimmel avatar Aug 01 '16 14:08 dhimmel