Preventing overfitting when evaluating many hyperparameters
In #18 I propose using a grid search to fit the classifier hyperparameters (notebook). We end up with average performance across cross-validation folds for many hyperparameter combinations. Here's the performance visualization from the notebook:

So the question is: given a performance grid, how do we pick the optimal parameter combination? Simply picking the highest performer can be a recipe for overfitting.
Here's a sklearn guide that doesn't answer my question but is still helpful. See also https://github.com/cognoma/machine-learning/issues/19#issuecomment-235927462 where overfitting has been mentioned. I'm paging @antoine-lizee, who has dealt with this issue in the past, and who can hopefully provide solutions from afar as he lives in the hexagon.
The sklearn documentation isn't great at describing how they define optimal parameters... they also seem to muddle the usage of training/testing/holdout terminology! See discussion about this here.
For this type of data, I think the best way to define optimal is based on "average test-set cross-validation performance". Looking at the source code, it looks like the closest thing to this is setting `iid=True`. It's the default setting, so I don't think we should worry too much about overfitting if we take the max here.
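For concreteness, here's a minimal sketch of that criterion in `GridSearchCV`. The toy data and parameter grid are placeholders, not the cognoma setup; the point is just that `best_params_` is the argmax of the fold-averaged test score:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the real expression matrix (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# An arbitrary illustrative grid: 6 x 2 = 12 hyperparameter combinations.
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
}
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid, cv=5, scoring="roc_auc",
)
search.fit(X, y)

# The chosen combination is simply the one maximizing the mean
# cross-validated test score -- the "max" criterion discussed above.
mean_scores = search.cv_results_["mean_test_score"]
assert search.best_score_ == mean_scores.max()
print(search.best_params_)
```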
I guess we should just see whether overfitting becomes an issue with the max-cross-validated-performance criterion. I'm worried that it will, especially if we want to evaluate a large number of hyperparameter settings. The example figure above required 63 combinations. So the cruel reality of max grid search is that the more extensively you evaluate the possibilities, the more overfitting you will endure.
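A quick simulation illustrates the worry. Assume (hypothetically) that every hyperparameter setting has the same true AUC of 0.75 and CV estimates differ only by fold noise: the winner's apparent score still inflates as the grid grows, even though no setting is actually better.

```python
import numpy as np

rng = np.random.default_rng(0)
true_auc = 0.75   # every setting is equally good by construction (assumption)
fold_sd = 0.03    # fold-to-fold noise in each CV estimate (assumption)
n_folds = 5

def max_cv_estimate(n_combos):
    """Mean CV score per combination, then take the max (the grid-search criterion)."""
    fold_scores = rng.normal(true_auc, fold_sd, size=(n_combos, n_folds))
    return fold_scores.mean(axis=1).max()

# Average the winner's apparent score over many simulated searches:
# it drifts above the true 0.75 as the number of combinations grows.
for n_combos in (1, 63, 1000):
    apparent = np.mean([max_cv_estimate(n_combos) for _ in range(500)])
    print(n_combos, round(apparent, 3))
```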
For example, in the R glmnet package there are two built-in options for choosing the regularization strength from cross-validation:
> `lambda.min` is the value of λ that gives minimum mean cross-validated error. The other λ saved is `lambda.1se`, which gives the most regularized model such that error is within one standard error of the minimum. To use that, we only need to replace `lambda.min` with `lambda.1se` above.
In my personal experience, lambda.1se produces better models. Unfortunately, for our general grid search, many settings don't have a natural directionality that would allow us to use the glmnet approach.
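That said, for any single axis that does have a regularization direction, the one-standard-error rule is easy to apply by hand to a grid of fold scores. A sketch with made-up numbers (the `alphas` path and fold scores are hypothetical, not cognoma results):

```python
import numpy as np

# Hypothetical fold-level CV scores (higher = better) along a path of
# increasing regularization strength; shape is (n_alphas, n_folds).
alphas = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1.0])   # weakest -> strongest
cv_scores = np.array([
    [0.80, 0.78, 0.83, 0.79, 0.81],
    [0.82, 0.80, 0.84, 0.81, 0.83],
    [0.85, 0.79, 0.84, 0.81, 0.82],
    [0.82, 0.80, 0.82, 0.81, 0.81],
    [0.75, 0.73, 0.76, 0.74, 0.74],
])

mean = cv_scores.mean(axis=1)
se = cv_scores.std(axis=1, ddof=1) / np.sqrt(cv_scores.shape[1])

best = int(mean.argmax())              # the lambda.min analogue
threshold = mean[best] - se[best]      # within one standard error of the best
# lambda.1se analogue: the strongest regularization still above the threshold.
one_se = max(i for i in range(len(alphas)) if mean[i] >= threshold)
print(alphas[best], alphas[one_se])   # the 1-SE pick is at least as regularized
```

The catch, as noted above, is that this only works along axes where "more regularized" has a meaning; it doesn't generalize to an arbitrary multi-dimensional grid.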