xgboost-adv-workshop-LA

Question to discuss: how to find the optimal parameters of xgboost?

lihang00 opened this issue 8 years ago • 5 comments

I usually do a grid search of the following parameter set:

param_grid = {
    "learning_rate": [0.1, 0.05, 0.01],
    "n_estimators": [100, 500, 1000],
    "max_depth": [4, 8, 12],
    "min_child_weight": [1, 5, 10],
    "subsample": [1, 0.8],
    "colsample_bytree": [1, 0.8]
}

Most of the time people can find a much better parameter set this way.
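
A minimal sketch of how a grid like this can be searched, assuming xgboost's scikit-learn wrapper (`XGBClassifier`) and scikit-learn's `GridSearchCV`; the synthetic data and the trimmed-down grid are placeholders for illustration, not the exact setup above:

```python
# A minimal sketch, not the exact setup above: synthetic data and a trimmed
# grid so the search stays tractable (the full grid is 324 combinations).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Placeholder data so the sketch runs end to end.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "learning_rate": [0.1, 0.05],
    "n_estimators": [100, 500],
    "max_depth": [4, 8],
    "min_child_weight": [1, 5],
    "subsample": [1.0, 0.8],
    "colsample_bytree": [1.0, 0.8],
}

search = GridSearchCV(
    estimator=XGBClassifier(objective="binary:logistic"),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```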

lihang00 avatar May 31 '16 18:05 lihang00

Instead of varying nrounds (the number of trees), I would use a larger number and early stopping. Excellent topic btw, thanks @lihang00.
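
A minimal sketch of that approach with the native xgboost API (`xgb.train` with `early_stopping_rounds`); the synthetic data, parameter values, and round counts are illustrative assumptions:

```python
# A minimal sketch of early stopping with the native xgboost API (not the
# scikit-learn wrapper); data and parameter values are placeholders.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dvalid = xgb.DMatrix(X_va, label=y_va)

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 6,
          "eval_metric": "auc"}

# Ask for many more rounds than you expect to need and let early stopping
# pick the effective number of trees from validation performance.
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=5000,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=50,
    verbose_eval=False,
)
print("best iteration:", booster.best_iteration)
```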

szilard avatar May 31 '16 18:05 szilard

The early stopping parameter is not supported in the scikit-learn wrapper yet... maybe we can ask whether he wants to support those parameters in the scikit-learn wrapper. :)

lihang00 avatar Jun 01 '16 17:06 lihang00

Parameters:

  • eta: step size shrinkage used in updates to prevent overfitting.
  • alpha / lambda: L1/L2 regularization terms on the weights.
  • subsample: subsample ratio of the training instances.
  • colsample_bytree: subsample ratio of columns when constructing each tree.

All of these parameters can help prevent overfitting, so how should we choose them? When tuning, can we just tune one (or a few) of them?
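
One way to explore these knobs is to vary them one at a time with cross-validation while everything else stays fixed. A minimal sketch, assuming the native API's `xgb.cv` and an AUC metric; the candidate values and synthetic data are arbitrary examples, not recommendations:

```python
# A minimal sketch: sweep one regularization parameter (alpha here) while the
# other parameters stay fixed; values and data are illustrative only.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

base = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 6,
        "eval_metric": "auc"}

for alpha in [0, 0.1, 1.0]:
    params = dict(base, alpha=alpha)
    cv = xgb.cv(params, dtrain, num_boost_round=2000, nfold=5,
                early_stopping_rounds=50, seed=42)
    print("alpha=%s rounds=%d auc=%.4f"
          % (alpha, len(cv), cv["test-auc-mean"].iloc[-1]))
```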

lihang00 avatar Jun 02 '16 05:06 lihang00

FYI https://github.com/dmlc/xgboost/blob/master/doc/how_to/param_tuning.md

Most of the time the default parameters work pretty well.

lihang00 avatar Jun 03 '16 00:06 lihang00

First you need to find a stable eta. By stable I mean that you get approximately the same results on your chosen metric if you re-run the code. This depends on your data, and the same goes for the CV setup. Usually 0.1 is fine.

Then make a sequential loop to find the best max_depth (usually independent of the other parameters).

Then use a grid search or sequential loops to find subsample and colsample.

Now you have good hyper-parameters. Run your algorithm once again with a smaller eta (e.g. 0.01), depending on how much time you want to spend. Usually the smaller, the better.

You definitely won't get the best hyper-parameters, but you will have good ones, and more importantly in a decent amount of time. You can also perform these steps with a smaller eta, which lets you add more randomness (subsample, colsample), but it will take more time.
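
A minimal sketch of this sequential procedure, assuming the native xgboost API and `xgb.cv`; the synthetic data, candidate values, metric, and round counts are illustrative assumptions rather than the exact recipe:

```python
# A minimal sketch of the sequential tuning procedure described above;
# all candidate values, the metric, and the data are assumptions.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

def cv_score(params, num_boost_round=2000):
    """5-fold CV AUC with early stopping; higher is better."""
    cv = xgb.cv(params, dtrain, num_boost_round=num_boost_round, nfold=5,
                early_stopping_rounds=50, seed=42)
    return cv["test-auc-mean"].iloc[-1]

base = {"objective": "binary:logistic", "eta": 0.1, "eval_metric": "auc"}

# Step 1: eta fixed at 0.1; sequential loop over max_depth.
best_depth = max([4, 6, 8, 10, 12],
                 key=lambda d: cv_score(dict(base, max_depth=d)))

# Step 2: small grid over subsample / colsample_bytree with max_depth fixed.
best_sub, best_col = max(
    [(s, c) for s in (1.0, 0.8, 0.6) for c in (1.0, 0.8, 0.6)],
    key=lambda sc: cv_score(dict(base, max_depth=best_depth,
                                 subsample=sc[0], colsample_bytree=sc[1])))

# Step 3: re-run with a smaller eta and more rounds for the final model.
final_params = dict(base, eta=0.01, max_depth=best_depth,
                    subsample=best_sub, colsample_bytree=best_col)
cv = xgb.cv(final_params, dtrain, num_boost_round=20000, nfold=5,
            early_stopping_rounds=200, seed=42)
print(final_params, "rounds:", len(cv),
      "auc: %.4f" % cv["test-auc-mean"].iloc[-1])
```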

jacquespeeters avatar Jun 07 '16 12:06 jacquespeeters