
Cross validation support during HPO

Geethen opened this issue 3 years ago · 6 comments

I used the hyperparameter optimisation and found it really useful - thanks. I was hoping to carry out the same process with cross-validation (specifically GroupKFold). Support for scikit-learn CV would be very useful.

Geethen avatar Oct 03 '22 14:10 Geethen

Hi Geethen. I presume you are referring to the LGBMTuner class within verstack, right? How big is the dataset you are using for training, and what is the task type (regression/binary/multiclass)?

DanilZherebtsov avatar Oct 04 '22 09:10 DanilZherebtsov

Yes, that's correct: LGBMTuner within verstack.

Dataset size: 1.3 GB when stored as feather, with roughly 450,000 rows and 707 columns. Task type: regression.

Geethen avatar Oct 04 '22 09:10 Geethen

For regression tasks, every trial of hyperparameter optimisation within LGBMTuner is carried out on a new random split, which is very similar to the cross-validation approach you are seeking. If you are using LGBMTuner with default parameters (200 trials), that means you will have 200 random train/valid splits during the tuning process. Moreover, your dataset is big enough not to worry about additional validation.

DanilZherebtsov avatar Oct 13 '22 05:10 DanilZherebtsov
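
For illustration, here is a minimal, hypothetical sketch of the per-trial random-split idea described above, written with Optuna and scikit-learn. It is not LGBMTuner's actual implementation; the dataset, parameter ranges and metric are placeholders.

```python
# Conceptual sketch only -- NOT LGBMTuner's source code.
# Each Optuna trial scores its parameter set on a fresh random
# train/valid split, so 200 trials see 200 different splits.
import lightgbm as lgb
import optuna
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5_000, n_features=20, random_state=0)

def objective(trial):
    params = {
        "objective": "regression",
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "verbosity": -1,
    }
    # a new random split on every trial
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=trial.number
    )
    model = lgb.LGBMRegressor(**params).fit(X_tr, y_tr)
    return mean_squared_error(y_val, model.predict(X_val))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```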

I've been planning to add a holdout-validation option into LGBMTuner for quite some time now, specifically for time series applications. I guess this will be the motivation. I will let you know when it is available.

DanilZherebtsov avatar Oct 13 '22 13:10 DanilZherebtsov

I'm working with spatial data, so I use GroupKFold to limit spatial autocorrelation, hence my request. Thanks for looking further into this. I think having a tuner that can take any CV option as an argument would be the most flexible.

If I remember correctly, Optuna did allow a scikit-learn CV strategy to be specified. I could share an example, if that's helpful?


Geethen avatar Oct 13 '22 14:10 Geethen
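
For reference, a hedged sketch of the kind of Optuna + GroupKFold example offered above: the objective scores each parameter set with scikit-learn's cross_val_score using GroupKFold splits. The make_objective wrapper, the groups array (one spatial-block id per row) and the parameter ranges are illustrative assumptions, not part of verstack.

```python
# Illustrative sketch of tuning LightGBM with Optuna + GroupKFold.
# `groups` is assumed to hold one spatial-block id per row; the
# helper name `make_objective` and the parameter ranges are made up.
import lightgbm as lgb
import optuna
from sklearn.model_selection import GroupKFold, cross_val_score

def make_objective(X, y, groups):
    cv = GroupKFold(n_splits=5)

    def objective(trial):
        params = {
            "objective": "regression",
            "num_leaves": trial.suggest_int("num_leaves", 16, 256),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "verbosity": -1,
        }
        model = lgb.LGBMRegressor(**params)
        # GroupKFold keeps rows from the same spatial block in the same fold
        scores = cross_val_score(
            model, X, y, groups=groups, cv=cv,
            scoring="neg_root_mean_squared_error", n_jobs=-1,
        )
        return -scores.mean()  # minimise mean RMSE across folds

    return objective

# usage (X, y, groups are your own data):
# study = optuna.create_study(direction="minimize")
# study.optimize(make_objective(X, y, groups), n_trials=100)
```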

> If I remember correctly, Optuna did allow a scikit-learn CV strategy to be specified. I could share an example, if that's helpful?

If you have something handy, please share - it would be a good starting point for me.

DanilZherebtsov avatar Oct 18 '22 13:10 DanilZherebtsov