
Is it possible to control how the samples are chosen in bagging?

Open simpsus opened this issue 2 years ago • 1 comment

Summary

I have a dataset with a categorical feature that is not used for training, but I would like to use that feature to control how samples are selected during bagging.

Motivation

While harmful as a training feature, the categorical feature groups the samples, and accuracy improves when those groups are not split up during bagging.

Description

I would like only complete groups of that feature to be selected during bagging. Implementations I could imagine:

  • a categorical variable can be passed so that only whole groupings of it are selected during bagging. Problem: since the variable must not be used for training, it cannot be a normal feature in the train_set. Maybe a bagging_groups parameter in the Dataset constructor?
  • giving the user control over bagging by letting them pass a function with signature bag(samples, fraction) that returns the samples selected for that iteration (a rough sketch of this idea follows below)
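
For illustration only, a minimal sketch of the requested behaviour implemented outside LightGBM with numpy: sample a fraction of the groups (rather than rows) and keep every row belonging to each selected group. The bag_by_group name and its arguments are assumptions for this sketch, not an existing bagging hook:

```python
import numpy as np

def bag_by_group(group_ids, fraction, rng=None):
    """Return row indices for one bagging iteration, keeping groups intact.

    group_ids: 1-D array-like with one group label per training row.
    fraction:  share of *groups* (not rows) to sample, analogous to bagging_fraction.
    """
    rng = np.random.default_rng() if rng is None else rng
    group_ids = np.asarray(group_ids)
    groups = np.unique(group_ids)
    n_pick = max(1, int(round(fraction * len(groups))))
    picked = rng.choice(groups, size=n_pick, replace=False)
    # keep every row of a picked group, so no group is ever split
    return np.flatnonzero(np.isin(group_ids, picked))
```

Indices produced this way could, for example, be passed to Dataset.subset() in a manual boosting loop, but that works around LightGBM's built-in bagging rather than controlling it.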

References

The evaluation metric already works within the groups; using pandas it looks something like

score = data.groupby('feature').apply(lambda df: some_custom_function(df)).mean()
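
For context, one possible way to wire such a grouped score into the Python API today is through the feval argument of lgb.train; the group array, the per-group scoring function, and the metric name below are assumptions, not part of the request itself:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

def make_grouped_metric(groups, per_group_score):
    """Wrap a per-group scoring function into an feval-compatible callable.

    groups: one group label per row of the evaluation set (order must match).
    per_group_score: function taking a per-group DataFrame and returning a float.
    """
    groups = np.asarray(groups)

    def grouped_metric(preds, eval_data):
        df = pd.DataFrame({
            "pred": preds,
            "label": eval_data.get_label(),
            "group": groups,
        })
        score = df.groupby("group").apply(per_group_score).mean()
        # a custom feval returns (metric name, value, is_higher_better)
        return "grouped_score", float(score), True

    return grouped_metric

# booster = lgb.train(params, train_set, valid_sets=[valid_set],
#                     feval=make_grouped_metric(valid_groups, some_custom_function))
```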

simpsus avatar Feb 03 '23 11:02 simpsus

This is currently not possible as far as I am aware. I just came here to put a request in for this. I have at least two use cases in mind, where it would be very helpful:

  1. When there are multiple records over time for an individual (could be a person, a store, an item, etc.). If your intention is to generalize well to new, unseen individuals, then sampling the complete data for an individual in each tree is likely helpful.
  2. When you do time-to-event (aka survival) prediction and you don't want a constant hazard function (which you could get by using a single record per individual with a Poisson target, potentially with an offset), but rather a hazard that is piecewise constant over time. In that case you split each individual's record into time intervals, e.g. for an individual with an event after 5 years you create records with feature year = 1, 2, 3, 4, 5 and event count = 0, 0, 0, 0, 1, and use a Poisson objective on that (see the sketch after this list). Here, again, you want bagging to select complete individuals.
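
For the second use case, a small sketch (with hypothetical column names) of how such an expansion into yearly records might look; after this, group-aware bagging on id would keep all of an individual's rows together:

```python
import pandas as pd

def expand_to_person_years(subjects):
    """Expand each subject into one record per elapsed year.

    subjects: DataFrame with columns 'id' and 'years_to_event'.
    Returns one row per (id, year), with event = 1 only in the event year.
    """
    rows = []
    for _, s in subjects.iterrows():
        for year in range(1, int(s["years_to_event"]) + 1):
            rows.append({
                "id": s["id"],
                "year": year,
                "event": int(year == s["years_to_event"]),
            })
    return pd.DataFrame(rows)

# An individual with an event after 5 years becomes
# year = 1, 2, 3, 4, 5 with event = 0, 0, 0, 0, 1
print(expand_to_person_years(pd.DataFrame({"id": [1], "years_to_event": [5]})))
```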

bjoernholzhauer avatar Feb 22 '23 12:02 bjoernholzhauer