
Feature request: Send null features down both branches

Open • aeftimia opened this issue 3 years ago • 5 comments

My understanding is that XGBoost currently handles null values as a separate datatype, and will split on whether the data is null or not before moving on to what the value is if not null. This is very reasonable if the existence of a null value has predictive power over the target variable, or even if it is uncorrelated. But consider a demand forecasting problem in which a feature went missing around peak season. An XGBoost model built on this data could learn that missing values of this feature indicate high demand. If the feature being null were a boolean variable, we would of course discard it to prevent the model from reaching this conclusion, but XGBoost's approach effectively forces the use of this feature. Since XGBoost is built on decision trees, it is in a unique position to handle null values with a modified tree learning algorithm. The idea is that rows with a null feature should be sent down both branches of a split, weighted by the proportion of rows with non-null data that went down each branch.

Of course, this only works if non-null values of the feature generally provide intended predictive power. Going back to the forecasting example, if a feature went live during peak season, then went back offline, null values of the feature could still become an indicator of high demand. That said, dropping the feature entirely in this scenario is probably more sensible. Overall, I'd rather have the option of dropping a mostly useless feature than let a rare instance of its absence potentially spoil its predictive power should it go missing again.
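To make the proposal concrete, here is a minimal sketch of what prediction could look like under this scheme. The `Node` class, its `frac_left` field, and the blending rule are hypothetical illustrations, not anything XGBoost does today:

```python
import math

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 frac_left=0.5, value=0.0):
        self.feature = feature      # feature index to split on (None for a leaf)
        self.threshold = threshold  # split threshold
        self.left = left            # child for feature value < threshold
        self.right = right          # child for feature value >= threshold
        self.frac_left = frac_left  # fraction of non-null training rows sent left
        self.value = value          # leaf value

def predict(node, x):
    """Route a row through the tree; nulls go down BOTH branches, weighted."""
    if node.feature is None:
        return node.value
    v = x[node.feature]
    if v is None or (isinstance(v, float) and math.isnan(v)):
        # Proposed behavior: blend both children by the non-null proportions.
        return (node.frac_left * predict(node.left, x)
                + (1 - node.frac_left) * predict(node.right, x))
    return predict(node.left if v < node.threshold else node.right, x)

# Toy tree: split on feature 0 at 3.0; 70% of non-null training rows went left.
tree = Node(feature=0, threshold=3.0, frac_left=0.7,
            left=Node(value=1.0), right=Node(value=5.0))
print(predict(tree, [2.0]))   # 1.0 (non-null, goes left)
print(predict(tree, [None]))  # 0.7 * 1.0 + 0.3 * 5.0 = 2.2 (blended)
```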

aeftimia avatar Sep 17 '22 14:09 aeftimia

XGBoost doesn't treat the null value as a separate type (although I'm not entirely sure what "separate datatypes" means here). Whether the missing value should go to the left branch or the right branch of a split depends on which direction gives more gain. This is decided per node.
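For reference, the per-node decision works roughly like this (a simplified sketch of the sparsity-aware split finding described in the XGBoost paper, not the actual implementation): each candidate split is scored twice, once with the missing-value gradient statistics assigned to the left child and once with them assigned to the right, and the better direction becomes the node's default:

```python
def gain(gl, hl, gr, hr, lam=1.0):
    # Second-order split gain, omitting the constant parent term and gamma.
    return gl * gl / (hl + lam) + gr * gr / (hr + lam)

def choose_default_direction(g_left, h_left, g_right, h_right, g_miss, h_miss):
    # g_*/h_* are gradient/hessian sums over rows with non-missing values on
    # each side; g_miss/h_miss are the sums over rows where the feature is missing.
    miss_left = gain(g_left + g_miss, h_left + h_miss, g_right, h_right)
    miss_right = gain(g_left, h_left, g_right + g_miss, h_right + h_miss)
    if miss_left >= miss_right:
        return "left", miss_left
    return "right", miss_right

print(choose_default_direction(-4.0, 3.0, 2.0, 2.0, -1.5, 1.0))  # ('left', ...)
```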

trivialfis avatar Sep 18 '22 17:09 trivialfis

Oh, my understanding was that xgboost has a separate split for whether the data is null (i.e. treats it as a separate "null type", and if not null, proceeds to deal with the value as a normal float). Based on your description, it sounds like branches are determined without using null values, and then null values are assigned to the branch that minimizes loss. I think this would still have the same pitfall in the scenario I described. So the request remains: an option to send null values down both branches in a smart way, to avoid overfitting on the existence of null values.

aeftimia avatar Sep 18 '22 17:09 aeftimia

It's interesting to think about how to modify the algorithm to combat overfitting. Thank you for the reference; I will look deeper into the wiki and see the relationship between the proposed modification and XGBoost's tree split. In the meantime, please feel free to look into XGBoost's implementation: https://github.com/dmlc/xgboost/blob/70df36c99cf7112319a02c25f7aac3f39a718f33/src/tree/hist/evaluate_splits.h#L211

trivialfis avatar Sep 19 '22 21:09 trivialfis

Thanks for the link, and I really appreciate you looking into the feature request! The gist of the modification would be to associate null values with both branches, but weight each possibility by the proportion of non-null instances sent down each branch. The intuition is that, conditioned on the previous branches, a null value would have a probability of being sent down either branch equal to the proportion of non-null examples sent down those same branches.

I'm not well versed in C++, but if you point me to where null values are handled during training and inference I could take a crack at making the appropriate changes.
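For concreteness, here is a hypothetical sketch of how that weighting could enter split evaluation, using hessian mass as a stand-in for instance counts (the two coincide for squared error, where every row's hessian is 1). The function names and the exact apportioning rule are assumptions on my part, not existing XGBoost code:

```python
def gain(gl, hl, gr, hr, lam=1.0):
    return gl * gl / (hl + lam) + gr * gr / (hr + lam)

def proportional_split_gain(g_left, h_left, g_right, h_right, g_miss, h_miss):
    # Apportion the missing-value gradient mass to both children in proportion
    # to the non-null mass on each side, instead of all-left or all-right.
    w = h_left / (h_left + h_right)  # share of non-null mass going left
    return gain(g_left + w * g_miss, h_left + w * h_miss,
                g_right + (1 - w) * g_miss, h_right + (1 - w) * h_miss)

print(proportional_split_gain(-4.0, 3.0, 2.0, 2.0, -1.5, 1.0))
```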


aeftimia avatar Sep 20 '22 15:09 aeftimia

The split evaluation function works on a gradient histogram (which is an approximation of the actual gain function). As a result, missing values are represented implicitly, as the difference between the parent node's gradient sum and the sums accumulated over the histogram bins (the bins only contain rows with non-missing values). Perhaps the one-hot split is easier to understand at first: https://github.com/dmlc/xgboost/blob/70df36c99cf7112319a02c25f7aac3f39a718f33/src/tree/hist/evaluate_splits.h#L64

Related doc: https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html
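To illustrate the subtraction trick in isolation (a toy Python example, not the actual C++): the histogram bins only accumulate rows where the feature is present, so the missing-value statistics fall out as the parent totals minus the sum over all bins:

```python
# Per-bin (gradient, hessian) sums for one feature at one node; rows with a
# missing value for this feature appear in no bin.
bins = [(-1.0, 2.0), (0.5, 1.0), (2.0, 3.0)]
parent = (3.0, 8.0)  # gradient/hessian sums over ALL rows reaching this node

g_present = sum(g for g, _ in bins)   # 1.5
h_present = sum(h for _, h in bins)   # 6.0
g_miss = parent[0] - g_present        # 1.5
h_miss = parent[1] - h_present        # 2.0
print(g_miss, h_miss)
```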

trivialfis avatar Sep 21 '22 07:09 trivialfis