LinXGBoost

Regress only on a subset of features

Open guyguyguy1234 opened this issue 4 years ago • 4 comments

I am not sure how difficult this would be to implement. Would it be possible to add an option to regress in the leaves on only a subset of the features? E.g. with 3 features f1, f2, f3, the xgb trees would be built using f1 and f2 (or all of them), but the linear regression in the leaves would use f2 and f3 (note the overlap on f2).

Thanks.

guyguyguy1234 avatar Jan 02 '21 02:01 guyguyguy1234
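
To make the request concrete, here is a minimal pure-Python sketch of the idea, not LinXGBoost code. It assumes a single hand-picked split on f1 (a stump rather than a boosted ensemble) and fits an ordinary least-squares model on f2 and f3 in each leaf. The names `solve`, `fit_leaf`, `fit_stump` and `predict`, and the `reg_features` argument, are all hypothetical.

```python
# Illustrative sketch only: none of these names exist in LinXGBoost.

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_leaf(rows, y, reg_features):
    """Least-squares fit of y on the selected columns plus an intercept.

    Fails with a division by zero if the selected columns are collinear.
    """
    X = [[1.0] + [row[j] for j in reg_features] for row in rows]
    k = len(X[0])
    XtX = [[sum(x[a] * x[b] for x in X) for b in range(k)] for a in range(k)]
    Xty = [sum(x[a] * t for x, t in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)


def fit_stump(rows, y, split_feature, threshold, reg_features):
    """Partition on one feature, then regress in each leaf on reg_features."""
    model = {}
    for side in ("left", "right"):
        idx = [i for i, r in enumerate(rows)
               if (r[split_feature] <= threshold) == (side == "left")]
        model[side] = fit_leaf([rows[i] for i in idx],
                               [y[i] for i in idx], reg_features)
    return model


def predict(model, row, split_feature, threshold, reg_features):
    w = model["left"] if row[split_feature] <= threshold else model["right"]
    return w[0] + sum(wj * row[j] for wj, j in zip(w[1:], reg_features))


# Toy data: columns are (f1, f2, f3). The split uses f1 (index 0) only;
# the leaf regressions use f2 and f3 (indices 1 and 2) only.
rows = [(f1, float(f2), float(f3))
        for f1 in (0.1, 0.9) for f2 in range(3) for f3 in range(3)]
y = [2 * f2 + f3 if f1 <= 0.5 else 1 - f2 + 3 * f3 for f1, f2, f3 in rows]

model = fit_stump(rows, y, split_feature=0, threshold=0.5, reg_features=[1, 2])
```

On this noise-free toy data each leaf recovers its own coefficients for f2 and f3, while f1 only decides which leaf a row falls into. A real implementation would also need regularization and a way to choose the splits.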

That's possible. It is pretty easy to use a subset of features for the construction of the trees. For the linear regression, it is also likely to be straightforward. However, if I had to change something, I would try to speed up LinXGBoost by pre-processing the data.

ldv1 avatar Jan 07 '21 20:01 ldv1

It seems to me that not having this feature will cause significant problems when the dataset is not large enough to guarantee that each leaf holds enough points for a meaningful linear regression (one that does not overfit), especially when the number of features is large.

guyguyguy1234 avatar Jan 14 '21 21:01 guyguyguy1234

LinXGBoost is a cool idea and can be really useful in some use cases. Quick question: since it is possible to use only a subset of features for the linear regression at the leaf nodes, is it also possible to regress against a special feature that is not used in tree building at all? For example, say we have three normal features f1, f2, f3 for tree building, but at each leaf node we do a linear, or maybe even higher-order polynomial, fit against a special feature f4. Do you think that is possible in LinXGBoost @ldv1 ? Some hyper-parameters would need to be introduced, such as the minimum child node size for performing the regression, or the minimum range of the special feature required to perform the regression, etc. I understand that from a pure ML and data processing point of view this might sound silly, but in certain use cases it is something that can really help.

zanemarkson avatar Sep 25 '21 17:09 zanemarkson
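
The two hyper-parameters proposed above can be sketched as a guard around a per-leaf fit. This is a hypothetical illustration in pure Python, not an existing LinXGBoost option: `fit_leaf_on_special`, `min_leaf_size` and `min_range` are made-up names, and the fallback to a constant leaf (the mean of the targets) is an assumption about what a sensible default might be.

```python
# Illustrative sketch only: hypothetical names, not part of LinXGBoost.

def fit_leaf_on_special(x, y, min_leaf_size=5, min_range=1.0):
    """Fit y ~ intercept + slope * x inside one leaf.

    x holds the special feature (f4 in the comment above) for the points
    that the tree routed to this leaf; the tree itself never split on x.
    Returns (intercept, slope). Falls back to a constant leaf, i.e.
    (mean of y, 0.0), when the leaf has too few points or x spans too
    narrow a range for the slope to be trusted.
    """
    n = len(x)
    mean_y = sum(y) / n
    span = max(x) - min(x)
    if n < min_leaf_size or span < min_range or span == 0:
        return mean_y, 0.0
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Enough points and enough spread: a line is fitted.
wide = fit_leaf_on_special([0, 1, 2, 3, 4, 5], [1, 3, 5, 7, 9, 11])

# Too few points: constant leaf.
small = fit_leaf_on_special([0, 1], [1, 3])

# Enough points but the special feature barely varies: constant leaf.
narrow = fit_leaf_on_special([0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
                             [2, 2, 2, 2, 2, 2])
```

A polynomial fit would follow the same pattern, with the guard presumably tightened as the degree grows, since more coefficients need more points per leaf.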

For example, creating a repeated transaction index on each leaf and using other features like lat and long to find the best geographic partitions. I was able to do this by adapting the Linear Model Tree from Logan Dillard, and it worked like a charm, but it was restricted to a single tree. It split the US in very intuitive ways, with very distinct value indices for the coasts, the Midwest and the South. I have been waiting for LinXGBoost or LightGBM to offer this feature in the Python environment. I am not sure whether other packages, like Cubist in R, offer this flexibility. Please let me know if you find one with this feature!

cc22226 avatar Jan 04 '23 19:01 cc22226