scikit-tree
Add new trees that leverage setting leaf nodes differently (quantile and honest trees)
Is your feature request related to a problem? Please describe.
We want to implement a base Cython class that extends our fork of scikit-learn with honest and quantile capabilities.
Describe the solution you'd like
Honest trees:
For each X passed in, we need to split it again into two sets: one for finding splits and one for setting leaf nodes. This should perhaps be extended functionality that can be "turned on" when honest: bool=True.
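A minimal sketch of that split at the Python level, assuming a hypothetical honest_fraction parameter that controls the size of the leaf-setting set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def honest_split(X, y, honest_fraction=0.5, random_state=None):
    """Split (X, y) into a structure set (used to find splits) and an
    honest set (used only to set leaf-node values)."""
    X_struct, X_honest, y_struct, y_honest = train_test_split(
        X, y, test_size=honest_fraction, random_state=random_state
    )
    return (X_struct, y_struct), (X_honest, y_honest)
```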
Quantile trees:
This would let us control what sort of "data" the leaf nodes contain. We also don't want to create a new Cython Tree class for each new "RF" model just to have its "quantile version". Ideally, we can also specify this from Python as quantile: bool=True. This might require some more Cython work where we abstract "Node" into a "SplitNode" and a "LeafNode", replacing the single "Node" struct sklearn currently has.
Oblique trees would then subclass the SplitNode struct, and quantile trees would subclass the LeafNode struct. The con is that this requires refactoring even more of the sklearn code in our fork, which is undesirable.
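To make the SplitNode/LeafNode idea concrete, here is a rough Python sketch of the proposed node hierarchy (the real version would be Cython structs or extension types; all class and field names here are hypothetical):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SplitNode:
    feature: int      # axis-aligned split feature, as in vanilla sklearn
    threshold: float  # split threshold
    left: int         # index of left child node
    right: int        # index of right child node

@dataclass
class ObliqueSplitNode(SplitNode):
    # oblique trees split on a linear combination of features
    weights: np.ndarray = field(default_factory=lambda: np.empty(0))

@dataclass
class LeafNode:
    value: float  # mean prediction, as in the current Node struct

@dataclass
class QuantileLeafNode(LeafNode):
    # quantile leaves keep the raw training targets so that any
    # quantile can be computed at predict time
    y_train: np.ndarray = field(default_factory=lambda: np.empty(0))
```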
Alternatively, we just implement a "QuantileTree" that doesn't require as much refactoring of the sklearn code. We can add conditional statements that allow it either to fit quantiles or to fit normally. Then, to enable quantile predictions in the new tree models (e.g. oblique and MORF trees), we just pass quantile: bool=True.
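A minimal Python sketch of this conditional approach, subclassing DecisionTreeRegressor and caching the training targets per leaf when quantile=True (the class name and the predict_quantile method are hypothetical; the real implementation would live in Cython):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class QuantileTreeRegressor(DecisionTreeRegressor):
    """Regression tree that can additionally predict quantiles
    when quantile=True, by storing the training targets per leaf."""

    def __init__(self, quantile=False, max_depth=None, random_state=None):
        super().__init__(max_depth=max_depth, random_state=random_state)
        self.quantile = quantile

    def fit(self, X, y):
        super().fit(X, y)
        if self.quantile:
            # map each leaf id to the training targets that fell into it
            leaves = self.apply(X)
            self.leaf_values_ = {
                leaf: np.asarray(y)[leaves == leaf]
                for leaf in np.unique(leaves)
            }
        return self

    def predict_quantile(self, X, q=0.5):
        leaves = self.apply(X)
        return np.array(
            [np.quantile(self.leaf_values_[leaf], q) for leaf in leaves]
        )
```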
Describe alternatives you've considered
See: https://github.com/zillow/quantile-forest
Additional context
Note: by demonstrating that this works well with examples and unit tests, we build evidence to support refactoring in scikit-learn itself.
Honesty can be enabled either at the Python, or Cython level.
If at the Python level, then we need to expose a cpdef API that the Tree can call to "set the leaves" using a held-out dataset.
If at the Cython level, then we need to expose a hyperparameter kwarg honest: bool=False that splits the data used for splitting and for setting leaf parameters. This would require generalizing the TreeBuilder and would most likely go into the sklearn fork. However, this does complicate the fork: if more breaking changes land in the upstream sklearn tree submodule, syncing the fork becomes more time-consuming.
I am in favor of modifying the fork, though, for the sake of not complicating the tree-building API.
Alternatively, doing it in Python will probably be simpler, and we can add this easily to the forked version of scikit-learn:
- get subsamples for splitting and leaves
- fit the tree using subsample for splitting
- set the leaves using subsample for leaves
https://github.com/neurodata/honest-forests/blob/main/honest_forests/estimators/tree.py
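Those three steps might look roughly like this in Python, in the spirit of the honest-forests implementation linked above (all function names and the external leaf_values mapping are assumptions for illustration; the real version would write the honest values back into the tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_honest_tree(X, y, honest_fraction=0.5, random_state=0, **tree_kwargs):
    """Fit the tree structure on one subsample and re-estimate the
    leaf values on a held-out honest subsample."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(X))
    n_struct = len(X) - int(honest_fraction * len(X))
    struct_idx, honest_idx = idx[:n_struct], idx[n_struct:]

    # 1) + 2) fit the tree using the subsample for splitting
    tree = DecisionTreeRegressor(random_state=random_state, **tree_kwargs)
    tree.fit(X[struct_idx], y[struct_idx])

    # 3) set the leaves using the honest subsample
    honest_leaves = tree.apply(X[honest_idx])
    leaf_values = {
        leaf: y[honest_idx][honest_leaves == leaf].mean()
        for leaf in np.unique(honest_leaves)
    }
    return tree, leaf_values

def honest_predict(tree, leaf_values, X):
    leaves = tree.apply(X)
    # fall back to the structure-based prediction for any leaf that
    # received no honest samples
    fallback = tree.predict(X)
    return np.array([leaf_values.get(l, f) for l, f in zip(leaves, fallback)])
```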
@sampan501 and @yuxinB will help tackle honesty in trees and then add a causal tree model, which adds two fitting options:
- propensity model
- double sampling