
Add new trees that leverage setting leaf nodes differently (quantile and honest trees)

Open adam2392 opened this issue 2 years ago • 3 comments

Is your feature request related to a problem? Please describe.

We want to implement a base Cython class that extends our fork of scikit-learn with honest and quantile capabilities.

Describe the solution you'd like

Honest trees:

For each passed-in X, we need to split it again into two sets: one for choosing splits and one for setting leaf nodes. This should probably be extended functionality that can be "turned on" via honest: bool=True.
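A minimal sketch of that splitting step, assuming a hypothetical honest_fraction hyperparameter (the function and parameter names here are illustrative, not the actual API):

```python
# Hypothetical sketch: partition each sample set into a "structure" subset
# (used to choose splits) and an "honest" subset (used to set leaf values).
import numpy as np

def split_honest(X, y, honest_fraction=0.5, random_state=None):
    """Partition (X, y) into disjoint structure and leaf-setting subsets."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(X.shape[0])
    n_struct = int(X.shape[0] * (1 - honest_fraction))
    struct_idx, leaf_idx = idx[:n_struct], idx[n_struct:]
    return (X[struct_idx], y[struct_idx]), (X[leaf_idx], y[leaf_idx])
```

With honest=False the tree would simply skip this step and use all samples for both splitting and leaf estimation.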

Quantile trees:

This would allow us to sketch out what sort of "data" the leaf nodes contain. We also don't want to create a new Cython Tree class for each new "RF" model just to have its "quantile version". Ideally, we can also specify this from Python as quantile: bool=True. This might require some more Cython work where we abstract the "Node" into a "SplitNode" and a "LeafNode", replacing the current "Node" struct sklearn has.

Oblique trees would then create "sub-classes" of the SplitNode struct, and quantile trees would create "sub-classes" of the LeafNode struct. The downside is that this requires refactoring even more of the sklearn code in our fork, which is undesirable.
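A rough Python sketch of the proposed node abstraction (the real implementation would be Cython structs; these names and fields are illustrative only):

```python
# Illustrative only: separating split information from leaf information so
# each can be extended independently (oblique splits, quantile leaves, ...).
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SplitNode:
    feature: int      # feature index tested at this node
    threshold: float  # split threshold
    left: int         # child node ids
    right: int

@dataclass
class LeafNode:
    value: float      # mean response, as in a standard regression tree

@dataclass
class QuantileLeafNode(LeafNode):
    # A quantile tree additionally stores the raw training targets that
    # reached this leaf, so arbitrary quantiles can be queried at predict time.
    y_samples: np.ndarray = field(default_factory=lambda: np.empty(0))
```

An oblique split would analogously subclass SplitNode to hold a weight vector instead of a single feature index.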

Alternatively, we just implement a "QuantileTree" that doesn't require as much refactoring of the sklearn code. We can add some conditional statements that allow it either to fit quantiles or to fit normally. Then, to enable quantile predictions in the new tree models (e.g. oblique and MORF trees), we just have to pass quantile: bool=True.
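As a rough illustration of the quantile=True idea, using only the public sklearn API (the helper names are hypothetical): each leaf keeps the raw training targets that reached it, and prediction takes a quantile over those targets instead of the mean.

```python
# Sketch: quantile prediction on top of a fitted sklearn tree, via `apply`.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_leaf_samples(tree, X, y):
    """Map each leaf id to the training y-values that fall into that leaf."""
    leaf_ids = tree.apply(X)
    return {leaf: y[leaf_ids == leaf] for leaf in np.unique(leaf_ids)}

def predict_quantile(tree, leaf_samples, X, q=0.5):
    """Predict the q-th quantile of the stored samples in each point's leaf."""
    leaf_ids = tree.apply(X)
    return np.array([np.quantile(leaf_samples[leaf], q) for leaf in leaf_ids])
```

The in-tree version would store the samples (or a sketch of them) in the leaf nodes themselves rather than in an external dict, which is what motivates the LeafNode abstraction above.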

Describe alternatives you've considered

See: https://github.com/zillow/quantile-forest

Additional context

Note: by demonstrating this works well with examples and unit tests, this can become more evidence to support refactoring in scikit-learn.

adam2392 avatar Feb 08 '23 19:02 adam2392

Honesty can be enabled either at the Python, or Cython level.

If at the Python level, then we need to expose a cpdef API that the Tree can call to "set the leaves" using a held-out dataset.

If at the Cython level, then we need to expose a hyperparameter kwarg honest: bool=False that splits the data used for splitting and for setting leaf parameters. This would then require generalizing the TreeBuilder and would most likely go into the sklearn fork. However, this complicates the fork: if more breaking changes land in the upstream sklearn tree submodule, syncing the fork becomes more time-consuming.

I am in favor of modifying the fork, though, for the sake of not complicating the tree-building API.

adam2392 avatar Feb 28 '23 21:02 adam2392

Alternatively, doing it in Python will probably be simpler, and we can easily add this to the forked version of scikit-learn:

  1. get subsamples for splitting and leaves
  2. fit the tree using subsample for splitting
  3. set the leaves using subsample for leaves

https://github.com/neurodata/honest-forests/blob/main/honest_forests/estimators/tree.py
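The three steps above can be sketched at the Python level roughly as follows (a hypothetical helper, not the fork's actual code; it relies on tree_.value being a writable view of the leaf values, which it currently is in scikit-learn):

```python
# Sketch of Python-level honesty: fit structure on one half, set leaves
# with the other. All names here are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_honest(X, y, honest_fraction=0.5, random_state=0):
    # 1. get subsamples for splitting and leaves
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(len(X))
    n_struct = int(len(X) * (1 - honest_fraction))
    s, h = idx[:n_struct], idx[n_struct:]
    # 2. fit the tree structure using the splitting subsample
    tree = DecisionTreeRegressor(random_state=random_state).fit(X[s], y[s])
    # 3. overwrite leaf values with means from the held-out subsample
    #    (leaves that receive no held-out samples keep their original values)
    leaf_ids = tree.apply(X[h])
    for leaf in np.unique(leaf_ids):
        tree.tree_.value[leaf, 0, 0] = y[h][leaf_ids == leaf].mean()
    return tree
```

Since predict reads from tree_.value, the returned tree makes honest predictions with no further changes.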

adam2392 avatar Mar 02 '23 06:03 adam2392

@sampan501 and @yuxinB to help tackle honesty in trees and then add causaltree model, which adds the two options of fitting:

  • propensity model
  • double sampling

adam2392 avatar Mar 02 '23 21:03 adam2392