
Some questions about `prune_tree`.

Eight1911 opened this issue • 17 comments

The `prune_tree` function for classification trees internally calls `_prune_run`, which computes purity using the zero-one loss. However, the trees themselves are built using entropy as the purity criterion. I'm not sure whether this is intentional or a bug.

The latter can be fixed easily, but we could also address the more general problem and make `prune_tree` criterion-agnostic by storing the node's purity in a struct field (it is already a byproduct of tree building) and having the function refer to that field instead of recomputing it. This would also let the same `prune_tree` function work on both regression and classification trees.
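Roughly something like this (a minimal sketch; the type and field names are made up, not the actual DecisionTree.jl structs):

```julia
# Minimal sketch, not the real DecisionTree.jl types: cache the purity that was
# already computed while building, so pruning can reuse it for any criterion.
struct PurityLeaf{T}
    majority :: T            # predicted label
    values   :: Vector{T}    # training labels that reached this leaf
    purity   :: Float64      # entropy, gini, -variance, ... recorded at build time
end

# Pruning could then compare cached purities directly instead of recomputing
# a zero-one loss, regardless of which criterion built the tree.
```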

Eight1911 avatar Jul 09 '18 08:07 Eight1911

Yes, good catch, there is a discrepancy between the purity measures used in tree building and tree pruning. It would be nice to make `prune_tree` criterion-agnostic, but I'm a bit hesitant to add a new field to `Node`. We want to keep the trees as light as possible (in memory and on disk), and the computation needed to re-calculate node purities is tiny compared to node splitting. So I'm not sure this is a good trade-off.

bensadeghi avatar Jul 09 '18 09:07 bensadeghi

@bensadeghi can you recommend a dataset for testing the memory usage when writing trees to disk?

Eight1911 avatar Jul 10 '18 09:07 Eight1911

The Adult dataset, see the test/classification folder. Try a fully grown tree, a forest of 10 trees, and an AdaBoost ensemble of 10 iterations.

bensadeghi avatar Jul 10 '18 09:07 bensadeghi

@bensadeghi Sadly, neither BSON nor JLD currently supports Julia 0.7-beta, so I think we will have to wait a little before the new `Leaf` struct can be implemented.

How do you feel about making all arguments aside from `features` and `labels` in `build_tree`/`build_stump`/`build_adaboost_stumps` keyword arguments? It gets a little awkward having to specify the earlier arguments just to use the later ones.
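For instance (purely illustrative; the parameter names, their order, and the defaults shown are assumptions for this example, not the actual native API):

```julia
using DecisionTree

features = rand(10, 3)
labels   = rand(["a", "b"], 10)

# Current native API (positional; parameter meanings are assumed here):
model = build_tree(labels, features, 0, 6)        # n_subfeatures = 0 just to reach max_depth = 6

# Proposed keyword form (does not exist; shown only to illustrate the ergonomics):
# model = build_tree(labels, features; max_depth = 6)
```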

Eight1911 avatar Jul 11 '18 18:07 Eight1911

@Eight1911 The native API has been around for a while, and there are quite a few packages using it, see bottom of this page in the "Used By" section. Adding keyword arguments would break all those packages, and other people's code. We could add deprecation warnings to help with the transition, but I find all this to be unnecessary since we already have the ScikitLearn API supporting keyword arguments and accompanying help docs. I say we just leave the native API as is.

On a completely different topic, is there a reason why certain fields in `_split` and `fit` are hard-coded to type `Int64`? I ask because this is causing issues on 32-bit systems. I'm thinking of changing them to type `Int`, which is an alias for either `Int32` or `Int64` depending on the platform's word size, and testing it all on an ARM v7 32-bit processor.

bensadeghi avatar Jul 14 '18 09:07 bensadeghi

Hmm, now that you say it, I agree with you on keeping the current packages backwards compatible. There's really no reason to force them to be `Int64`. As a quick fix, you could change it to `Int`. But since `Int32` only goes up to around 2 billion, and we may have more data points than that, may I suggest that we change the indices to have type `M` where `M <: Integer`?
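Something along these lines (a sketch; the function name is illustrative, not the actual internals):

```julia
# Sketch: index handling parameterized on any Integer subtype, so 32-bit
# platforms can use Int32 while very large datasets can still opt into Int64.
function partition_indices(indices::AbstractVector{M}, goes_left::AbstractVector{Bool}) where {M <: Integer}
    left, right = M[], M[]
    for (i, keep) in zip(indices, goes_left)
        keep ? push!(left, i) : push!(right, i)
    end
    return left, right
end

partition_indices(Int32[1, 2, 3, 4], [true, false, true, false])   # works with Int32 indices
```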

If so, I'll send a patch in within a couple of days to fix this issue and the current issue with `prune_tree`.

Eight1911 avatar Jul 14 '18 10:07 Eight1911

@Eight1911 Perfect, thanks!

bensadeghi avatar Jul 14 '18 10:07 bensadeghi

Do you think `Float64` will pose a problem for 32-bit architectures too? If so, can you suggest an architecture-dependent fixed-size alternative like `Int`?

Eight1911 avatar Jul 14 '18 18:07 Eight1911

Now that you mention it, perhaps we should extend all hard-coded `Float64` types to subtypes of `AbstractFloat`. This hasn't been an issue before, since `Float64` is the default float regardless of architecture, but it would be good to support lower-precision floats.
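For example (a sketch of the idea, not the actual code paths):

```julia
# Sketch: a variance helper parameterized on AbstractFloat instead of Float64,
# so Float32 labels keep their precision and memory footprint throughout.
function node_variance(labels::AbstractVector{T}) where {T <: AbstractFloat}
    m = sum(labels) / T(length(labels))
    return sum(abs2, labels .- m) / T(length(labels))
end

node_variance(Float32[1.0, 2.0, 3.0])   # returns a Float32, no silent promotion to Float64
```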

bensadeghi avatar Jul 15 '18 01:07 bensadeghi

I've added a couple of tests covering `Int32` runs for classification and regression, added the adult and iris datasets to the test suite, and also updated the random and digits tests.

Regarding the new `prune_tree` implementation, I find it a bit unintuitive having to enter negative `purity_thresh` values for the pruning to take effect. Perhaps we should revisit this.

bensadeghi avatar Jul 17 '18 12:07 bensadeghi

Discrete entropy is bounded by the log of the number of possible states, so we can squish the range of classification's pruning purity into [0, 1]. However, I don't know how I would solve this problem for regression, since the criterion used there is unbounded. (Perhaps we could squish it with a sigmoid, but that would be far too unintuitive.)
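For the classification side, the rescaling is just a division by that upper bound; a quick sketch, assuming class proportions as input:

```julia
# Sketch: entropy rescaled to [0, 1] by its maximum log(k), with k = number of classes.
# A purity in [0, 1] could then be defined as 1 - normalized_entropy(p).
function normalized_entropy(p::AbstractVector{<:Real})
    k = length(p)
    k <= 1 && return 0.0                                # a single class is already pure
    h = sum(q > 0 ? -q * log(q) : 0.0 for q in p)
    return h / log(k)
end

normalized_entropy([0.5, 0.5])   # 1.0, maximally impure
normalized_entropy([1.0, 0.0])   # 0.0, pure
```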

Do you have any suggestions for fixing this in an intuitive way? Once the user understands the model, I feel that using the actual purity used for splitting, without any further transformation, is the most natural approach. Perhaps we could add a warning whenever a positive purity threshold is given, explaining the criterion used.

P.S. Sorry I'm a little late with the PRs. I've been super busy with other work.

Eight1911 avatar Jul 17 '18 14:07 Eight1911

I'm all for using a [0, 1] range for pruning purity. To maintain continuity in `prune_tree`'s usage, we would need a `purity_thresh` of 1.0 to represent pure leaves (i.e. no pruning), and decreasing values to represent less pure leaves. For the regression case, any mapping to [0, 1] that works is fine, be it a sigmoid or even a linear function.

bensadeghi avatar Jul 18 '18 04:07 bensadeghi

I had a go at using a normalized entropy measure to map `pruning_thresh` to the range [0, 1], which worked, but its usability was still not intuitive. The same goes for using a sigmoid function for the regression case of `prune_tree`.

For now, I've commented out the new implementations of `prune_tree` and reverted to the original. See PR #86.

With the official release of Julia v0.7 and an RC for Julia v1.0, I would like to release DT v0.8.1. In the process, I'll create a new branch for julia-0.6 and merge the julia-0.7 branch into master.

bensadeghi avatar Aug 09 '18 12:08 bensadeghi

I have an idea. How about we change the purity from the negative of the variance to the proportion of explained variance? So instead of using `-var(leaf)` as the purity, we use `(var(tree) - var(leaf)) / var(tree)`. This is a linear change and will always be between 0 and 1 (with a heavy bias towards 0).
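A quick sketch of that mapping (the function name is just for illustration):

```julia
# Sketch: regression purity as the proportion of the root's variance explained
# by a leaf, giving a [0, 1] scale that is linear in var(leaf).
explained_variance_purity(var_leaf, var_tree) =
    var_tree > 0 ? (var_tree - var_leaf) / var_tree : 1.0

explained_variance_purity(0.0, 4.0)   # 1.0: a perfectly pure leaf
explained_variance_purity(4.0, 4.0)   # 0.0: the leaf explains nothing
```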

Eight1911 avatar Sep 21 '18 01:09 Eight1911

@Eight1911 It's worth a try to see how the above approach affects how the pruning exercise "feels". You can test it out using the Iris pruning runs. The current implementation feels natural and intuitive; if a similar experience can be replicated for regression, that'd be great.

bensadeghi avatar Sep 21 '18 06:09 bensadeghi

@bensadeghi I realized that this kind of pruning can be done during tree building. We can add another argument to `build_tree`, called `min_impurity_split`, which would handle this and be much faster because

  1. we don't need to recurse through each leaf in the tree again, and
  2. we don't need to expand the nodes that we are going to prune out anyway.

In fact, we can do this at almost no computational cost, since at each level the impurity of the leaves has already been calculated to decide the split. Moreover, with this version of pruning, the input thresholds automatically match the criterion, which makes it compatible with user-provided criteria, whose functionality I'm planning to expose soon.
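A sketch of where the check would sit (`min_impurity_split` is the proposed argument and `should_expand` is a made-up helper, not existing code):

```julia
# Sketch of the proposed pre-pruning hook: the node's impurity is already known
# when the split is scored, so the check costs essentially nothing extra.
should_expand(impurity; min_impurity_split = 0.0, depth = 0, max_depth = typemax(Int)) =
    impurity > min_impurity_split && depth < max_depth

should_expand(0.05; min_impurity_split = 0.1)   # false: make a leaf instead of splitting further
```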

I think the `prune_tree` function should be reserved for actual post-pruning, where you supply a validation set and prune out the leaves whose validation accuracy, with respect to some loss function, is lower than some proportion of the training accuracy or some fixed user-input value.
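One possible variant, sketched over plain label vectors rather than the actual tree types (all names here are hypothetical):

```julia
# Sketch of validation-based post-pruning: keep a split only if it beats the
# merged node's zero-one accuracy on held-out labels.
function keep_split(val_left, val_right, pred_left, pred_right, pred_merged)
    n = length(val_left) + length(val_right)
    acc_split  = (count(==(pred_left), val_left) + count(==(pred_right), val_right)) / n
    acc_merged = count(==(pred_merged), vcat(val_left, val_right)) / n
    return acc_split > acc_merged       # false means prune: collapse the two children
end

keep_split(["a", "a"], ["b", "b"], "a", "b", "a")   # true: the split helps on the validation data
```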

What do you think?

Eight1911 avatar Sep 25 '18 07:09 Eight1911

@Eight1911 I don't think we need another pre-pruning criterion. Note that scikit-learn is deprecating `min_impurity_split` in favor of `min_impurity_decrease`.

I'm personally quite comfortable with the current implementation of `prune_tree` for post-pruning purposes. Its leaf-purity definition is intuitive, regardless of which splitting criterion was used to build the tree.

We could instead focus our attention on more impactful development activities, i.e. optimizing build and apply execution times, or adding AdaBoost functionality for regression (since regression trees now support weights).

bensadeghi avatar Sep 26 '18 04:09 bensadeghi