DecisionTree.jl
Some questions about `prune_tree`.
The `prune_tree` function for classification trees internally calls `_prune_run`, which computes purity using the zero-one loss. However, decision trees are built using entropy purity. I'm not sure if this is done on purpose or if it's a bug.
The latter can be fixed easily, but we might also address the more general problem and make `prune_tree` criterion-agnostic by storing the purity of the node in a struct field (which is already a byproduct of tree building) and, instead of recomputing the node purity, have the function refer to that field. This would also make the same `prune_tree` function work on both regression and classification trees.
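For concreteness, a minimal sketch of what that could look like; the struct layouts and the `purity` field below are illustrative assumptions, not DecisionTree.jl's actual definitions:

```julia
# Hypothetical sketch: store the purity computed during building on each
# node, so pruning can reuse it for any criterion (entropy, gini, variance).
struct Leaf{T}
    majority :: T
    values   :: Vector{T}
    purity   :: Float64   # assumed new field, filled in during building
end

struct Node{S, T}
    featid  :: Int
    featval :: S
    left    :: Union{Leaf{T}, Node{S, T}}
    right   :: Union{Leaf{T}, Node{S, T}}
    purity  :: Float64    # assumed new field
end

# Criterion-agnostic pruning check: compare the stored purity to the
# threshold instead of recomputing it with a fixed loss.
prunable(n::Node, purity_thresh) =
    n.left isa Leaf && n.right isa Leaf && n.purity >= purity_thresh
```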
Yes, good catch, there is a discrepancy between the purity measures used in tree building and tree pruning.
It would be nice to make `prune_tree` criterion-agnostic, but I'm a bit hesitant about adding a new field to `Node`. We want to keep the trees as light as possible (in memory and on disk), and the computation needed to recalculate node purities is tiny compared to node splitting. So I'm not sure if this is a good trade-off.
@bensadeghi can you recommend a dataset for testing the memory usage when writing trees to disk?
Adult dataset, see the test/classification folder. Try a fully grown tree, a forest of 10 trees, and an AdaBoost of 10 iterations.
@bensadeghi Sadly, neither BSON nor JLD currently supports 0.7-beta, so I think we will have to wait a little before the new `Leaf` struct can be implemented.
How do you feel about making all arguments aside from `features` and `labels` in `build_tree`/`build_stump`/`build_adaboost_stumps` keyword arguments, since it gets a little weird having to specify the earlier arguments just to use the later ones?
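For illustration, here is roughly how the two styles compare (the positional argument names and order below are assumptions, not the exact native API):

```julia
# With positional arguments, setting a late option forces you to spell out
# all earlier ones (0 and -1 stand in for n_subfeatures and max_depth here):
model = build_tree(labels, features, 0, -1, 5)   # 5 = min_samples_leaf

# With keyword arguments, only the option of interest would be needed:
model = build_tree(labels, features; min_samples_leaf = 5)
```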
@Eight1911 The native API has been around for a while, and there are quite a few packages using it, see bottom of this page in the "Used By" section. Adding keyword arguments would break all those packages, and other people's code. We could add deprecation warnings to help with the transition, but I find all this to be unnecessary since we already have the ScikitLearn API supporting keyword arguments and accompanying help docs. I say we just leave the native API as is.
On a completely different topic, is there a reason why certain fields in `_split` and `fit` are hard-coded to type `Int64`? I ask this because it is causing issues on 32-bit OSes.
I'm thinking of changing them to type `Int`, which is just an alias for `Int32` or `Int64` depending on the platform, and testing it all on an ARM v7 32-bit processor.
Hmm, now that you say it, I agree with you on keeping the current packages backwards compatible. There's really no reason to force them to be `Int64`. As a quick fix, you can change it to `Int`. But since `Int32` only goes up to around 2 billion, and since we may have more data points than that, may I suggest that we change the indices to have type `M` where `M <: Integer`?
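A toy sketch of the parametric approach (the function is illustrative, not the package's internals):

```julia
# Index handling parameterized on any Integer subtype instead of a
# hard-coded Int64; the result keeps the caller's index type.
function middle_index(indices::AbstractVector{M}) where {M <: Integer}
    return indices[(firstindex(indices) + lastindex(indices)) ÷ 2]
end

middle_index(Int32.(1:10))   # works with Int32 indices on 32-bit systems
middle_index(collect(1:10))  # Vector{Int} (Int64 on 64-bit systems)
```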
If yes, I will send in a patch within a couple of days to fix this issue and the current issue with `prune_tree`.
@Eight1911 Perfect, thanks!
Do you think `Float64` will pose a problem for 32-bit architectures too? If that's the case, can you suggest an architecture-dependent alternative like `Int`?
Now that you mention it, perhaps we should extend all hard-coded `Float64` types to subtypes of `AbstractFloat`. This hasn't been an issue before, since `Float64` is the default float regardless of architecture, but it would be good to support lower-precision floats.
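A sketch of what that generalization might look like (the `entropy` helper below is illustrative, not the package's implementation):

```julia
# Toy entropy that works for any AbstractFloat instead of hard-coded Float64.
function entropy(probs::AbstractVector{T}) where {T <: AbstractFloat}
    H = zero(T)
    for p in probs
        p > 0 && (H -= p * log(p))   # log preserves the element type
    end
    return H
end

entropy([0.5, 0.5])          # Float64 (the default)
entropy(Float32[0.5, 0.5])   # Float32, for lower-precision runs
```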
I've added a couple of tests covering `Int32` runs, for classification and regression, added the adult and iris datasets to the test suite, and also updated the random and digits tests.
Regarding the new `prune_tree` implementation, I find it a bit unintuitive having to enter negative `purity_thresh` values for the pruning to take effect. Perhaps we should revisit this.
Discrete entropy is bounded by the log of the number of possible states, so we can squish the range of classification's pruning purity into [0, 1]. However, I don't know how I would solve this problem for regression, since the criterion used is unbounded. (Perhaps we could squish it using a sigmoid, etc., but that would be way too unintuitive.)
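To make the classification squish concrete: dividing the entropy by `log(K)` for `K` classes maps it into [0, 1]. A sketch, assuming class counts as input:

```julia
# Normalized entropy: H(p) / log(K) lies in [0, 1], since discrete entropy
# is bounded above by the log of the number of possible states K.
function normalized_entropy(class_counts::AbstractVector{<:Integer})
    n, K = sum(class_counts), length(class_counts)
    K == 1 && return 0.0            # a single class is perfectly pure
    H = 0.0
    for c in class_counts
        c == 0 && continue
        p = c / n
        H -= p * log(p)
    end
    return H / log(K)
end

normalized_entropy([5, 5])    # 1.0 — maximally impure
normalized_entropy([10, 0])   # 0.0 — pure
```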
Do you have any suggestions about how this should be fixed in an intuitive way? Once the user understands the model, I feel that using the actual purity used for splitting, without further transformation, is the most natural approach. Perhaps we could add a warning, whenever a positive purity threshold is given, that explains the criterion used.
P.S. Sorry I'm a little late with the PR's. I've been super busy with other work.
I'm all for using a [0, 1] range for pruning purity. To maintain continuity in `prune_tree`'s usage, we would need a `purity_thresh` of 1.0 to represent pure leaves (i.e. no pruning), and decreasing values to represent less pure leaves.
For the regression case, any mapping to [0, 1] that works is fine, be it sigmoid or even a linear function.
I had a go at using a normalized entropy measure to map `pruning_thresh` to the range [0, 1], which worked, but its usability was still not intuitive. The same goes for using the sigmoid function for the regression case of `prune_tree`.
For now, I've commented out the new implementations of `prune_tree` and reverted to the original. See PR #86.
With the official release of Julia v0.7 and an RC for Julia v1.0, I would like to release DT v0.8.1. In the process, I'll create a new branch for `julia-0.6` and merge the `julia-0.7` branch into master.
I've an idea. How about we change the purity from the negative of the variance to the proportion of explained variance? So instead of using `-var(leaf)` as the purity, we use `(var(tree) - var(leaf)) / var(tree)`. This is a linear change and will always be between 0 and 1 (with a huge bias towards 0).
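A sketch of the proposal (the helper name is hypothetical, and `var(tree)` is taken to mean the variance of all training targets):

```julia
using Statistics

# Proposed regression purity: proportion of explained variance, in [0, 1].
explained_variance_purity(leaf_targets, all_targets) =
    (var(all_targets) - var(leaf_targets)) / var(all_targets)

y    = randn(1000)                   # all training targets, i.e. var(tree)
leaf = y[y .> 1.0]                   # a hypothetical tight leaf
explained_variance_purity(leaf, y)   # close to 1 for a low-variance leaf
```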
@Eight1911 It's worth a try to see how the above approach affects how the pruning exercise "feels". You can test it out using the Iris pruning runs. The current implementation feels natural and intuitive; if a similar experience can also be replicated for regression, then that'd be great.
@bensadeghi I realized that the work of `prune_tree` can be done during tree building. We can add another argument to `build_tree` called `min_impurity_split`, which should handle this kind of thing and be much faster, because
- we don't need to recurse through each leaf in the tree again, and
- we don't need to expand the nodes that we are going to prune out anyway.
In fact, we can do this at almost no computational cost, since at each level the impurity of the leaves has already been calculated to decide the split. Moreover, with this version of pruning, the input thresholds automatically match the criterion. This makes it compatible with user-input criteria, whose functionality I'm planning to expose soon.
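A minimal runnable sketch of the idea on a toy 1-D regression tree (hypothetical code, not DecisionTree.jl's actual builder):

```julia
using Statistics

# Toy pre-pruning during building: stop expanding a node as soon as its
# impurity (here, variance) falls below min_impurity_split. The impurity
# is already available at each level, so the check is essentially free.
function build(x::Vector{Float64}, y::Vector{Float64}, min_impurity_split, depth)
    impurity = length(y) > 1 ? var(y) : 0.0
    if impurity <= min_impurity_split || depth == 0
        return mean(y)                       # leaf: never expand pruned nodes
    end
    t = median(x)                            # toy split point
    mask = x .<= t
    (all(mask) || !any(mask)) && return mean(y)
    return (thresh = t,
            left  = build(x[mask],   y[mask],   min_impurity_split, depth - 1),
            right = build(x[.!mask], y[.!mask], min_impurity_split, depth - 1))
end

x = collect(range(0.0, 1.0, length = 64))
tree = build(x, sin.(2pi .* x), 0.01, 10)
```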
I think the `prune_tree` function should be reserved for actual post-pruning, where you input a validation set and prune out the leaves whose validation accuracy, with respect to some loss function, is lower than some proportion of the training accuracy or some fixed user-input value.
What do you think?
@Eight1911
I don't think we need another pre-pruning criterion. Note that scikit-learn is deprecating `min_impurity_split` in favor of `min_impurity_decrease`.
I'm, personally, quite comfortable with the current implementation of `prune_tree` for post-pruning purposes. Its leaf purity definition is intuitive, regardless of which splitting criterion was used to build the tree.
We could instead focus our attention on more impactful development activities, i.e. optimizing build and apply execution times, or adding AdaBoost functionality for regression (since regression trees now support weights).