
What method is used to compute the Gini split index in the Classification and Regression Tree?

Open nizariyah opened this issue 5 years ago • 3 comments

Is it brute force or midpoint?

nizariyah avatar Mar 05 '20 11:03 nizariyah

Hi @nizariyah thanks for the question

Classification Tree minimizes Gini impurity and Extra Tree Classifier minimizes entropy at the leaf nodes of the tree. For regression, both Regression Tree and Extra Tree Regressor minimize variance.

That said, for continuous features we use a quantile method similar to the technique described in the CLOUDS paper. With this method, the search space for continuous features is typically about a quarter of the brute-force search space, with no loss in overall accuracy. For categorical features we use brute force.
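To make the quantile idea concrete, here is a minimal Python sketch (not the library's PHP implementation; the function names `gini` and `quantile_candidates` are illustrative, not RubixML API): instead of evaluating every midpoint between sorted values, only a handful of quantiles of the feature are tried as thresholds.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def quantile_candidates(values, k):
    """k evenly spaced quantiles of a continuous feature, used as
    candidate split thresholds instead of every brute-force midpoint."""
    s = sorted(values)
    return [s[int(q * (len(s) - 1))] for q in
            (i / (k + 1) for i in range(1, k + 1))]
```

With 100 distinct values and `k = 3`, only the 25th, 50th, and 75th percentile values are scored with `gini`, rather than all 99 midpoints.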

If the best split fails to reduce the impurity by a user-defined amount (minPurityIncrease), the node is immediately pruned and turned into a leaf to avoid overfitting.
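The pruning criterion can be sketched as follows (a hedged Python illustration, not the library's code; `should_split` is a hypothetical name): the split is kept only when the weighted child impurity drops below the parent impurity by at least the threshold.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def should_split(parent, left, right, min_purity_increase=1e-3):
    """Accept the best split only if it reduces impurity by at least
    min_purity_increase; otherwise the node becomes a leaf."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted >= min_purity_increase
```

A perfect split of `[0, 0, 1, 1]` into `[0, 0]` and `[1, 1]` gains 0.5 impurity and is kept; a split that leaves both children mixed gains nothing and is rejected.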

Thanks, and I'll be happy to answer any follow-ups.

andrewdalpino avatar Mar 05 '20 20:03 andrewdalpino

Hi @nizariyah

The CART implementation has been tuned and optimized in the latest commit https://github.com/RubixML/RubixML/commit/89f6991794c9ee5e7a2f358c1cc52167450684a8

Instead of using a 0.25 ratio of quantiles to samples at the split node, we are now using the square root of the number of samples. Even with this more aggressive search heuristic, accuracy has not changed in any of our tests, including the House Price Predictor example project.

Compared to the median strategy that I believe you are referring to, the quantile method effectively falls back to the median when the number of samples at the split node is low, while choosing more precise (and efficient) splits when a lot of data is available to the splitting algorithm.
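The difference between the two heuristics is easy to see numerically (a sketch; `num_split_candidates` is a hypothetical helper, not RubixML API):

```python
import math

def num_split_candidates(n, ratio=None):
    """Number of candidate thresholds to try at a split node holding
    n samples. ratio=0.25 mimics the old fixed-ratio heuristic; the
    default uses the sqrt(n) heuristic described above."""
    if ratio is not None:
        return max(1, int(ratio * n))
    return max(1, round(math.sqrt(n)))
```

At 10,000 samples the old heuristic scores 2,500 thresholds while sqrt(n) scores only 100; at a handful of samples, sqrt(n) yields one or two candidates, which is essentially the median strategy.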

Hope this helps, and thanks again for this question. I took some time to optimize the CART implementation as a result and we're seeing almost an order of magnitude speedup for large datasets.

andrewdalpino avatar Mar 08 '20 10:03 andrewdalpino

Performance comparison of PHP library Random Forests on the Dota 2 dataset, 10 trees, 0.5 subsample ratio, categorical features:

(Chart: random-forest-speed)

(Chart: random-forest-accuracy)

andrewdalpino avatar Mar 24 '20 06:03 andrewdalpino