More categorical and numeric split types for decision trees
Right now, the decision tree implementation in src/mlpack/methods/decision_tree/ has only AllCategoricalSplit for categorical splits and BestBinaryNumericSplit for numeric splits. Ideally, we would like to expand this to handle some other types of splits.
This is a very open-ended issue: we should survey the literature, find decent split ideas to add, and then implement and test them.
The primary split I am thinking about as I write this ticket is something faster than BestBinaryNumericSplit, based on sampling: instead of exhaustively searching every possible binary numeric split, merely sampling a few could be sufficient.
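For example, here is a rough sketch of the sampling idea; this is just illustrative C++ using Armadillo, and `SampledNumericSplit()` and the Gini computation here are made-up names for the sake of the sketch, not mlpack's actual split interface:

```cpp
// Illustrative only: sample a handful of observed values as candidate
// thresholds instead of evaluating the gain of every possible threshold.
#include <armadillo>

// Gini impurity of a label vector (labels in [0, numClasses)).
double Gini(const arma::Row<size_t>& labels, const size_t numClasses)
{
  if (labels.n_elem == 0)
    return 0.0;
  arma::vec counts(numClasses, arma::fill::zeros);
  for (const size_t l : labels)
    counts[l]++;
  const arma::vec p = counts / (double) labels.n_elem;
  return 1.0 - arma::accu(p % p);
}

// Return the best threshold among `numSamples` randomly sampled candidate
// splits on one dimension; `gain` receives the corresponding Gini gain.
double SampledNumericSplit(const arma::rowvec& data,
                           const arma::Row<size_t>& labels,
                           const size_t numClasses,
                           const size_t numSamples,
                           double& gain)
{
  const double parentGini = Gini(labels, numClasses);
  double bestThreshold = arma::datum::nan;
  gain = 0.0;

  for (size_t i = 0; i < numSamples; ++i)
  {
    // Use a randomly chosen observed value as the candidate threshold.
    const double t = data[arma::randi<arma::uvec>(1,
        arma::distr_param(0, data.n_elem - 1))[0]];

    const arma::uvec left = arma::find(data <= t);
    const arma::uvec right = arma::find(data > t);
    if (left.is_empty() || right.is_empty())
      continue;  // Degenerate split; try another sample.

    // Weighted average of the children's impurities.
    const double childGini =
        (left.n_elem * Gini(labels.cols(left), numClasses) +
         right.n_elem * Gini(labels.cols(right), numClasses)) /
        (double) data.n_elem;

    if (parentGini - childGini > gain)
    {
      gain = parentGini - childGini;
      bestThreshold = t;
    }
  }

  return bestThreshold;
}
```

The hope is that a handful of candidate thresholds is often nearly as good as the exhaustive scan, at a fraction of the cost.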
Another interesting idea is "Extremely Randomized Trees": http://orbi.ulg.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf
In that idea, you don't use the data to determine the split; you merely choose the split randomly. (That tends to be best in ensemble settings, and we don't have a random forest right now, but that will change soon-ish.)
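For comparison, the Extra-Trees-style numeric split needs no label information at all. A minimal sketch (`RandomNumericSplit()` is an illustrative name, not an existing mlpack class):

```cpp
// Illustrative only: the cut point is drawn uniformly at random between the
// observed minimum and maximum of the dimension, without consulting labels.
#include <armadillo>

// Pick a random threshold in (min, max) for the given dimension's values.
// Returns NaN when all values are identical (no valid split exists).
double RandomNumericSplit(const arma::rowvec& data)
{
  const double lo = data.min();
  const double hi = data.max();
  if (lo == hi)
    return arma::datum::nan;  // Degenerate dimension; caller should skip it.

  // arma::randu() draws from U(0, 1); scale it into the observed range.
  return lo + arma::randu() * (hi - lo);
}
```

Individual trees built this way are weaker, but in an ensemble the extra randomness mostly shows up as variance that averages out.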
@rcurtin Looking forward to working on this issue. So should I implement an "Extremely Randomized Tree" in this case?
Sure, that is one idea. I think the ERT probably won't be useful until we have some kind of random forest class to ensemble them, but I am working on a random forest implementation that can use any type of tree, so that won't be a problem. :)
Here is another potential idea for a numeric split that I came across today: http://web.cs.iastate.edu/~honavar/elomaa-multisplit.pdf
Any updates on this issue?
I would like to take up this issue and implement ERT, now that the random forest implementation is in place.
Also, I had an idea of implementing post-pruning for the decision tree, since post-pruning can help both interpretability and generalization performance.
Any suggestions?
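Roughly, I am thinking of reduced-error pruning against a held-out validation set. Here is a minimal self-contained sketch of the idea; the `Node` struct and `Prune()` below are just illustrative, not mlpack's DecisionTree internals:

```cpp
// Illustrative only: walk the tree bottom-up and collapse any internal node
// whose removal does not hurt accuracy on a held-out validation set.
#include <cstddef>
#include <memory>
#include <vector>

struct Node
{
  std::unique_ptr<Node> left, right;
  size_t splitDim = 0;       // Dimension this node splits on.
  double threshold = 0.0;    // Split threshold.
  size_t majorityClass = 0;  // Prediction if this node becomes a leaf.

  bool IsLeaf() const { return !left && !right; }

  size_t Classify(const std::vector<double>& point) const
  {
    if (IsLeaf())
      return majorityClass;
    return (point[splitDim] <= threshold) ? left->Classify(point)
                                          : right->Classify(point);
  }
};

// Count correct predictions of the tree rooted at `root` on the validation set.
size_t Correct(const Node& root,
               const std::vector<std::vector<double>>& valData,
               const std::vector<size_t>& valLabels)
{
  size_t c = 0;
  for (size_t i = 0; i < valData.size(); ++i)
    if (root.Classify(valData[i]) == valLabels[i])
      c++;
  return c;
}

// Bottom-up reduced-error pruning: tentatively turn `node` into a leaf and
// keep the change if validation accuracy of the whole tree does not drop.
void Prune(Node& root, Node& node,
           const std::vector<std::vector<double>>& valData,
           const std::vector<size_t>& valLabels)
{
  if (node.IsLeaf())
    return;
  Prune(root, *node.left, valData, valLabels);
  Prune(root, *node.right, valData, valLabels);

  const size_t before = Correct(root, valData, valLabels);
  std::unique_ptr<Node> savedLeft = std::move(node.left);
  std::unique_ptr<Node> savedRight = std::move(node.right);
  if (Correct(root, valData, valLabels) < before)
  {
    // Pruning hurt accuracy; restore the children.
    node.left = std::move(savedLeft);
    node.right = std::move(savedRight);
  }
}
```

Ties favor the pruned tree, so the result is never larger than the input and no less accurate on the validation set.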
There have been no updates on this, so you are welcome to implement something.
Sure, thanks. I would like to take up post-pruning for the decision tree first and then move on to ERT. I will keep posting updates here as I make progress.
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! :+1:
Hey @rcurtin, is this the issue you were mentioning yesterday about pruning the decision tree (see the earlier comment by Manthan-R-Sheth)?
Also, I am interested in working on this. Since we now have a random forest, shall I take up this issue?
Hey @rcurtin, are there any split types still left to implement? I would like to work on this.
Should I open a PR adding random splitting for categorical features to the decision tree?
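To clarify what I mean by random splitting, here is a tiny illustrative sketch (not existing mlpack code) that assigns each category to a child uniformly at random:

```cpp
// Illustrative only: ERT-style categorical split. Each category is sent to
// child 0 or child 1 at random, re-drawing until both sides are non-empty.
// Assumes numCategories >= 2, otherwise no valid split exists.
#include <armadillo>

// Returns a 0/1 assignment of each of `numCategories` categories to a child.
arma::uvec RandomCategoricalSplit(const size_t numCategories)
{
  arma::uvec assignment;
  do
  {
    // Draw each category's child (0 or 1) uniformly at random.
    assignment = arma::randi<arma::uvec>(numCategories,
        arma::distr_param(0, 1));
  } while (arma::all(assignment == 0) || arma::all(assignment == 1));
  return assignment;
}
```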