scikit-tree
Implement split nodes that can consider categorical features
We would need to enable this in the sklearn fork's splitter. Unfortunately, the original PR in upstream sklearn was never merged: https://github.com/scikit-learn/scikit-learn/pull/12866.
- Generalize the "threshold of the split" to be either a numeric threshold or a categorical bit selector
- Implement Breiman's shortcut for binary classification with categorical splits
- Implement the general categorical split that evaluates up to 2^8 possible random categories for splitting
- Implement the Python API layer in `BaseDecisionTree` and follow the `HistGradientBoosting*` API patterns
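
To make the first two items concrete, here is a minimal pure-Python sketch, not the fork's Cython implementation. It assumes categories are encoded as small non-negative integers and labels are 0/1. The names `goes_left` and `best_categorical_split` are hypothetical. The bit selector is a plain integer bitmask where bit `c` set means "category `c` goes left". Breiman's shortcut orders the categories by their positive-class fraction and then only evaluates the contiguous cuts of that ordering, rather than every subset.

```python
from collections import defaultdict


def gini(pos, n):
    """Gini impurity of a binary node with `pos` positives out of `n` samples."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def goes_left(category, bitmask):
    """Categorical bit selector: bit `category` set means the sample goes left."""
    return (bitmask >> category) & 1 == 1


def best_categorical_split(categories, labels):
    """Breiman's shortcut for binary (0/1) labels.

    Sort categories by their fraction of positives, then scan the k - 1
    contiguous cuts of that ordering. Returns (bitmask, weighted_gini).
    If only one category is present, no cut exists and (0, inf) is returned.
    """
    count = defaultdict(int)
    pos = defaultdict(int)
    for c, y in zip(categories, labels):
        count[c] += 1
        pos[c] += y

    ordered = sorted(count, key=lambda c: pos[c] / count[c])
    n_total = len(labels)
    pos_total = sum(pos.values())

    best_mask, best_score = 0, float("inf")
    mask = n_left = pos_left = 0
    for c in ordered[:-1]:  # the last category must stay on the right
        mask |= 1 << c
        n_left += count[c]
        pos_left += pos[c]
        n_right = n_total - n_left
        score = (
            n_left * gini(pos_left, n_left)
            + n_right * gini(pos_total - pos_left, n_right)
        ) / n_total
        if score < best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score


cats = [0, 0, 1, 1, 2, 2, 3, 3]
ys = [1, 1, 0, 0, 1, 1, 0, 0]
mask, score = best_categorical_split(cats, ys)
print(bin(mask), score)
```

Python integers have arbitrary width, so the mask above has no category limit. A compiled splitter would presumably use a fixed-width bitset and cap the number of categories accordingly. That is an assumption about the eventual design, not something this issue specifies.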
Will be closed by: https://github.com/neurodata/scikit-learn/pull/46
A benchmark on the OpenML-CC18 datasets that have categorical features would be nice: https://github.com/scikit-learn/scikit-learn/pull/12866#issuecomment-455350207
Basically, run sklearn without categorical support (using one-hot encoding) versus with categorical support:
- track runtime
- track performance

Then compare both.
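
A rough harness for that comparison could look like the sketch below. It only assumes each candidate exposes the scikit-learn-style `fit(X, y)` and `score(X, y)` methods. The `benchmark` function and the `MajorityClass` stand-in are hypothetical and exist only so the snippet runs without dependencies. In a real run the two candidates would be a one-hot-encoding pipeline around the current tree and the categorical-aware tree.

```python
import time


def benchmark(candidates, X_train, y_train, X_test, y_test):
    """Fit each candidate, recording fit time (seconds) and test score.

    `candidates` maps a name to a zero-argument factory returning an object
    with scikit-learn-style `fit(X, y)` and `score(X, y)` methods.
    """
    results = {}
    for name, make_model in candidates.items():
        model = make_model()
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_seconds = time.perf_counter() - start
        results[name] = {
            "fit_seconds": fit_seconds,
            "score": model.score(X_test, y_test),
        }
    return results


class MajorityClass:
    """Stand-in estimator (always predicts the most frequent training label)."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        return sum(1 for target in y if target == self.label_) / len(y)


results = benchmark(
    {"baseline": MajorityClass},
    X_train=[[0], [1], [2]],
    y_train=[1, 1, 0],
    X_test=[[0], [1]],
    y_test=[1, 0],
)
for name, row in results.items():
    print(name, row)
```

Timing a single fit is noisy. Repeating each fit several times and reporting the median would give a fairer comparison across datasets.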
Consider https://arxiv.org/pdf/1908.09874v3.pdf