
Implement split nodes that can consider categorical features

Open adam2392 opened this issue 2 years ago • 3 comments

We would need to enable this in the splitter of the sklearn fork. Unfortunately, the original PR in upstream sklearn was never merged: https://github.com/scikit-learn/scikit-learn/pull/12866.

  1. Generalize the "threshold of the split" so that it is either a numeric threshold or a categorical bit selector
  2. Implement Breiman's shortcut for binary classification with categorical splits
  3. Implement the general categorical split that evaluates up to 2^8 possible random categories for splitting
  4. Implement the Python API layer in BaseDecisionTree, following the HistGradientBoosting* API patterns
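
Items 1 and 2 can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the fork's Cython splitter: the function names (`goes_left`, `weighted_gini`, `breiman_best_split`, `exhaustive_best_split`) and the per-category `(n_samples, n_positive)` count layout are made up for this example, and Gini impurity is assumed as the criterion. The bit selector is modeled as an integer whose bit `c` says whether category `c` goes left.

```python
def goes_left(category: int, bitset: int) -> bool:
    """Categorical split rule (assumed encoding): left iff bit `category` is set."""
    return (bitset >> category) & 1 == 1


def gini(n_pos: int, n: int) -> float:
    """Binary Gini impurity of a node with n samples, n_pos of them positive."""
    if n == 0:
        return 0.0
    p = n_pos / n
    return 2.0 * p * (1.0 - p)


def weighted_gini(bitset: int, counts) -> float:
    """Impurity of the split; counts[c] = (n_samples, n_positive) for category c."""
    n_l = pos_l = n_r = pos_r = 0
    for c, (n, pos) in enumerate(counts):
        if goes_left(c, bitset):
            n_l, pos_l = n_l + n, pos_l + pos
        else:
            n_r, pos_r = n_r + n, pos_r + pos
    return (n_l * gini(pos_l, n_l) + n_r * gini(pos_r, n_r)) / (n_l + n_r)


def breiman_best_split(counts):
    """Breiman's shortcut: sort categories by positive rate, scan K-1 prefix cuts."""
    order = sorted(
        range(len(counts)),
        key=lambda c: counts[c][1] / counts[c][0] if counts[c][0] else 0.0,
    )
    best_bitset, best_impurity = 0, float("inf")
    bitset = 0
    for c in order[:-1]:  # the last prefix would send every category left
        bitset |= 1 << c
        impurity = weighted_gini(bitset, counts)
        if impurity < best_impurity:
            best_bitset, best_impurity = bitset, impurity
    return best_bitset, best_impurity


def exhaustive_best_split(counts):
    """Reference check: try every non-trivial subset (exponential in K)."""
    best_bitset, best_impurity = 0, float("inf")
    for bitset in range(1, 2 ** len(counts) - 1):
        impurity = weighted_gini(bitset, counts)
        if impurity < best_impurity:
            best_bitset, best_impurity = bitset, impurity
    return best_bitset, best_impurity


# Four categories with positive rates 0.1, 0.9, 0.2, 0.8.
counts = [(10, 1), (10, 9), (10, 2), (10, 8)]
fast_bitset, fast_impurity = breiman_best_split(counts)
_, slow_impurity = exhaustive_best_split(counts)
print(bin(fast_bitset), fast_impurity, slow_impurity)
```

The shortcut only needs K-1 evaluations instead of all subsets, which is why it applies to the binary classification case; for the general case in item 3, candidate bitsets would have to be sampled or enumerated instead.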

adam2392 • Jun 23 '23 18:06

Will be closed by: https://github.com/neurodata/scikit-learn/pull/46

adam2392 • Jul 19 '23 23:07

A benchmark on the OpenML-CC18 datasets that have categorical features would be nice: https://github.com/scikit-learn/scikit-learn/pull/12866#issuecomment-455350207

Basically, run sklearn without categorical support (using one-hot encoding) vs. with categorical support:

  • track runtime
  • track performance

Then compare the two.
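
A rough sketch of what such a comparison harness could look like, using only the standard library. Everything here is hypothetical: `benchmark` and `majority_fit_predict` are illustrative names, accuracy stands in for "performance", and the majority-class predictor is just a placeholder where the one-hot pipeline and the categorical-aware tree would be plugged in as `fit_predict` callables.

```python
import time
from collections import Counter


def benchmark(name, fit_predict, X_train, y_train, X_test, y_test, n_repeats=3):
    """Time fit+predict and score accuracy, averaged over n_repeats runs."""
    runtimes, accuracies = [], []
    for _ in range(n_repeats):
        start = time.perf_counter()
        y_pred = fit_predict(X_train, y_train, X_test)
        runtimes.append(time.perf_counter() - start)
        n_correct = sum(int(p == t) for p, t in zip(y_pred, y_test))
        accuracies.append(n_correct / len(y_test))
    return {
        "name": name,
        "runtime_s": sum(runtimes) / n_repeats,
        "accuracy": sum(accuracies) / n_repeats,
    }


def majority_fit_predict(X_train, y_train, X_test):
    """Placeholder model: always predicts the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return [majority] * len(X_test)


# Toy data with one categorical column; a real run would load OpenML-CC18 tasks.
X_train, y_train = [["a"], ["b"], ["a"]], [1, 1, 0]
X_test, y_test = [["a"], ["b"], ["b"], ["a"]], [1, 1, 1, 0]

# In the real comparison there would be two entries: one-hot vs. categorical.
results = [
    benchmark("placeholder", majority_fit_predict, X_train, y_train, X_test, y_test),
]
for row in results:
    print(f"{row['name']}: {row['runtime_s']:.6f}s, accuracy={row['accuracy']:.3f}")
```

Keeping the model behind a single `fit_predict` callable means both variants are timed over exactly the same steps, including any encoding cost on the one-hot side.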

adam2392 • Jul 19 '23 23:07

Consider https://arxiv.org/pdf/1908.09874v3.pdf

jovo • Aug 19 '23 08:08