menelaus
Improve handling of categorical columns in KDQTreePartitioner
Overview: Currently, KDQTreePartitioner behaves unpredictably on datasets with columns containing categorical, n-hot encoded, or ordinal data. Fixing this will generalize KDQTreePartitioner to mixed-type datasets.
Details: For example, if one column in a dataset is a 0/1 variable, the first time build/fill splits on it, all 0-rows are sent one way and all 1-rows the other. Each child then holds a single unique value in that column, so the column cannot be split again, and the leaf nodes can end up with many more data points than the upper bound count_threshold suggests.
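To make the failure mode concrete, here is a minimal toy sketch in NumPy. It is not the KDQTreePartitioner implementation: the function `leaf_sizes`, its `min_unique` parameter, and the midpoint split rule are assumptions chosen only to mimic a kd-style splitter with a uniqueness criterion.

```python
import numpy as np

def leaf_sizes(data, count_threshold, min_unique=2, axis=0):
    """Toy kd-style splitter (hypothetical); returns the row count of each leaf."""
    n_rows, n_cols = data.shape
    if n_rows <= count_threshold:
        return [n_rows]
    column = data[:, axis]
    # Uniqueness criterion: too few distinct values, so stop and make a leaf
    if np.unique(column).size < min_unique:
        return [n_rows]
    # Split at the midpoint of the column's range, then move to the next column
    cut = (column.min() + column.max()) / 2.0
    next_axis = (axis + 1) % n_cols
    left = leaf_sizes(data[column <= cut], count_threshold, min_unique, next_axis)
    right = leaf_sizes(data[column > cut], count_threshold, min_unique, next_axis)
    return left + right

# Column 0 is a 0/1 variable; column 1 has 500 distinct values per group
data = np.column_stack([np.repeat([0, 1], 500), np.tile(np.arange(500), 2)])
sizes = leaf_sizes(data, count_threshold=100)
print(sizes)  # [250, 250, 250, 250]
```

In this toy, every leaf holds 250 rows even though the threshold is 100: once the splitter cycles back to the 0/1 column, the uniqueness criterion stops the recursion although the other column could still be split.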
- The uniqueness criterion (stop splitting if a column has too few unique values) is needed to prevent endless recursion. With the min_cutpoint_sizeproportion added, maybe we can remove this safely, as the uniqueness criterion is what prematurely sends too many points to a leaf node.
- We can preprocess the data passed to build/fill: using information passed by the user (or determined by ourselves), we can treat problematic columns specially (skip them if the unique values are too few, etc.).
- We may introduce a split for each value in the category, and force the tree to split as such on the problematic columns.
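The second and third options can be combined in a small toy sketch. Again, this is not the library's code: `leaf_sizes_mixed` and its `categorical` argument (a set of column indices supplied by the caller) are hypothetical names, and the midpoint split for numeric columns is an assumption. Columns that are constant within a cell are skipped rather than ending the recursion, and flagged columns get one child per distinct value.

```python
import numpy as np

def leaf_sizes_mixed(data, count_threshold, categorical=frozenset(), axis=0):
    """Toy splitter (hypothetical); returns the row count of each leaf."""
    n_rows, n_cols = data.shape
    if n_rows <= count_threshold:
        return [n_rows]
    # Starting at `axis`, look for the first column that still varies in this cell
    for offset in range(n_cols):
        col_idx = (axis + offset) % n_cols
        column = data[:, col_idx]
        values = np.unique(column)  # sorted distinct values
        if values.size > 1:
            break
    else:
        return [n_rows]  # all columns constant: rows are identical, cannot split
    next_axis = (col_idx + 1) % n_cols
    if col_idx in categorical:
        # One child per category value
        children = [data[column == v] for v in values]
    else:
        # Midpoint split; both children are non-empty because min <= cut < max
        cut = (values[0] + values[-1]) / 2.0
        children = [data[column <= cut], data[column > cut]]
    sizes = []
    for child in children:
        sizes.extend(leaf_sizes_mixed(child, count_threshold, categorical, next_axis))
    return sizes

# Column 0 is a 0/1 variable flagged as categorical; column 1 is numeric
data = np.column_stack([np.repeat([0, 1], 500), np.tile(np.arange(500), 2)])
sizes = leaf_sizes_mixed(data, count_threshold=100, categorical={0})
print(max(sizes), sum(sizes))
```

Because every split produces strictly smaller non-empty children, the recursion terminates without a uniqueness-based stopping rule, and a leaf can exceed the threshold only when all of its rows are identical.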
Note that, once kdq-tree is set up to use dataframes, we can "expect" the categorical dtype to treat these columns appropriately. Update the example(s) accordingly!
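As a rough illustration of the dataframe route, the sketch below uses pandas to find columns with the categorical dtype. The column names and the variables `categorical_columns` and `categorical_idx` are made up for the example, and how a partitioner would consume them is left open.

```python
import pandas as pd

# Hypothetical mixed-type data: "flag" is categorical, "x" is numeric
df = pd.DataFrame({
    "flag": pd.Categorical([0, 1, 1, 0]),
    "x": [0.1, 0.4, 0.35, 0.8],
})

# Names of the columns whose dtype is categorical
categorical_columns = list(df.select_dtypes(include="category").columns)

# Positional indices of those columns, e.g. for array-based tree code
categorical_idx = {df.columns.get_loc(name) for name in categorical_columns}

print(categorical_columns, categorical_idx)
```

A partitioner could use these positions to decide which columns need special treatment, without asking the user to list them separately.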