Improve handling of categorical columns in KDQTreePartitioner

Open tms-bananaquit opened this issue 3 years ago • 0 comments

Overview: Currently, the behavior of KDQTreePartitioner on datasets with columns containing categorical, n-hot encoded, or ordinal data is volatile. Fixing this will generalize KDQTreePartitioner to mixed-type datasets.

Details: For example, if one column in a dataset is a 0/1 variable, then the first time build/fill splits on it, all 0-rows are sent one way. The leaf nodes can hence contain many more data points than the upper bound count_threshold suggests.
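To make the failure mode concrete, here is a minimal pure-Python sketch. It does not call KDQTreePartitioner; `midpoint_split` is a hypothetical helper, and cutting at the midpoint of the column's range is an assumption about how the split is chosen. The data and the threshold value are made up for illustration.

```python
def midpoint_split(values):
    """Hypothetical helper: cut a column at the midpoint of its range."""
    cut = (min(values) + max(values)) / 2
    left = [v for v in values if v < cut]
    right = [v for v in values if v >= cut]
    return cut, left, right


# Made-up 0/1 column: 90 zeros and 10 ones.
flag_column = [0] * 90 + [1] * 10
count_threshold = 20  # illustrative upper bound on leaf size

cut, left, right = midpoint_split(flag_column)

# After one split each side holds a single unique value, so a
# uniqueness check would stop here even though the left side is
# still far above count_threshold.
can_split_left = len(set(left)) > 1
print(len(left), len(right), can_split_left)
```

With these numbers the left child keeps all 90 zero-rows and cannot be split again on this column, which is the oversized-leaf situation described above.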

The uniqueness criterion (stop splitting if a column has too few unique values) is needed to prevent endless recursion. Now that the min_cutpoint_size proportion has been added, we may be able to remove it safely, since the uniqueness criterion is what prematurely sends too many points to a leaf node.
We can preprocess the data passed to build/fill: using information supplied by the user (or determined by ourselves), we can treat problematic columns specially (skip them if they have too few unique values, etc.).
We may introduce a split for each value in the category, and force the tree to split that way on the problematic columns.
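The last option, one split per category value, could look roughly like the sketch below. `split_by_category` is a hypothetical helper, not part of KDQTreePartitioner, and the rows are made-up tuples of (category, numeric value).

```python
from collections import defaultdict


def split_by_category(rows, col):
    """Hypothetical helper: one child per distinct value in column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row)
    return dict(groups)


# Made-up rows: (category, numeric feature).
rows = [("a", 0.1), ("b", 0.4), ("a", 0.7), ("c", 0.2), ("b", 0.9)]

children = split_by_category(rows, 0)
for value, child_rows in sorted(children.items()):
    print(value, len(child_rows))
```

Each child could then be split further on the remaining numeric columns only, so the categorical column is consumed exactly once and never triggers the uniqueness check.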

Note that, once kdq-tree is set up to use dataframes, we can "expect" the categorical dtype to treat these columns appropriately. Update the example(s) accordingly!
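As a rough idea of what relying on the categorical dtype could mean, the sketch below separates columns by dtype with pandas. The dataframe and its column names are made up, and nothing here reflects how kdq-tree will actually consume dataframes.

```python
import pandas as pd

# Made-up mixed-type dataframe; column names are illustrative only.
df = pd.DataFrame(
    {
        "x": [0.1, 0.5, 0.9, 0.3],
        "flag": pd.Categorical([0, 1, 0, 1]),
        "color": pd.Categorical(["red", "blue", "red", "green"]),
    }
)

# Columns the user has marked as categorical via their dtype.
categorical_cols = [
    c for c in df.columns if isinstance(df[c].dtype, pd.CategoricalDtype)
]
# Everything else would go through the usual numeric splitting.
numeric_cols = [c for c in df.columns if c not in categorical_cols]

print(categorical_cols, numeric_cols)
```

Under this assumption the user opts in by casting columns with `astype("category")`, and the partitioner does not have to guess from unique-value counts.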

Jun 09 '22 20:06 tms-bananaquit