Improve handling of categorical columns in KDQTreePartitioner

Open tms-bananaquit opened this issue 3 years ago • 0 comments

Overview: Currently, KDQTreePartitioner behaves unpredictably on datasets whose columns contain categorical, n-hot encoded, or ordinal data. Fixing this would generalize KDQTreePartitioner to mixed-type datasets.

Details: For example, if one column in a dataset is a 0/1 variable, the first time build/fill splits on it, all 0-rows are sent one way and all 1-rows the other. After that, the column holds a single value within each child, so the leaf nodes can end up with many more data points than the upper bound count_threshold suggests.
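To make the failure mode concrete, here is a minimal toy sketch. It is not the menelaus implementation: `leaf_sizes` is a hypothetical helper, and the split rule (cycle through columns, cut at the midpoint of the column's range, stop when a column has fewer than two unique values) is an assumed simplification of the build/fill logic.

```python
import numpy as np

def leaf_sizes(data, count_threshold, axis=0):
    """Toy kdq-style recursion (assumed rules, not the library's code).

    Returns the number of rows that end up in each leaf.
    """
    n_rows, n_cols = data.shape
    # Stop: the node is already small enough.
    if n_rows <= count_threshold:
        return [n_rows]
    col = data[:, axis]
    # Stop: assumed form of the uniqueness criterion.
    if np.unique(col).size < 2:
        return [n_rows]
    # Split at the midpoint of the column's range, then move to the next column.
    cut = (col.min() + col.max()) / 2.0
    nxt = (axis + 1) % n_cols
    return (leaf_sizes(data[col <= cut], count_threshold, nxt)
            + leaf_sizes(data[col > cut], count_threshold, nxt))

# Column 0 is a 0/1 variable, column 1 is continuous.
binary = np.tile([0, 1], 500)
continuous = np.linspace(0.0, 1.0, 1000)
data = np.column_stack([binary, continuous])

sizes = leaf_sizes(data, count_threshold=10)
print(sizes, "max leaf size:", max(sizes))
```

In this toy setup the recursion halts as soon as it returns to the binary column, so the leaves stay far larger than `count_threshold` even though the continuous column could still be split.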

  1. The uniqueness criterion (stop splitting if a column has too few unique values) is needed to prevent endless recursion. Now that the min_cutpoint_size proportion has been added, we may be able to remove it safely, since the uniqueness criterion is what prematurely sends too many points to a leaf node.
  2. We can preprocess the data passed to build/fill: using information supplied by the user or determined by ourselves, we can treat problematic columns specially (e.g., skip them if they have too few unique values).
  3. We could introduce one split per value of the category and force the tree to split that way on the problematic columns.
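Options 2 and 3 could look roughly like the sketch below. Both helpers (`low_cardinality_columns`, `split_by_category`) and the `min_unique` parameter are hypothetical names introduced for illustration; nothing here is part of the current KDQTreePartitioner API.

```python
import numpy as np

def low_cardinality_columns(data, min_unique):
    """Option 2 (assumed form): flag columns with fewer than `min_unique` distinct values."""
    return [j for j in range(data.shape[1])
            if np.unique(data[:, j]).size < min_unique]

def split_by_category(data, col):
    """Option 3 (assumed form): one child per distinct value in column `col`."""
    return {float(value): data[data[:, col] == value]
            for value in np.unique(data[:, col])}

# Column 0 takes three category codes, column 1 is continuous.
data = np.column_stack([np.tile([0, 1, 2], 4), np.linspace(0.0, 1.0, 12)])

flagged = low_cardinality_columns(data, min_unique=5)
children = split_by_category(data, flagged[0])
for value, rows in children.items():
    print(value, rows.shape)
```

A partitioner could use the flagged list either to skip those columns during midpoint splitting or to apply the per-value split once and then continue with the remaining columns.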

Note that, once kdq-tree is set up to use dataframes, we can "expect" the categorical dtype to treat these columns appropriately. Update the example(s) accordingly!
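As a rough illustration of relying on the categorical dtype, the sketch below separates categorical columns from the rest using pandas. The column names are made up, and how kdq-tree would consume the two lists is left open, since dataframe support is not described here.

```python
import pandas as pd

df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red", "green"]),
    "flag": [0, 1, 0, 1],          # 0/1 integers are NOT detected automatically
    "x": [0.1, 0.4, 0.35, 0.8],
})

# Columns the user has explicitly marked as categorical.
categorical_cols = list(df.select_dtypes(include="category").columns)
other_cols = [c for c in df.columns if c not in categorical_cols]
print(categorical_cols, other_cols)

# The user can opt a 0/1 column in by casting it.
df["flag"] = df["flag"].astype("category")
categorical_after_cast = list(df.select_dtypes(include="category").columns)
print(categorical_after_cast)
```

This puts the burden on the user to cast integer-coded columns such as `flag`, which is the trade-off of "expecting" the dtype rather than inferring it.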

tms-bananaquit · Jun 09 '22 20:06