DRAFT: add other_value and top_k transform for cohorts

Open gbowlin opened this issue 5 months ago • 0 comments

Overview

Implements a cohort transformation that creates an "Other" placeholder value for cohort columns with many distinct values, where we expect a long tail of values with small counts. The tail values can be grouped together meaningfully as "Other", or ignored by marking them as np.nan or None.

Description of changes

Adds a cohort transform that keeps the top_k most frequent values of a cohort column and renames the remaining low-count values to the other_value group.

  cohorts:
    - source: many_values_column 
      display_name: All Different Values
    - source: many_values_column 
      display_name: Top 5 or Other
      top_k: 5
      other_value: "Other"
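The intended behavior of top_k and other_value can be sketched in plain Python. This is a minimal illustration only, not the code in this PR: the function name top_k_or_other is hypothetical, the sketch works on a list rather than a DataFrame column, and it assumes that missing values (None) are left untouched and do not count toward the top k.

```python
from collections import Counter


def top_k_or_other(values, top_k, other_value="Other"):
    """Keep the top_k most frequent values; replace the rest with other_value.

    Hypothetical helper for illustration. Assumes None marks a missing
    value, which is passed through unchanged and excluded from the counts.
    """
    # Count only the non-missing values.
    counts = Counter(v for v in values if v is not None)
    # most_common orders by descending count; ties keep first-seen order.
    keep = {value for value, _ in counts.most_common(top_k)}
    return [v if (v is None or v in keep) else other_value for v in values]


example = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
grouped = top_k_or_other(example, top_k=2)
# "a" and "b" are kept; "c" and "d" fall into the "Other" group.
print(Counter(grouped))
```

Passing other_value=None instead of "Other" would mark the tail as missing, which corresponds to the "ignore" option described in the overview.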

Author Checklist

  • [ ] Linting passes; run early with pre-commit hook.
  • [ ] Tests added for new code and issue being fixed.
  • [ ] Added type annotations and full numpy-style docstrings for new methods.
  • [ ] Draft your news fragment in new changelog/ISSUE.TYPE.rst files; see changelog/README.md.

gbowlin, Aug 01 '25 14:08