dbt-ml-preprocessing

Narrow scope of one_hot_encoder OHE

Open dataders opened this issue 3 years ago • 0 comments

In reviewing #4, I echo @comaraDOTcom's sentiment when they say they'd be "more inclined to use the output of the macro as a CTE and select a subset of cols in a subsequent CTE." To that end, I'd like to propose two macros (names debatable):

  • one_hot_encoder_wrapper: the existing functionality with include_columns and exclude_columns, and
  • one_hot_encoder which takes only the source_table, source_column, category_values, and handle_unknown params.
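
As a rough illustration of the narrower macro, the Python sketch below mimics what it might render: only the encoded column expressions, one per category, with no SELECT or FROM. The function name render_one_hot_columns, the CASE expression form, and the is_<column>_<value> aliases (taken from the example usage in this issue) are assumptions for illustration, not the package's actual implementation; handle_unknown and source_table are left out to keep it short.

```python
def render_one_hot_columns(source_column, category_values):
    """Return a comma-separated SQL fragment of one-hot column expressions.

    Hypothetical stand-in for the proposed narrow macro: it emits column
    expressions only, so the caller supplies SELECT and FROM.
    Assumes each category value is also usable as an identifier suffix.
    """
    expressions = []
    for value in category_values:
        # Double any single quotes so the value is a valid SQL string literal.
        literal = str(value).replace("'", "''")
        expressions.append(
            f"CASE WHEN {source_column} = '{literal}' THEN 1 ELSE 0 END "
            f"AS is_{source_column}_{value}"
        )
    return ",\n".join(expressions)


fragment = render_one_hot_columns("color", ["orange", "yellow"])
print(fragment)
```

Because the fragment contains no SELECT of its own, it can sit next to hand-written columns or the output of dbt_utils.star() in an ordinary model.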

To me, the benefits would be:

  • more direct alignment with sklearn.preprocessing.OneHotEncoder and pandas.get_dummies()
  • a smaller code footprint for adapters that require dispatching
  • a better complement to existing package functionality such as dbt_utils.star()
  • dbt models that use one_hot_encoder will look like other dbt models, instead of a single macro call with no SELECT statement

Example usage

Suppose a table, fruits, in which:

  • there are 3 columns: id, species, and color; and
  • the color column has two values: orange and yellow

Goal compiled SQL

SELECT
id,
species,
is_color_orange,
is_color_yellow
FROM database.fruits
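
Since fruits itself has no is_color_* columns, the compiled query would presumably contain one expression per category rather than bare column names. The sketch below checks one such expansion against an in-memory SQLite table using Python's standard sqlite3 module. The sample rows and the CASE form are assumptions made for illustration, not output produced by the package.

```python
import sqlite3

# Assumed expansion of the goal query: one CASE expression per color value.
QUERY = """
SELECT
    id,
    species,
    CASE WHEN color = 'orange' THEN 1 ELSE 0 END AS is_color_orange,
    CASE WHEN color = 'yellow' THEN 1 ELSE 0 END AS is_color_yellow
FROM fruits
ORDER BY id
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (id INTEGER, species TEXT, color TEXT)")
# Made-up sample rows covering both color values.
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?, ?)",
    [(1, "tangerine", "orange"), (2, "banana", "yellow"), (3, "lemon", "yellow")],
)
rows = conn.execute(QUERY).fetchall()
conn.close()

for row in rows:
    print(row)
```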

Possible uses

SELECT
id,
species,
{{ dbt_ml_preprocessing.one_hot_encoder(ref('fruits'), 'color') }}
FROM {{ ref('fruits') }}

Alternatively, if one would like to include or exclude certain columns from the source table, they could do so like this:

SELECT
{{ dbt_utils.star(from=ref('fruits'), except=['color']) }},
{{ dbt_ml_preprocessing.one_hot_encoder(ref('fruits'), 'color') }}
FROM {{ ref('fruits') }}

dataders, Mar 12 '21 09:03