dbt-ml-preprocessing
Narrow scope of one_hot_encoder OHE
In reviewing #4, I echo @comaraDOTcom's sentiment when they say they'd be "more inclined to use the output of the macro as a CTE and select a subset of cols in a subsequent CTE." To that end, I'd like to propose two macros (names debatable):
- `one_hot_encoder_wrapper`: the existing functionality with `include_columns` and `exclude_columns`, and
- `one_hot_encoder`: takes only the `source_table`, `source_column`, `category_values`, and `handle_unknown` params.
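To make the narrowed macro concrete, here is a minimal sketch of what it could look like. This is only an illustration, not the package's current implementation: the default values, the fallback to `dbt_utils.get_column_values()`, and the `is_<column>_<value>` naming are assumptions on my part, and `handle_unknown` is accepted but not acted on. It also assumes the category values are safe to use as column-name suffixes.

```sql
{# Hypothetical sketch of the narrowed macro; defaults and behavior are assumptions. #}
{% macro one_hot_encoder(source_table, source_column, category_values=none, handle_unknown='ignore') %}
    {#- Assumption: when no categories are given, look up the distinct values in the source column. -#}
    {%- if category_values is none -%}
        {%- set category_values = dbt_utils.get_column_values(table=source_table, column=source_column) -%}
    {%- endif -%}
    {#- handle_unknown is kept in the signature but ignored in this sketch. -#}
    {#- Emit one 0/1 column per category, comma-separated, with no SELECT or FROM. -#}
    {%- for category in category_values %}
    case when {{ source_column }} = '{{ category }}' then 1 else 0 end as is_{{ source_column }}_{{ category }}{{ "," if not loop.last else "" }}
    {%- endfor %}
{% endmacro %}
```

Because the sketch returns only a column list, the calling model keeps its own `SELECT` and `FROM`, which is the point of the proposal.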
To me, the benefits would be:
- more direct alignment with `sklearn.preprocessing.OneHotEncoder` and `pandas.get_dummies()`
- enable smaller code footprint for adapters that require dispatching
- better complement existing package functionality such as `dbt_utils.star()`
- this way dbt models that leverage `one_hot_encoder` will look like other dbt models, instead of a single macro call with no `SELECT` statement.
Example usage
Suppose a table, `fruits`, that:
- has 3 columns: `id`, `species`, and `color`; and,
- the `color` column has two values: `orange` and `yellow`
goal compiled SQL
SELECT
id,
species,
is_color_orange,
is_color_yellow
FROM database.fruits
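Since `fruits` itself only has `id`, `species`, and `color`, the two `is_color_*` columns would have to come from expressions generated by the macro. A rough sketch of the fully expanded query, assuming the macro emits one `CASE` expression per category value (the SQL the package actually generates may differ):

```sql
SELECT
    id,
    species,
    -- one 0/1 indicator column per distinct value of color
    CASE WHEN color = 'orange' THEN 1 ELSE 0 END AS is_color_orange,
    CASE WHEN color = 'yellow' THEN 1 ELSE 0 END AS is_color_yellow
FROM database.fruits
```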
possible uses
SELECT
id,
species,
{{ dbt_ml_preprocessing.one_hot_encoder(ref('fruits'), 'color') }}
FROM {{ ref('fruits') }}
Alternatively, if one would like to include or exclude certain columns from the source table, they could do so like this:
SELECT
{{ dbt_utils.star(from=ref('fruits'), except=['color']) }},
{{ dbt_ml_preprocessing.one_hot_encoder(ref('fruits'), 'color') }}
FROM {{ ref('fruits') }}