dbt-ml-preprocessing
Narrow scope of one_hot_encoder OHE
In reviewing #4, I echo @comaraDOTcom's sentiment when they say they'd be "more inclined to use the output of the macro as a CTE and select a subset of cols in a subsequent CTE." To that end, I'd like to propose two macros (names debatable):
- `one_hot_encoder_wrapper`: the existing functionality with `include_columns` and `exclude_columns`, and
- `one_hot_encoder`, which takes only the `source_table`, `source_column`, `category_values`, and `handle_unknown` params.
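To make the narrower macro concrete, here is a minimal sketch of what it could look like. It is not the package's implementation: it assumes `category_values` is always passed explicitly, it leaves out `handle_unknown` (unlisted values simply get 0 in every column), and the `is_<column>_<value>` alias pattern is taken from the example output further down.

```sql
{# Hypothetical sketch only; not the package's actual macro. #}
{% macro one_hot_encoder(source_table, source_column, category_values) %}
    {#- source_table is unused in this sketch. A fuller version would query it
        for distinct values when category_values is not supplied. -#}
    {%- for category in category_values %}
    case when {{ source_column }} = '{{ category }}' then 1 else 0 end
        as is_{{ source_column }}_{{ category }}
        {%- if not loop.last %},{% endif %}
    {%- endfor %}
{% endmacro %}
```

Because the macro emits only the encoded columns, the calling model supplies its own `SELECT` and `FROM`. Category values that are not valid identifiers (spaces, punctuation) would need to be sanitized before being used in an alias, which this sketch does not attempt.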
To me, the benefits would be:
- more direct alignment with `sklearn.preprocessing.OneHotEncoder` and `pandas.get_dummies()`
- enable smaller code footprint for adapters that require dispatching
- better complement existing package functionality such as `dbt_utils.star()`
- this way dbt models that leverage `one_hot_encoder` will look like other dbt models, instead of a single macro call with no `SELECT` statement.
Example usage
Suppose a table, `fruits`, where:
- the table has 3 columns: `id`, `species`, and `color`; and
- the `color` column has two values: `orange` and `yellow`
goal compiled SQL
SELECT
id,
species,
is_color_orange,
is_color_yellow
FROM database.fruits
possible uses
SELECT
id,
species,
{{ dbt_ml_preprocessing.one_hot_encoder(ref('fruits'), 'color') }}
FROM {{ ref('fruits') }}
Alternatively, to include or exclude certain columns from the source table, one could combine the macro with `dbt_utils.star()`:
SELECT
{{ dbt_utils.star(from=ref('fruits'), except=['color']) }},
{{ dbt_ml_preprocessing.one_hot_encoder(ref('fruits'), 'color') }}
FROM {{ ref('fruits') }}
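
For illustration, the first usage above might compile to something like the following. The `CASE` form is an assumption about how the encoded columns would be generated; the proposal only fixes the resulting column names.

```sql
-- Hypothetical compiled output; the CASE expressions are an assumption.
SELECT
    id,
    species,
    CASE WHEN color = 'orange' THEN 1 ELSE 0 END AS is_color_orange,
    CASE WHEN color = 'yellow' THEN 1 ELSE 0 END AS is_color_yellow
FROM database.fruits
```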