Methods to achieve null safety for `deduplicate`
resolves #814 resolves #621
This is a bug fix with no breaking changes.
It also adds two new features:
- optional
row_aliaskeyword argument (type: string, default:none) - optional
columnskeyword argument (type: list, default:none)
Description & motivation
This PR is still in draft status, and more description will be added at a later date.
In the meantime, see #814 (and everything it links to, in particular #713) for background motivation and discussion to-date.
As a summary, this PR gives the user multiple options to achieve null safety for deduplicate:
- The user passes a relation that has been materialized in the database (i.e., it isn't an ephemeral model)
- https://github.com/dbt-labs/dbt-utils/pull/815/commits/fe03f43cc5f57d86006addfa1770ab78230cd725 -- as long as the
relationis not a CTE, it's columns can be fetched via theget_filtered_columns_in_relationmacro
- https://github.com/dbt-labs/dbt-utils/pull/815/commits/fe03f43cc5f57d86006addfa1770ab78230cd725 -- as long as the
- The user passes the
row_aliaskeyword argument- https://github.com/dbt-labs/dbt-utils/pull/815/commits/3eced4d100f8e7fe91827678bd321c242718ed94 -- when the
row_aliaskeyword argument is set, then we can deduplicate via therow_number()window function (at the cost of therow_aliasbeing an extra column that wasn't in the original data set)
- https://github.com/dbt-labs/dbt-utils/pull/815/commits/3eced4d100f8e7fe91827678bd321c242718ed94 -- when the
- The user passes the
columnskeyword argument- https://github.com/dbt-labs/dbt-utils/pull/815/commits/d46676e98bd3e0c5f3d4e5056ff83c765720a3a0 -- when
columnskeyword argument is set, then we can deduplicate via therow_number()window function and only return the requested columns
- https://github.com/dbt-labs/dbt-utils/pull/815/commits/d46676e98bd3e0c5f3d4e5056ff83c765720a3a0 -- when
Outside of those options, the deduplication will not be null-safe.
Option 1
models/my_model_1.sql
{{ config(materialized="table") }}
select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version
models/deduped_1.sql
{{
dbt_utils.deduplicate(
ref('my_model_1'),
partition_by='user_id, created_at',
order_by='version desc'
) | indent
}}
dbt build -s +deduped_1
dbt show -s deduped_1
| user_id | created_at | version |
|---|---|---|
| 1 | 2 |
Option 2
models/my_model_2.sql
{{ config(materialized="ephemeral") }}
select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version
models/deduped_2.sql
{{
dbt_utils.deduplicate(
ref('my_model_2'),
partition_by='user_id, created_at',
order_by='version desc',
row_alias='rn'
) | indent
}}
dbt build -s +deduped_2
dbt show -s deduped_2
| user_id | created_at | version | rn |
|---|---|---|---|
| 1 | 2 | 1 |
Option 3
models/my_model_3.sql
{{ config(materialized="ephemeral") }}
select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version
models/deduped_3.sql
{{
dbt_utils.deduplicate(
ref('my_model_3'),
partition_by='user_id, created_at',
order_by='version desc',
columns=['user_id', 'created_at', 'version']
) | indent
}}
dbt build -s +deduped_3
dbt show -s deduped_3
| user_id | created_at | version |
|---|---|---|
| 1 | 2 |
Option 4
models/my_model_4.sql
{{ config(materialized="ephemeral") }}
select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version
[!WARNING]
This is the one not guaranteed to be null-safe (depending on the adapter).
models/deduped_4.sql
{{
dbt_utils.deduplicate(
ref('my_model_4'),
partition_by='user_id, created_at',
order_by='version desc',
) | indent
}}
dbt build -s +deduped_4
dbt show -s deduped_4
Here's the warning that will be logged:
Warning: the implementation of the `deduplicate` macro for the `postgres` adapter is not null safe.
Set `row_alias` within calls to `deduplicate` to achieve null safety (which will also add it as an extra column to the output).
e.g.,
{
dbt_utils.deduplicate(
'my_cte',
partition_by='user_id',
order_by='version desc',
row_alias='rn'
) | indent
}
Warning triggered by model: my_project.deduped_4
dbt project / package: my_project
path: models/deduped_4.sql
| user_id | created_at | version |
|---|
Key history of deduplicate macro
- https://github.com/dbt-labs/dbt-utils/issues/335
- https://github.com/dbt-labs/dbt-utils/pull/512
- https://github.com/dbt-labs/dbt-utils/pull/526
- https://github.com/dbt-labs/dbt-utils/pull/548
Checklist
- [x] I followed guidelines to ensure that my changes will work on "non-core" adapters
- [ ] 👈 I have updated the README.md (if applicable)
- [ ] 👈 I have added tests & descriptions to my models (and macros if applicable)
- [ ] 👈 I have added an entry to CHANGELOG.md
This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.