dbt-utils icon indicating copy to clipboard operation
dbt-utils copied to clipboard

Methods to achieve null safety for `deduplicate`

Open dbeatty10 opened this issue 2 years ago • 1 comments

resolves #814 resolves #621

This is a bug fix with no breaking changes.

It also adds two new features:

  • optional row_alias keyword argument (type: string, default: none)
  • optional columns keyword argument (type: list, default: none)

Description & motivation

This PR is still in draft status, and more description will be added at a later date.

In the meantime, see #814 (and everything it links to, in particular #713) for background motivation and discussion to-date.

As a summary, this PR gives the user multiple options to achieve null safety for deduplicate:

  1. The user passes a relation that has been materialized in the database (i.e., it isn't an ephemeral model)
    • https://github.com/dbt-labs/dbt-utils/pull/815/commits/fe03f43cc5f57d86006addfa1770ab78230cd725 -- as long as the relation is not a CTE, it's columns can be fetched via the get_filtered_columns_in_relation macro
  2. The user passes the row_alias keyword argument
    • https://github.com/dbt-labs/dbt-utils/pull/815/commits/3eced4d100f8e7fe91827678bd321c242718ed94 -- when the row_alias keyword argument is set, then we can deduplicate via the row_number() window function (at the cost of the row_alias being an extra column that wasn't in the original data set)
  3. The user passes the columns keyword argument
    • https://github.com/dbt-labs/dbt-utils/pull/815/commits/d46676e98bd3e0c5f3d4e5056ff83c765720a3a0 -- when columns keyword argument is set, then we can deduplicate via the row_number() window function and only return the requested columns

Outside of those options, the deduplication will not be null-safe.

Option 1

models/my_model_1.sql

{{ config(materialized="table") }}

select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version

models/deduped_1.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_1'),
            partition_by='user_id, created_at',
            order_by='version desc'
        ) | indent
    }}
dbt build -s +deduped_1
dbt show -s deduped_1
user_id created_at version
1 2

Option 2

models/my_model_2.sql

{{ config(materialized="ephemeral") }}

select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version

models/deduped_2.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_2'),
            partition_by='user_id, created_at',
            order_by='version desc',
            row_alias='rn'
        ) | indent
    }}
dbt build -s +deduped_2
dbt show -s deduped_2
user_id created_at version rn
1 2 1

Option 3

models/my_model_3.sql

{{ config(materialized="ephemeral") }}

select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version

models/deduped_3.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_3'),
            partition_by='user_id, created_at',
            order_by='version desc',
            columns=['user_id', 'created_at', 'version']
        ) | indent
    }}
dbt build -s +deduped_3
dbt show -s deduped_3
user_id created_at version
1 2

Option 4

models/my_model_4.sql

{{ config(materialized="ephemeral") }}

select 1 as user_id, cast(null as date) as created_at, 1 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version union all
select 1 as user_id, cast(null as date) as created_at, 2 as version

[!WARNING]
This is the one not guaranteed to be null-safe (depending on the adapter).

models/deduped_4.sql

    {{
        dbt_utils.deduplicate(
            ref('my_model_4'),
            partition_by='user_id, created_at',
            order_by='version desc',
        ) | indent
    }}
dbt build -s +deduped_4
dbt show -s deduped_4

Here's the warning that will be logged:

Warning: the implementation of the `deduplicate` macro for the `postgres` adapter is not null safe. 
Set `row_alias` within calls to `deduplicate` to achieve null safety (which will also add it as an extra column to the output).

e.g.,
    {
        dbt_utils.deduplicate(
            'my_cte',
            partition_by='user_id',
            order_by='version desc',
            row_alias='rn'
        ) | indent
    }

Warning triggered by model: my_project.deduped_4
dbt project / package: my_project
path: models/deduped_4.sql
user_id created_at version

Key history of deduplicate macro

  • https://github.com/dbt-labs/dbt-utils/issues/335
  • https://github.com/dbt-labs/dbt-utils/pull/512
  • https://github.com/dbt-labs/dbt-utils/pull/526
  • https://github.com/dbt-labs/dbt-utils/pull/548

Checklist

  • [x] I followed guidelines to ensure that my changes will work on "non-core" adapters
  • [ ] 👈 I have updated the README.md (if applicable)
  • [ ] 👈 I have added tests & descriptions to my models (and macros if applicable)
  • [ ] 👈 I have added an entry to CHANGELOG.md

dbeatty10 avatar Jul 26 '23 17:07 dbeatty10

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

github-actions[bot] avatar Jan 29 '24 01:01 github-actions[bot]