Default implementation of `deduplicate` is not null-safe
### Describe the bug

@belasobral93 discovered that `deduplicate` doesn't work for Spark when any of the `order_by` columns are null.
### Root cause

The root cause is that Spark defaults to `NULLS FIRST` as the null sort order for the `ORDER BY` clause.
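Spark isn't easy to spin up in a snippet, but SQLite happens to share the same NULLS-first default for ascending sorts, so it can stand in for a quick illustration (this is a sketch of the sort behaviour, not the actual dbt macro):

```python
import sqlite3

# SQLite, like Spark, sorts NULLs first in an ascending ORDER BY,
# so a NULL value "wins" any ascending sort over real values.
conn = sqlite3.connect(":memory:")
rows = conn.execute(
    "with t(v) as (values ('a'), ('b'), (NULL)) "
    "select v from t order by v"
).fetchall()
print(rows)  # the NULL row sorts first: [(None,), ('a',), ('b',)]
```

That NULL-wins behaviour is exactly what puts the NULL row at `row_number() = 1` in the deduplication query below.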
### Why dbt_utils rather than spark_utils?

But `dbt_utils` doesn't officially support Spark, so why is this being reported here (instead of `spark_utils`)?
`dbt_utils` provides five implementations of `deduplicate`:
- `default`
- `redshift`
- `postgres`
- `snowflake`
- `bigquery`
Since `dbt_utils` only tests against the four adapter-specific databases listed above, and each of those has an override, there is no test coverage for the `default` implementation that other dbt adapters (like Spark) inherit.
### Steps to reproduce

#### SQL example

The current implementation essentially acts like the following example:
```sql
with relation as (
    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by
),

row_numbered as (
    select
        _inner.*,
        row_number() over (
            partition by partition_by
            order by order_by
        ) as rn
    from relation as _inner
)

select distinct relation.*
from relation
natural join row_numbered
where row_numbered.rn = 1
;
```
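The repro can be run as-is against SQLite (3.25+ for window functions), which mimics Spark's NULLS-first default; this is a stand-in demonstration, not a Spark run:

```python
import sqlite3

# The issue's repro query, verbatim. With NULLS sorting first, the NULL
# row takes rn = 1 in partition 2, and the natural join then drops that
# row entirely because NULL never equals NULL.
REPRO = """
with relation as (
    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by
),
row_numbered as (
    select
        _inner.*,
        row_number() over (
            partition by partition_by
            order by order_by
        ) as rn
    from relation as _inner
)
select distinct relation.*
from relation
natural join row_numbered
where row_numbered.rn = 1
"""

conn = sqlite3.connect(":memory:")
result = conn.execute(REPRO).fetchall()
print(result)  # [(1, 'a')] -- partition 2 is lost
```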
#### Expected results
| partition_by | order_by |
|---|---|
| 1 | a |
| 2 | a |
#### Actual results
| partition_by | order_by |
|---|---|
| 1 | a |
### Screenshots and log output

Not provided.

### System information

Not provided.
### Which database are you using dbt with?
- [ ] postgres
- [ ] redshift
- [ ] bigquery
- [ ] snowflake
- [x] other: spark
The output of `dbt --version`:
Not provided.
### Additional context

To make the fix, it might be as simple as:
- adding `nulls last` after `order_by` (when an `order by` clause exists, of course)

To confirm the bug and establish sufficient test cases, we could:
- comment out all adapter-specific overrides for `deduplicate`
- re-run the test suite
- add test cases as needed
- re-run the test suite
- make fixes to the implementation(s)
- re-run the test suite
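The suggested fix applied to the repro can be sketched the same way (SQLite also accepts `nulls last`, since 3.30; again a stand-in, not the actual macro change):

```python
import sqlite3

# Same query as the repro, with the proposed `nulls last` added to the
# window's ORDER BY. The NULL row now ranks last, so the real values
# win rn = 1 and partition 2 survives the natural join.
FIXED = """
with relation as (
    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by
),
row_numbered as (
    select
        _inner.*,
        row_number() over (
            partition by partition_by
            order by order_by nulls last
        ) as rn
    from relation as _inner
)
select distinct relation.*
from relation
natural join row_numbered
where row_numbered.rn = 1
"""

conn = sqlite3.connect(":memory:")
rows = conn.execute(FIXED).fetchall()
print(sorted(rows))  # [(1, 'a'), (2, 'a')] -- the expected results
```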
### Are you interested in contributing the fix?

I will contribute the fix.
> To make the fix, it might be as simple as:
> - adding `nulls last` after `order_by` (when an `order by` clause exists, of course)
Is `nulls [last|first]` a well-known part of the SQL spec? I've never heard of it, and a bit of quick Googling implies that it's specific-ish.
A couple of things that are only in my head/true by convention but not documented:

- Deviations from the norm (`default__` implementations) should be handled in an override. (Strongly held)
- "The norm" is vanilla-flavoured things that can reasonably be expected of pretty much everything.¹ (Pretty strongly held)
- I tend to assume that Postgres is vanilla-flavoured, as the longest-lived and only-OSS DB in the original core four. (Not at all strongly held, but don't have a better idea)

¹ Lookin' at you, BigQuery types
The approach that would be truest to points 1 and 3 would be:

- The `default` implementation changes to Postgres'
- Anyone who can't do that builds their own

Which is right if we want a globally coherent and understandable set of principles. It's not super pragmatic though; it appears that 10 data platforms support our current implementation, and as far as I can see most of them would not support Postgres', because they don't have `distinct on`.
In my opinion, "Vanilla as default" is more important than "every default implementation is what would work best in Postgres". (In fact, this might be a case where the default implementation works on PGSQL, but there is a special optimisation that led to the custom version? I haven't checked).
So that's a lot of words to say that if MySQL, Redshift, Materialize, and SingleStore don't understand `nulls last`, then we are inconveniencing some new adapters by throwing it in, and maybe we're better off just specifying `nulls last` in the Spark version.
Related (in that it's also a deduplicate issue): https://github.com/dbt-labs/dbt-utils/issues/713
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Anyone have any workarounds for this?
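One portable workaround, for engines that don't accept `nulls last` syntax: sort on an "is null" flag before the column itself. A quick sketch in SQLite (the same trick works on most platforms where a boolean/integer expression is orderable; otherwise spell it `case when x is null then 1 else 0 end`):

```python
import sqlite3

# `v is null` evaluates to 0 for real values and 1 for NULLs, so
# ordering by it first pushes NULLs to the end without `nulls last`.
conn = sqlite3.connect(":memory:")
rows = conn.execute(
    "with t(v) as (values ('b'), (NULL), ('a')) "
    "select v from t order by (v is null), v"
).fetchall()
print(rows)  # [('a',), ('b',), (None,)]
```

If I'm reading the macro right, `deduplicate`'s `order_by` argument is injected into the generated `ORDER BY` as-is, so passing an expression like this (or `my_col nulls last`, where the adapter accepts it) may work as a caller-side workaround; treat that as an assumption to verify against your dbt_utils version.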