Default implementation of `deduplicate` is not null-safe
### Describe the bug

@belasobral93 discovered that `deduplicate` doesn't work for Spark when any of the `order_by` columns are null.
### Root cause

The root cause is that Spark defaults to `NULLS FIRST` as the null sort order for the `ORDER BY` clause.
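Spark isn't easy to spin up in a snippet, but SQLite happens to share the same NULLS-first default for ascending sorts, so it can stand in for a quick illustration (this is a sketch of the sort behaviour, not the actual dbt macro):

```python
import sqlite3

# SQLite, like Spark, sorts NULLs first in an ascending ORDER BY,
# so a NULL value "wins" any ascending sort over real values.
conn = sqlite3.connect(":memory:")
rows = conn.execute(
    "with t(v) as (values ('a'), ('b'), (NULL)) "
    "select v from t order by v"
).fetchall()
print(rows)  # the NULL row sorts first: [(None,), ('a',), ('b',)]
```

That NULL-wins behaviour is exactly what puts the NULL row at `row_number() = 1` in the deduplication query below.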
### Why dbt_utils rather than spark_utils?

But `dbt_utils` doesn't officially support Spark, so why is this being reported here (instead of `spark_utils`)?
`dbt_utils` provides five implementations of `deduplicate`:
- `default`
- `redshift`
- `postgres`
- `snowflake`
- `bigquery`
Since `dbt_utils` only tests against the four adapter-specific databases listed above, and each of those has an override, there is no test coverage for the `default` implementation that other dbt adapters (like Spark) inherit.
### Steps to reproduce

#### SQL example

The current implementation essentially acts like the following example:
```sql
with relation as (
    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by
),

row_numbered as (
    select
        _inner.*,
        row_number() over (
            partition by partition_by
            order by order_by
        ) as rn
    from relation as _inner
)

select distinct relation.*
from relation
natural join row_numbered
where row_numbered.rn = 1
;
```
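The repro can be run as-is against SQLite (3.25+ for window functions), which mimics Spark's NULLS-first default; this is a stand-in demonstration, not a Spark run:

```python
import sqlite3

# The issue's repro query, verbatim. With NULLS sorting first, the NULL
# row takes rn = 1 in partition 2, and the natural join then drops that
# row entirely because NULL never equals NULL.
REPRO = """
with relation as (
    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by
),
row_numbered as (
    select
        _inner.*,
        row_number() over (
            partition by partition_by
            order by order_by
        ) as rn
    from relation as _inner
)
select distinct relation.*
from relation
natural join row_numbered
where row_numbered.rn = 1
"""

conn = sqlite3.connect(":memory:")
result = conn.execute(REPRO).fetchall()
print(result)  # [(1, 'a')] -- partition 2 is lost
```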
#### Expected results
| partition_by | order_by |
|---|---|
| 1 | a |
| 2 | a |
#### Actual results
| partition_by | order_by |
|---|---|
| 1 | a |
### Screenshots and log output

Not provided.

### System information

Not provided.
### Which database are you using dbt with?
- [ ] postgres
- [ ] redshift
- [ ] bigquery
- [ ] snowflake
- [x] other: spark
The output of `dbt --version`:
Not provided.
### Additional context

To make the fix, it might be as simple as:
- adding `nulls last` after `order_by` (when an `order by` clause exists, of course)

To confirm the bug and establish sufficient test cases, we could:
- comment out all adapter-specific overrides for `deduplicate`
- re-run the test suite
- add test cases as needed
- re-run the test suite
- make fixes to the implementation(s)
- re-run the test suite
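The suggested fix applied to the repro can be sketched the same way (SQLite also accepts `nulls last`, since 3.30; again a stand-in, not the actual macro change):

```python
import sqlite3

# Same query as the repro, with the proposed `nulls last` added to the
# window's ORDER BY. The NULL row now ranks last, so the real values
# win rn = 1 and partition 2 survives the natural join.
FIXED = """
with relation as (
    select 1 as partition_by, 'a' as order_by
    union all
    select 1 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'a' as order_by
    union all
    select 2 as partition_by, 'b' as order_by
    union all
    select 2 as partition_by, NULL as order_by
),
row_numbered as (
    select
        _inner.*,
        row_number() over (
            partition by partition_by
            order by order_by nulls last
        ) as rn
    from relation as _inner
)
select distinct relation.*
from relation
natural join row_numbered
where row_numbered.rn = 1
"""

conn = sqlite3.connect(":memory:")
rows = conn.execute(FIXED).fetchall()
print(sorted(rows))  # [(1, 'a'), (2, 'a')] -- the expected results
```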
### Are you interested in contributing the fix?

I will contribute the fix.
> To make the fix, it might be as simple as:
> - adding `nulls last` after `order_by` (when an `order by` clause exists, of course)
Is `nulls [last|first]` a well-known part of the SQL spec? I've never heard of it, and a bit of quick Googling implies that it's specific-ish.
A couple of things that are only in my head/true by convention but not documented:

- Deviations from the norm (`default__` implementations) should be handled in an override. (Strongly held)
- "The norm" is vanilla-flavoured things that can reasonably be expected of pretty much everything.¹ (Pretty strongly held)
- I tend to assume that Postgres is vanilla-flavoured, as the longest-lived and only-OSS DB in the original core four. (Not at all strongly held, but don't have a better idea)

¹ Lookin' at you, BigQuery types
The approach that would be truest to points 1 and 3 would be:

- The `default` implementation changes to Postgres'
- Anyone who can't do that builds their own

Which is right if we want a globally coherent and understandable set of principles. It's not super pragmatic though; it appears that 10 data platforms support our current implementation, and as far as I can see most of them would not support Postgres', because they don't have `distinct on`.
In my opinion, "Vanilla as default" is more important than "every default implementation is what would work best in Postgres". (In fact, this might be a case where the default implementation works on PGSQL, but there is a special optimisation that led to the custom version? I haven't checked).
So that's a lot of words to say that if MySQL, Redshift, Materialize, and SingleStore don't understand `nulls last`, then we are inconveniencing some new adapters by throwing it in, and maybe we're better off just specifying `nulls last` in the Spark version.
Related (in that it's also a deduplicate issue): https://github.com/dbt-labs/dbt-utils/issues/713
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Anyone have any workarounds for this?
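One portable workaround, for engines that don't accept `nulls last` syntax: sort on an "is null" flag before the column itself. A quick sketch in SQLite (the same trick works on most platforms where a boolean/integer expression is orderable; otherwise spell it `case when x is null then 1 else 0 end`):

```python
import sqlite3

# `v is null` evaluates to 0 for real values and 1 for NULLs, so
# ordering by it first pushes NULLs to the end without `nulls last`.
conn = sqlite3.connect(":memory:")
rows = conn.execute(
    "with t(v) as (values ('b'), (NULL), ('a')) "
    "select v from t order by (v is null), v"
).fetchall()
print(rows)  # [('a',), ('b',), (None,)]
```

If I'm reading the macro right, `deduplicate`'s `order_by` argument is injected into the generated `ORDER BY` as-is, so passing an expression like this (or `my_col nulls last`, where the adapter accepts it) may work as a caller-side workaround; treat that as an assumption to verify against your dbt_utils version.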