[SPARK-52462] [SQL] Enforce type coercion before children output deduplication in Union

Open mihailoale-db opened this issue 6 months ago • 1 comments

What changes were proposed in this pull request?

Right now, the following query produces plans that are not consistent across different underlying table providers. Query:

SELECT col1, col2, col3, NULLIF('','') AS col4
FROM table
UNION ALL
SELECT col2, col2, null AS col3, col4
FROM table;

This happens because of rule ordering:

  • Sometimes: WidenSetOperationTypes -> ... -> ResolveReferences (deduplication of Union children outputs)
  • Sometimes: ResolveReferences (deduplication of Union children outputs) -> ... -> WidenSetOperationTypes

In this issue I propose that we align those two by enforcing type coercion to happen before deduplication.
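
To illustrate why the order matters, here is a minimal toy model in Python. It is not Spark's implementation: the `Child` class, the `widen_types` and `dedup_outputs` functions, the `Project[...]` labels, and the column types are all hypothetical, and the toy deduplicates by column name for simplicity. It only shows that two rewrites which each wrap a Union child in a projection can yield the same output columns but differently shaped plans, depending on which one runs first.

```python
from dataclasses import dataclass
from typing import Tuple

Column = Tuple[str, str]  # (name, type); the types used below are assumed, not from the PR


@dataclass(frozen=True)
class Child:
    """Toy Union child: projection wrappers (innermost first) plus output columns."""
    wrappers: Tuple[str, ...]
    columns: Tuple[Column, ...]


def widen_types(child: Child, target_types: Tuple[str, ...]) -> Child:
    """Toy type coercion: wrap the child in a cast projection if any type differs."""
    current = tuple(typ for _, typ in child.columns)
    if current == target_types:
        return child
    cols = tuple((name, target) for (name, _), target in zip(child.columns, target_types))
    return Child(child.wrappers + ("Project[cast]",), cols)


def dedup_outputs(child: Child) -> Child:
    """Toy deduplication: wrap the child in an aliasing projection if a name repeats."""
    names = [name for name, _ in child.columns]
    if len(set(names)) == len(names):
        return child
    seen = {}
    cols = []
    for name, typ in child.columns:
        count = seen.get(name, 0)
        seen[name] = count + 1
        cols.append((name if count == 0 else f"{name}#{count}", typ))
    return Child(child.wrappers + ("Project[alias]",), tuple(cols))


# Second Union branch of the query: SELECT col2, col2, null AS col3, col4
second_child = Child(
    wrappers=(),
    columns=(("col2", "int"), ("col2", "int"), ("col3", "null"), ("col4", "string")),
)
# Hypothetical widened types for the four Union columns.
target = ("int", "int", "string", "string")

coercion_first = dedup_outputs(widen_types(second_child, target))
dedup_first = widen_types(dedup_outputs(second_child), target)

print(coercion_first.wrappers)
print(dedup_first.wrappers)
```

In this sketch both orders end with identical output columns, but the projections are nested in opposite order, so the two plans are not equal. Fixing one order removes that source of variation.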

Why are the changes needed?

To make UNION produce consistent plans across different underlying table providers.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added tests + existing ones.

Was this patch authored or co-authored using generative AI tooling?

No.

mihailoale-db avatar Jun 12 '25 19:06 mihailoale-db

@mihailoale-db can you say more about how your example query gets a different type coercion result with different rule order? Let's describe "not consistent" clearly here.

cloud-fan avatar Jun 13 '25 23:06 cloud-fan

@cloud-fan Some third-party data sources may add custom analyzer rules that change the rule order here. Delta Lake is an example. Let me mention that in the description. Thanks!

mihailoale-db avatar Jun 16 '25 08:06 mihailoale-db

@cloud-fan all the tests passed. PTAL when you have time. Thanks!

mihailoale-db avatar Jun 23 '25 11:06 mihailoale-db

thanks, merging to master!

cloud-fan avatar Jun 23 '25 12:06 cloud-fan