spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-39876][SQL] Add UNPIVOT to SQL syntax

Open EnricoMi opened this issue 2 years ago • 6 comments

What changes were proposed in this pull request?

This adds UNPIVOT clause to SQL syntax. It follows the same syntax as BigQuery, T-SQL, Oracle:

FROM ... [ unpivot_clause ]

unpivot_clause:
    UNPIVOT [ { INCLUDE | EXCLUDE } NULLS ] (
        { single_value_column_unpivot | multi_value_column_unpivot }
    ) [[AS] alias]

single_value_column_unpivot:
    values_column
    FOR name_column
    IN (unpivot_column [, ...])

multi_value_column_unpivot:
    (values_column [, ...])
    FOR name_column
    IN ((unpivot_column [, ...]) [[AS] alias] [, ...])

unpivotColumn:
    namedExpression

For example:

CREATE TABLE sales_quarterly (year INT, q1 INT, q2 INT, q3 INT, q4 INT);
INSERT INTO sales_quarterly VALUES
    (2020, null, 1000, 2000, 2500),
    (2021, 2250, 3200, 4200, 5900),
    (2022, 4200, 3100, null, null);

SELECT * FROM sales_quarterly
    UNPIVOT (
        sales FOR quarter IN (q1, q2, q3, q4)
    );

SELECT up.* FROM sales_quarterly
    UNPIVOT EXCLUDE NULLS (
        sales FOR quarter IN (q1 AS `Q1`, q2 AS `Q2`, q3 AS `Q3`, q4 AS `Q4`)
    ) AS `up`;

SELECT * FROM sales_quarterly
    UNPIVOT EXCLUDE NULLS (
        (first_quarter, second_quarter)
        FOR half_of_the_year IN (
            (q1, q2) AS `H1`,
            (q3, q4) AS `H2`
        )
    );

Why are the changes needed?

To support Dataset.unpivot in SQL queries.

Does this PR introduce any user-facing change?

Yes, adds UNPIVOT to SQL syntax.

How was this patch tested?

Added end-to-end tests to SQLQueryTestSuite.

EnricoMi avatar Aug 04 '22 15:08 EnricoMi

Can one of the admins verify this patch?

AmplabJenkins avatar Aug 04 '22 18:08 AmplabJenkins

cc @maryannxue FYI

HyukjinKwon avatar Aug 05 '22 00:08 HyukjinKwon

@cloud-fan @MaxGekk @HyukjinKwon @gengliangwang @zhengruifeng what do you think?

EnricoMi avatar Aug 27 '22 11:08 EnricoMi

Can we write down the SQL spec for this syntax in the PR description? To make it easier for people to review the syntax and understand the semantic.

cloud-fan avatar Aug 29 '22 16:08 cloud-fan

Can we write down the SQL spec for this syntax in the PR description? To make it easier for people to review the syntax and understand the semantic.

I have added the syntax and examples from docs/sql-ref-syntax-qry-select-unpivot.md to the PR description.

EnricoMi avatar Aug 30 '22 15:08 EnricoMi

@cloud-fan I have introduced expression UnpivotExpr to replace the (Seq[NamedExpression], Option[String]]), which makes code more readable.

But, this introduces the following change in behaviour / deviation from projection behaviour:

spark.range(5).select(struct($"id").as("an")).select($"an.id").show()

"an.id" gets alias "id":

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+
Project(UnresolvedAttribute("an.id"), plan)
  --> ResolveReferences rule -->
Project(Alias(GetStructField(an#2.id), "id"), plan)
spark.range(5).select(struct($"id").as("an")).unpivot(Array($"an.id"), Array($"an.id"), "col", "val").show()

before introducing UnpivotExpr, both ids and values get alias "id" (as in select / Project):

+---+---+---+
| id|col|val|
+---+---+---+
|  0| id|  0|
|  1| id|  1|
|  2| id|  2|
|  3| id|  3|
|  4| id|  4|
+---+---+---+

after introducing UnpivotExpr, id "str.id" gets alias "id", value "str.id" does not get an alias and hence gets name "an.id":

+---+-----+---+
| id|  col|val|
+---+-----+---+
|  0|an.id|  0|
|  1|an.id|  1|
|  2|an.id|  2|
|  3|an.id|  3|
|  4|an.id|  4|
+---+-----+---+

Now that UnpivotExpr is the top level expression, inner UnresolvedAttribute / GetStructField does not get an alias:

Unpivot(Seq(UnresolvedAttribute("an.id")), Seq(UnpivotExpr(Seq(UnresolvedAttribute("an.id")), ...)), ..., plan)
  --> ResolveReferences -->
Unpivot(Seq(Alias(GetStructField(an#2.id), "id")), Seq(UnpivotExpr(Seq(GetStructField(an#2.id)), ...)), ..., plan)

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1770

CleanupAliases rule is not the reason, the alias is being removed inside ResolveReferences.

The only way to get to the old behaviour is a special treatment of UnpivotExpr in QueryPlan.mapExpressions.recursiveTransform: https://github.com/apache/spark/pull/37407/commits/9dd66b78ec817a53325d95900f18198dac9bc3b1#diff-ece55283a94dd23d3c04f8b9d8ae35937ccff67724be690ff30f76e9f8093c6eR211

EnricoMi avatar Sep 16 '22 08:09 EnricoMi

All green: https://github.com/G-Research/spark/actions/runs/3196590413/jobs/5220447028

Status update may have failed.

EnricoMi avatar Oct 06 '22 13:10 EnricoMi

Thanks for the excellent code review and guidance!

EnricoMi avatar Oct 06 '22 13:10 EnricoMi

thanks, merging to master!

cloud-fan avatar Oct 07 '22 12:10 cloud-fan