spark [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

What changes were proposed in this pull request?

This adds UNPIVOT clause to SQL syntax. It follows the same syntax as BigQuery, T-SQL, Oracle:

FROM ... [ unpivot_clause ]

unpivot_clause:
    UNPIVOT [ { INCLUDE | EXCLUDE } NULLS ] (
        { single_value_column_unpivot | multi_value_column_unpivot }
    ) [[AS] alias]

single_value_column_unpivot:
    values_column
    FOR name_column
    IN (unpivot_column [, ...])

multi_value_column_unpivot:
    (values_column [, ...])
    FOR name_column
    IN ((unpivot_column [, ...]) [[AS] alias] [, ...])

unpivotColumn:
    namedExpression

For example:

CREATE TABLE sales_quarterly (year INT, q1 INT, q2 INT, q3 INT, q4 INT);
INSERT INTO sales_quarterly VALUES
    (2020, null, 1000, 2000, 2500),
    (2021, 2250, 3200, 4200, 5900),
    (2022, 4200, 3100, null, null);

SELECT * FROM sales_quarterly
    UNPIVOT (
        sales FOR quarter IN (q1, q2, q3, q4)
    );

SELECT up.* FROM sales_quarterly
    UNPIVOT EXCLUDE NULLS (
        sales FOR quarter IN (q1 AS `Q1`, q2 AS `Q2`, q3 AS `Q3`, q4 AS `Q4`)
    ) AS `up`;

SELECT * FROM sales_quarterly
    UNPIVOT EXCLUDE NULLS (
        (first_quarter, second_quarter)
        FOR half_of_the_year IN (
            (q1, q2) AS `H1`,
            (q3, q4) AS `H2`
        )
    );

Why are the changes needed?

To support Dataset.unpivot in SQL queries.

Does this PR introduce any user-facing change?

Yes, adds UNPIVOT to SQL syntax.

How was this patch tested?

Added end-to-end tests to SQLQueryTestSuite.

Aug 04 '22 15:08 EnricoMi

Can one of the admins verify this patch?

Aug 04 '22 18:08 AmplabJenkins

cc @maryannxue FYI

Aug 05 '22 00:08 HyukjinKwon

@cloud-fan @MaxGekk @HyukjinKwon @gengliangwang @zhengruifeng what do you think?

Aug 27 '22 11:08 EnricoMi

Can we write down the SQL spec for this syntax in the PR description? To make it easier for people to review the syntax and understand the semantic.

Aug 29 '22 16:08 cloud-fan

Can we write down the SQL spec for this syntax in the PR description? To make it easier for people to review the syntax and understand the semantic.

I have added the syntax and examples from docs/sql-ref-syntax-qry-select-unpivot.md to the PR description.

Aug 30 '22 15:08 EnricoMi

@cloud-fan I have introduced expression UnpivotExpr to replace the (Seq[NamedExpression], Option[String]]), which makes code more readable.

But, this introduces the following change in behaviour / deviation from projection behaviour:

spark.range(5).select(struct($"id").as("an")).select($"an.id").show()

"an.id" gets alias "id":

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

Project(UnresolvedAttribute("an.id"), plan)
  --> ResolveReferences rule -->
Project(Alias(GetStructField(an#2.id), "id"), plan)

spark.range(5).select(struct($"id").as("an")).unpivot(Array($"an.id"), Array($"an.id"), "col", "val").show()

before introducing UnpivotExpr, both ids and values get alias "id" (as in select / Project):

+---+---+---+
| id|col|val|
+---+---+---+
|  0| id|  0|
|  1| id|  1|
|  2| id|  2|
|  3| id|  3|
|  4| id|  4|
+---+---+---+

after introducing UnpivotExpr, id "str.id" gets alias "id", value "str.id" does not get an alias and hence gets name "an.id":

+---+-----+---+
| id|  col|val|
+---+-----+---+
|  0|an.id|  0|
|  1|an.id|  1|
|  2|an.id|  2|
|  3|an.id|  3|
|  4|an.id|  4|
+---+-----+---+

Now that UnpivotExpr is the top level expression, inner UnresolvedAttribute / GetStructField does not get an alias:

Unpivot(Seq(UnresolvedAttribute("an.id")), Seq(UnpivotExpr(Seq(UnresolvedAttribute("an.id")), ...)), ..., plan)
  --> ResolveReferences -->
Unpivot(Seq(Alias(GetStructField(an#2.id), "id")), Seq(UnpivotExpr(Seq(GetStructField(an#2.id)), ...)), ..., plan)

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1770

CleanupAliases rule is not the reason, the alias is being removed inside ResolveReferences.

The only way to get to the old behaviour is a special treatment of UnpivotExpr in QueryPlan.mapExpressions.recursiveTransform: https://github.com/apache/spark/pull/37407/commits/9dd66b78ec817a53325d95900f18198dac9bc3b1#diff-ece55283a94dd23d3c04f8b9d8ae35937ccff67724be690ff30f76e9f8093c6eR211

Sep 16 '22 08:09 EnricoMi

All green: https://github.com/G-Research/spark/actions/runs/3196590413/jobs/5220447028

Status update may have failed.

Oct 06 '22 13:10 EnricoMi

Thanks for the excellent code review and guidance!

Oct 06 '22 13:10 EnricoMi

thanks, merging to master!

Oct 07 '22 12:10 cloud-fan

spark spark copied to clipboard

[SPARK-39876][SQL] Add UNPIVOT to SQL syntax

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

spark
spark copied to clipboard