duckdb Support PIVOT statement

We want to support PIVOT/UNPIVOT in DuckDB. The base syntax could be similar to the syntax used in SQL server, e.g.:

SELECT *
FROM  Product
PIVOT  
(  
  AVG(StandardCost)  
  FOR DaysToManufacture IN ([0], [1], [2], [3], [4])  
)

Cost_Sorted_By_Production_Days 0           1           2           3           4         
------------------------------ ----------- ----------- ----------- ----------- -----------
AverageCost                    5.0885      223.88      359.1082    NULL

The main problem here is that the user needs to specify the values of the column manually. This is necessary since the database engine needs to know which columns are present in a result set, which would otherwise be data dependent.

Create Pivot

An alternative solution to this would be adding an explicit CREATE PIVOT statement that gathers the required values in a separate stage. The pivot can then be used inside queries, e.g.:

CREATE PIVOT product_pivot AS Product(DaysToManufacture);

SELECT *
FROM  Product
PIVOT  
(  
  AVG(StandardCost)  
  USING product_pivot
)

When the base table (Product) changes, the pivot element should either be invalidated or any new values should simply be ignored. Likely the second option (new values will be ignored).

We could also add a top-level short hand that does both the CREATE PIVOT and the pivot statement, e.g.:

-- short hand for the above two SQL queries
PIVOT Product SELECT AVG(StandardCost) USING DaysToManufacture

Unpivot

We would also need an equivalent UNPIVOT statement:

SELECT *
FROM   
   (SELECT VendorID, Emp1, Emp2, Emp3, Emp4, Emp5  
   FROM pvt) p  
UNPIVOT  
   (Orders FOR Employee IN   
      (Emp1, Emp2, Emp3, Emp4, Emp5)  
)AS unpvt;

Jul 28 '21 06:07 Mytherin

Pivoting/unpivoting on multiple fields is also useful. Here is a short example I like to use. The syntax is from my textform tool:

Fold(p, ('Sales 1992', 'Sales 1993', 'Sales 1994',), ('Year', 'Sales',))
Fold(p, ('Sales 1992', 'Sales 1993', 'Sales 1994', 'Profit 1992', 'Profit 1993', 'Profit 1994',), ('Year', 'Sales', 'Profit',))

Unfold(p, ('Year', 'Sales',), ('Sales 1992', 'Sales 1993', 'Sales 1994',))
Unfold(p, ('Year', 'Sales', 'Profit',), ('Sales 1992', 'Sales 1993', 'Sales 1994', 'Profit 1992', 'Profit 1993', 'Profit 1994',))

Jul 28 '21 18:07 hawkfish

The Pivot syntax you outlined as shorthand sounds really powerful! Handling dynamic pivoted-out columns is really the key super power most DBs lack.

The non-shorthand syntax aligns well with SQL Server and it looks like Oracle also. However, that syntax still feels a bit tricky to me (just my opinion!). The way that fits best in my brain is more of a pivot function approach, somewhat like PostgreSQL's crosstab (which I don't particularly like!), but closer to Excel and Pandas. Essentially, specifying the columns you want on the rows, the columns you want on the columns, the columns to summarize, and the summary metrics (maybe the summary values and metrics can be combined Ex: AVG(column_x)?). The best place for that might be in the Python relational API, but I just figured I would throw this out there as an idea, and draw the comparison in approach to PostgreSQL's crosstab. That is at least one example of a function approach vs. the Pivot clause.

The approach you outlined would still be super powerful though!

Jul 28 '21 18:07 Alex-Monahan

This syntax is from a Python library and is a bit tricky/close to the metal to use - it was more to illustrate functionality than syntax! But maybe we can extend what @Mytherin suggested for UNPIVOT

SELECT *
FROM   
   (SELECT Employee, "Sales 1992", "Sales 1993", "Sales 1994", "Profit 1992", "Profit 1993", "Profit 1994" 
   FROM pvt) p  
UNPIVOT  (
   Years FOR 
        Sales IN  ("Sales 1992", "Sales 1993", "Sales 1994"), 
        Profit IN ("Profit 1992", "Profit 1993", "Profit 1994" )
) AS unpvt;

Unpivoting (folding in the cleaning syntax, yes it is confusing) is pretty straightforward, but as you say, pivoting (unfolding) needs to summarise the unreferenced columns somehow. I didn't deal with that in textform because I was assuming that the unreferenced values were identical, so that was effectively using FIRST.

Jul 28 '21 19:07 hawkfish

I actually made a pivot statement using duckdb a year ago, see with some examples at https://github.com/luxint/duckdb

This was using a lisp but for duckdb a natural and easy to understand syntax would be something like : PIVOT SELECT * FROM sales ROWS(region, country) COLUMNS(order_year ) VALUES(count(order_no) as orders, sum(revenue) as revenue), so like Alex-Monaham is also saying, mentioning which fields you want on the rows and which on the columns and then how you want to aggregate your values.

Jul 29 '21 19:07 luxint

Nice! Being able to pivot data and do crosstabs is at the core of OLAP, in my experience. The most natural way for me (after trying various previously available options with MSSQL, pgsql and mysql/mariadb) is not to use the (non-ansi92 standard PIVOT) statements (sometimes) supported by some of those engines. Mostly because it becomes cumbersome to use dynamic column values, resulting in long queries, difficult to maintain.

I understand the wish to support existing PIVOT syntax as supported elsewhere already but wonder if there could be more powerful and better solutions, such as the one shown by @luxint at https://github.com/luxint/duckdb (added link again because the one in the earlier comment did not lead to the right repo when clicked).

For my part, these days I prefer using the API supported by tidyr for pivotting. There is a great write-up here https://tidyr.tidyverse.org/articles/pivot.html for most various common use cases for reshaping tabular data there with PIVOT_WIDER and PIVOT_LONGER.

In my opinion is much more convenient to work with this API and there is mostly no need to hard code column names or field values, and it feels more natural to work stepwise with reshaping data and aggregations.

Jul 31 '21 15:07 mskyttner

Nice! Being able to pivot data and do crosstabs is at the core of OLAP, in my experience. The most natural way for me (after trying various previously available options with MSSQL, pgsql and mysql/mariadb) is not to use the (non-ansi92 standard PIVOT) statements (sometimes) supported by some of those engines. Mostly because it becomes cumbersome to use dynamic column values, resulting in long queries, difficult to maintain.

I understand the wish to support existing PIVOT syntax as supported elsewhere already but wonder if there could be more powerful and better solutions, such as the one shown by @luxint at https://github.com/luxint/duckdb (added link again because the one in the earlier comment did not lead to the right repo when clicked).

For my part, these days I prefer using the API supported by tidyr for pivotting. There is a great write-up here https://tidyr.tidyverse.org/articles/pivot.html for most various common use cases for reshaping tabular data there with PIVOT_WIDER and PIVOT_LONGER.

In my opinion is much more convenient to work with this API and there is mostly no need to hard code column names or field values, and it feels more natural to work stepwise with reshaping data and aggregations.

Just wanted wholly support this comment. I agree. The pivot syntax linked above is best.

Jul 09 '22 19:07 alexhallam

Some more votes for the names pivot_longer and pivot_wider. See Discord discussion.

Oct 26 '22 15:10 rsund

So, IMO, pivoting is a rendering thing. The real issue is limiting the dataset so that you can fill a grid appropriately. When pivoting two high cardinality dimensions (or sets of dimensions), you need to guarantee that you can fill the table.

Say, you want the top 3 customers by the top 3 products. Pivoting all your customers or your products will certainly blow up the column limit of things. How do you return a 3 x 3 grid that has the right data? Rendering isn't the problem. Easily getting to the data is the problem.

Adding a modifier to a dimension set is really what you want. The syntax below is probably crap, but basically you need to specify an ordering and a limit for each set of dimensions. You probably want to have an 'OTHER' for values beyond the limit...

SELECT
   customer_id,
   product_id,
   sum(item_value) as revenue
FROM orders 
GROUP BY 3
PIVOT 
   customer_id ORDER BY sum(item_value) LIMIT 3 
BY 
   product_id ORDER BY sum(item_value) LIMIT 3

Nov 30 '22 14:11 lloydtabb

Would using a limit prior to the pivot step achieve something similar? You could do the aggregation, apply the limit, then use the pivot to just move the values out onto the columns. Would that work ok for your specific use case?

Nov 30 '22 15:11 Alex-Monahan

This is implemented in #6387 now

Feb 21 '23 20:02 Mytherin