SQL-compatible UNION behavior (e.g., column-based)

Open philrz opened this issue 11 months ago • 0 comments

tl;dr

We've reached consensus that the initial GA release of SuperDB will include the UNION ALL support added in #5735, which is in some ways not fully compatible with typical SQL behavior, such as how SuperDB's is effectively "union by name" while SQL's is "union by column".

Details

At the time this issue is being filed, super is at commit a6eae50. The example shown below is likely not the only area where SuperSQL's current UNION behavior differs somewhat from that of traditional SQL.

This is illustrated using an abbreviated example of something from the w3schools SQL tutorial.

Given input data files:

$ cat customers.csv 
City,Country
Berlin,Germany
México D.F.,Mexico
México D.F.,Mexico

$ cat suppliers.csv 
City,Country
Londona,UK
New Orleans,USA

Here's UNION ALL behaving as expected.

$ super -version
Version: v1.18.0-355-ga6eae509

$ super -c "
SELECT City,Country FROM customers.csv
UNION ALL
SELECT City,Country FROM suppliers.csv
ORDER BY City;"

{City:"Berlin",Country:"Germany"}
{City:"México D.F.",Country:"Mexico"}
{City:"México D.F.",Country:"Mexico"}
{City:"Londona",Country:"UK"}
{City:"New Orleans",Country:"USA"}

However, if we swap the order of the fields in the second table, we see the relative order of the fields in the output remains the same as in the respective input tables, and the city values remain with the City field names and country values remain with Country field names.

$ super -c "
SELECT City,Country FROM customers.csv
UNION ALL
SELECT Country,City FROM suppliers.csv
ORDER BY City;"

{City:"Berlin",Country:"Germany"}
{City:"México D.F.",Country:"Mexico"}
{City:"México D.F.",Country:"Mexico"}
{Country:"UK",City:"Londona"}
{Country:"USA",City:"New Orleans"}

By comparison, in a tool like DuckDB that follows traditional SQL behavior, the names in the output table are based on the first input table while the values land in relative column order, such that the country values from the second table are shown under the column header for City.

$ duckdb -c "
SELECT City,Country FROM customers.csv
UNION ALL
SELECT Country,City FROM suppliers.csv
ORDER BY City;"
┌─────────────┬─────────────┐
│    City     │   Country   │
│   varchar   │   varchar   │
├─────────────┼─────────────┤
│ Berlin      │ Germany     │
│ México D.F. │ Mexico      │
│ México D.F. │ Mexico      │
│ UK          │ Londona     │
│ USA         │ New Orleans │
└─────────────┴─────────────┘

Putting aside any debate about which behavior might be more intuitive, since our goal is for SuperSQL to be as Postgres-compatible as possible, we'll want to follow this behavior by default in SuperSQL.

While the following doesn't provide the behavior that matches traditional SQL, it should be noted that tacking on the fuse operator does adjust the relative order of SuperDB's output to appear columnar, though the city and country values remain under their original field names.

$ super -c "
SELECT City,Country FROM customers.csv
UNION ALL
SELECT Country,City FROM suppliers.csv
ORDER BY City
| fuse"

{City:"Berlin",Country:"Germany"}
{City:"México D.F.",Country:"Mexico"}
{City:"México D.F.",Country:"Mexico"}
{City:"Londona",Country:"UK"}
{City:"New Orleans",Country:"USA"}

Apr 01 '25 20:04 philrz