SQL: Subqueries
SQL subqueries come in several varieties, but as of super commit 7f23f65, we have yet to add support for even the simplest, e.g.,
$ super -version
Version: 7f23f6560
$ super -c "SELECT 1+(SELECT 1);"
parse error at line 1, column 18:
SELECT 1+(SELECT 1);
=== ^ ===
Details
Repro is with super commit 7f23f65. The query shown above is taken from a sqllogictest. The same sqllogictest file contains many other examples of subqueries.
The same query working as expected in Postgres:
$ psql postgres
psql (17.5 (Homebrew))
Type "help" for help.
postgres=# SELECT 1+(SELECT 1);
?column?
----------
2
(1 row)
Here is verification as of super commit cc6b1e6 of the initial subquery support added in #6070.
Revisiting the simple repro query shown above, we now see:
$ super -version
Version: cc6b1e6c2
$ super -c "SELECT 1+(SELECT 1);"
{"1+(\n unknown operator: *ast.Select\n)":2}
The error-looking `unknown operator` message in the generated field name threw me for a loop initially, but the correct result value of 2 is indeed intact. I've opened separate issue #6073 to track how we might improve there.
At the macro level, the benefits of the change are observable by comparing sqllogictest runs at super commit cf318c7 (right before the merge of #6070) with runs of the same at super commit cc6b1e6 (right after the merge of #6070).
| sqllogictest suite | Total test cases | Parse errors at cf318c7 | Parse errors at cc6b1e6 | Successes at cf318c7 | Successes at cc6b1e6 |
|---|---|---|---|---|---|
| select1 | 1000 | 525 | 117 | 383 | 485 |
| select2 | 1000 | 531 | 100 | 294 | 373 |
| select3 | 3320 | 1792 | 492 | 1003 | 1241 |
With thousands of test cases here, it's difficult to account for every difference, but from previously browsing failures it seemed that most or all of the parse errors observed earlier were due to the lack of subquery support. We can see these have been reduced significantly, with the remaining parse errors likely due to the remaining subquery coverage gaps noted in the #6070 description. We similarly see an increase in the number of test cases that now successfully produce the expected result. Not every newly runnable query succeeds, though, since in some cases getting past the parse phase reveals a new problem to study.
Since more subquery improvements are expected, I'll keep this one open to further verify the additional improvements as they arrive. Thanks @mattnibs!
To mark our continued progress toward comprehensive subquery support, here's an example of the first pass of correlated subquery support in pipe syntax added in #6100, delivering the equivalent of a correlated subquery in SQL.
Here's the baseline in SQL, which uses a correlated subquery to calculate an average salary for each department and place that alongside per-employee details.
$ duckdb --version
v1.3.2 (Ossivalis) 0b83e5d2f6
$ duckdb -c "SELECT * FROM employees.parquet;"
┌───────┬─────────┬─────────┬────────┐
│ id │ name │ dept │ salary │
│ int32 │ varchar │ varchar │ int32 │
├───────┼─────────┼─────────┼────────┤
│ 1 │ Alice │ HR │ 50000 │
│ 2 │ Bob │ HR │ 55000 │
│ 3 │ Cara │ IT │ 70000 │
│ 4 │ Dave │ IT │ 75000 │
│ 5 │ Eve │ Sales │ 60000 │
└───────┴─────────┴─────────┴────────┘
$ duckdb -c "
SELECT e1.name,
e1.dept,
e1.salary,
(
SELECT AVG(e2.salary)
FROM 'employees.parquet' e2
WHERE e2.dept = e1.dept
) AS avg_dept_salary
FROM 'employees.parquet' e1;"
┌─────────┬─────────┬────────┬─────────────────┐
│ name │ dept │ salary │ avg_dept_salary │
│ varchar │ varchar │ int32 │ double │
├─────────┼─────────┼────────┼─────────────────┤
│ Alice │ HR │ 50000 │ 52500.0 │
│ Bob │ HR │ 55000 │ 52500.0 │
│ Cara │ IT │ 70000 │ 72500.0 │
│ Dave │ IT │ 75000 │ 72500.0 │
│ Eve │ Sales │ 60000 │ 60000.0 │
└─────────┴─────────┴────────┴─────────────────┘
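For readers less familiar with correlated subqueries, the per-row computation above can be sketched in plain Python. This is a standalone illustration of the semantics only (the data is copied from the table above; the function name is my own, not part of super or DuckDB):

```python
# Sketch of what the correlated subquery computes: for each employee row,
# the inner query averages the salaries of all rows sharing that row's dept.
employees = [
    {"id": 1, "name": "Alice", "dept": "HR", "salary": 50000},
    {"id": 2, "name": "Bob", "dept": "HR", "salary": 55000},
    {"id": 3, "name": "Cara", "dept": "IT", "salary": 70000},
    {"id": 4, "name": "Dave", "dept": "IT", "salary": 75000},
    {"id": 5, "name": "Eve", "dept": "Sales", "salary": 60000},
]

def avg_dept_salary(dept):
    # Inner "subquery": scan the whole table, keep rows matching the
    # outer row's dept (the correlation), then average their salaries.
    matching = [e["salary"] for e in employees if e["dept"] == dept]
    return sum(matching) / len(matching)

rows = [
    {"name": e["name"], "dept": e["dept"], "salary": e["salary"],
     "avg_dept_salary": avg_dept_salary(e["dept"])}
    for e in employees
]
for r in rows:
    print(r)
```

Note the inner scan runs once per outer row, which is also why engines often rewrite such subqueries as grouped joins.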
In pipe syntax, `unnest` is the key ingredient, as it:
- Brings the per-employee outer value of `dept` into the subquery via the first field in its `<expr>`
- Provides the data for computing our aggregations as a subquery that pulls from our input file, so this becomes the array expected by the second field in `unnest`'s `<expr>`, which is immediately unpacked by `unnest`
$ super -version
Version: 857796aa7
$ super -c "
from employees.parquet
| values {
name,
dept,
salary, avg_dept_salary: (
unnest {outer_dept:dept, data:(from employees.parquet)}
| avg(data.salary) where data.dept==outer_dept
)
}"
{name:"Alice",dept:"HR",salary:50000::int32,avg_dept_salary:52500.}
{name:"Bob",dept:"HR",salary:55000::int32,avg_dept_salary:52500.}
{name:"Cara",dept:"IT",salary:70000::int32,avg_dept_salary:72500.}
{name:"Dave",dept:"IT",salary:75000::int32,avg_dept_salary:72500.}
{name:"Eve",dept:"Sales",salary:60000::int32,avg_dept_salary:60000.}
@mccanne confirmed for me that holding the data as an array in expression context means potentially bumping up the current per-value memory limits, but there may be ways we can improve on this in the future.
Talking this through did highlight one new gotcha: the subquery `(from employees.parquet)` has an effect similar to the one highlighted in the opening text of #6096, in that it returns an array only when there are multiple values, e.g.,
$ super -c "values {data:(from employees.parquet)}"
{data:[{id:1::int32,name:"Alice",dept:"HR",salary:50000::int32},{id:2::int32,name:"Bob",dept:"HR",salary:55000::int32},{id:3::int32,name:"Cara",dept:"IT",salary:70000::int32},{id:4::int32,name:"Dave",dept:"IT",salary:75000::int32},{id:5::int32,name:"Eve",dept:"Sales",salary:60000::int32}]}
$ super -s OneEmployee.parquet
{id:1::int32,name:"Alice",dept:"HR",salary:50000::int32}
$ super -c "values {data:(from OneEmployee.parquet)}"
{data:{id:1::int32,name:"Alice",dept:"HR",salary:50000::int32}}
Since the second field of `unnest` needs to be an array, if the user's data happens to be in this degenerate form, the query that succeeded above now fails to produce the per-department average salary.
$ super -c "
from OneEmployee.parquet
| values {
name,
dept,
salary, avg_dept_salary: (
unnest {outer_dept:dept, data:(from OneEmployee.parquet)}
| avg(data.salary) where data.dept==outer_dept
)
}"
{name:"Alice",dept:"HR",salary:50000::int32,avg_dept_salary:null::float64}
Of course, if the user knows their data well enough to expect this, they could add `[ ]` to bundle the subquery result into an array.
$ super -c "
from OneEmployee.parquet
| values {
name,
dept,
salary, avg_dept_salary: (
unnest {outer_dept:dept, data:[(from OneEmployee.parquet)]}
| avg(data.salary) where data.dept==outer_dept
)
}"
{name:"Alice",dept:"HR",salary:50000::int32,avg_dept_salary:50000.}
However, since this may be an unreasonable burden on the user, it may lead us to flip the current behavior shown in https://github.com/brimdata/super/issues/6096#issuecomment-3176495212 such that, in dataflow scope, the results from such subqueries would always be bundled as arrays, even for single values, since that would allow them to slide consistently into `unnest`.
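The shape inconsistency and the proposed flip can be sketched in Python. This is purely illustrative of the design choice; the function names and the unwrapping rule as modeled here are my own rendering of the behavior described in #6096, not super's implementation:

```python
def current_behavior(results):
    # Models the behavior noted in #6096: a subquery yielding a single
    # value returns it bare, while multiple values come back as an array.
    return results[0] if len(results) == 1 else results

def proposed_behavior(results):
    # Proposed flip: always return an array, even for one value, so an
    # unnest-style consumer sees a consistent shape at any cardinality.
    return results

one_row = [{"name": "Alice", "dept": "HR", "salary": 50000}]
two_rows = one_row + [{"name": "Bob", "dept": "HR", "salary": 55000}]

print(current_behavior(one_row))   # bare record: shape differs...
print(current_behavior(two_rows))  # ...from the array returned here
print(proposed_behavior(one_row))  # always an array under the proposal
```

Under the proposal, a consumer like `unnest` never has to special-case the single-value shape.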