prql
prql copied to clipboard
Excessive intermediate name elision
What's up?
I'm writing this issue following these comments:
- "we should ... try and produce legible SQL",
- "the examples ... are the biggest selling point for PRQL, in particular the SQL it generates. It looks clean, straightforward, something I would've written myself",
and also because fixing this might be helpful for those who would try to debug other aspects of SQL generator.
Using example from #3170
group {
...,
heappages = relpages,
toastpages = relpages ?? 0,
toasttuples = reltuples ?? 0,
...,
} (
aggregate {
tpl_hdr_size = 123 + (count attname),
tpl_data_size = (sum whatever),
}
)
yields:
WITH table_1 AS (
SELECT
...,
COALESCE(reltuples, 0) AS _expr_0,
relpages AS _expr_1,
COALESCE(relpages, 0) AS _expr_2,
123 + COUNT(*) AS _expr_3,
COALESCE(SUM(whatever), 0) AS _expr_4
FROM
...
),
which is then followed by all sorts of _expr_3 + _expr_4 + CASE WHEN _expr_5 = 0 THEN, which is hard to read, let alone debug.
Initially I thought this might be relevant to name shadowing (foo = foo + ...), but, in most cases I've seen so far, it isn't.
Here's what I was able to reduce that example to:
exhibit A
from t
derive {
b = case {
a == 1 => 123,
},
c = case {
a == 0 => 0,
true => b - a,
},
}
select {
c
}
Generated SQL:
WITH table_0 AS (
SELECT
a,
CASE
WHEN a = 1 THEN 123
ELSE NULL
END AS _expr_0
FROM
t
)
SELECT
CASE
WHEN a = 0 THEN 0
ELSE _expr_0 - a
END AS c
FROM
table_0
-- Generated by PRQL compiler version:0.9.2 (https://prql-lang.org)
Expected SQL:
WITH table_0 AS (
SELECT
a,
CASE
WHEN a = 1 THEN 123
ELSE NULL
END AS b
FROM
t
)
SELECT
CASE
WHEN a = 0 THEN 0
ELSE b - a
END AS c
FROM
table_0
exhibit B
from t
group {
a = w,
b = x ?? 0,
} (
aggregate {
c = (count y),
d = (sum z),
}
)
select {
a + b,
c + d,
}
Generated SQL:
WITH table_0 AS (
SELECT
COUNT(*) AS _expr_0,
COALESCE(SUM(z), 0) AS _expr_1,
w AS _expr_2,
COALESCE(x, 0) AS _expr_3
FROM
t
GROUP BY
w,
COALESCE(x, 0)
)
SELECT
_expr_2 + _expr_3,
_expr_0 + _expr_1
FROM
table_0
-- Generated by PRQL compiler version:0.9.2 (https://prql-lang.org)
Expected SQL:
WITH table_0 AS (
SELECT
COUNT(*) AS c,
COALESCE(SUM(z), 0) AS d,
w AS a,
COALESCE(x, 0) AS b
FROM
t
GROUP BY
w,
COALESCE(x, 0)
)
SELECT
a + b,
c + d
FROM
table_0
With trivial changes to the last example, names are not elided:
select {
a, b,
c + d,
}
...
SELECT
w AS a,
COALESCE(x, 0) AS b,
...
SELECT
a,
b,
_expr_0 + _expr_1
FROM
table_0
Same goes for c + d → c, d.
Yes, I think this is valid, and we should keep this open — it would be a nicer result if b were used instead of _expr_0. And it's a very well-written issue — Exhibit A seems very sufficient.
But my guess is that as queries become more complicated, the exact relationship between variables in PRQL & SQL become more difficult to track, and so using names across them consistently becomes less reliable. That's ironically more the case if we become better at https://github.com/PRQL/prql/issues/3170.
And because it's difficult to measure whether we're doing this "correctly", I would worry that work here wouldn't compound / might get removed to make way for functional improvements.
As ever, open to ideas / opinions. and if anyone wants to have a go at improving this, that'd def be an improvement...
But my guess is that as queries become more complicated, the exact relationship between variables in PRQL & SQL become more difficult to track, and so using names across them consistently becomes less reliable. That's ironically more the case if we become better at https://github.com/PRQL/prql/issues/3170.
My initial impression is that #3170, if implemented through CTEs or nested select, shouldn't affect this issue most of the time (in fact, examples above are already generating CTEs, and even if they didn't, into tmp/from tmp hack would quickly make a test case for future versions where #3170 is not an issue).
Shadowing though is problematic, as it definitely requires name mangling of some sort (although it still might be improved by generating something like _foo_1 for foo = (f foo) instead of _expr_7). But even then #3170 might actually come handy, 'cause it makes possible to use CTE boundaries for name shadowing on SQL level, like:
WITH ...,
table_1 AS (
SELECT
something(complicated + foo) as foo,
...
)
...
That definitely would not be possible in all cases though. Also I'm not sure whether f(foo) as foo is more readable than f(_foo_1) as foo: former would be closer to the PRQL source, while the latter would make it easier to navigate generated source code and figure out where one particular _foo_N comes from.