prql Excessive intermediate name elision

What's up?

I'm writing this issue following these comments:

"we should ... try and produce legible SQL",
"the examples ... are the biggest selling point for PRQL, in particular the SQL it generates. It looks clean, straightforward, something I would've written myself",

and also because fixing this might be helpful for those who would try to debug other aspects of SQL generator.

Using example from #3170

group {
	...,
	heappages = relpages,
	toastpages = relpages ?? 0,
	toasttuples = reltuples ?? 0,
	...,
} (
	aggregate {
		tpl_hdr_size = 123 + (count attname),
		tpl_data_size = (sum whatever),
	}
)

yields:

WITH table_1 AS (
  SELECT
    ...,
    COALESCE(reltuples, 0) AS _expr_0,
    relpages AS _expr_1,
    COALESCE(relpages, 0) AS _expr_2,
    123 + COUNT(*) AS _expr_3,
    COALESCE(SUM(whatever), 0) AS _expr_4
  FROM
    ...
),

which is then followed by all sorts of _expr_3 + _expr_4 + CASE WHEN _expr_5 = 0 THEN, which is hard to read, let alone debug.

Initially I thought this might be relevant to name shadowing (foo = foo + ...), but, in most cases I've seen so far, it isn't.

Here's what I was able to reduce that example to:

exhibit A

from t
derive {
  b = case {
    a == 1 => 123,
  },
  c = case {
    a == 0 => 0,
    true => b - a,
  },
}
select {
  c
}

Generated SQL:

WITH table_0 AS (
  SELECT
    a,
    CASE
      WHEN a = 1 THEN 123
      ELSE NULL
    END AS _expr_0
  FROM
    t
)
SELECT
  CASE
    WHEN a = 0 THEN 0
    ELSE _expr_0 - a
  END AS c
FROM
  table_0

-- Generated by PRQL compiler version:0.9.2 (https://prql-lang.org)

Expected SQL:

WITH table_0 AS (
  SELECT
    a,
    CASE
      WHEN a = 1 THEN 123
      ELSE NULL
    END AS b
  FROM
    t
)
SELECT
  CASE
    WHEN a = 0 THEN 0
    ELSE b - a
  END AS c
FROM
  table_0

exhibit B

from t
group {
  a = w,
  b = x ?? 0,
} (
  aggregate {
    c = (count y),
    d = (sum z),
  }
)
select {
  a + b,
  c + d,
}

Generated SQL:

WITH table_0 AS (
  SELECT
    COUNT(*) AS _expr_0,
    COALESCE(SUM(z), 0) AS _expr_1,
    w AS _expr_2,
    COALESCE(x, 0) AS _expr_3
  FROM
    t
  GROUP BY
    w,
    COALESCE(x, 0)
)
SELECT
  _expr_2 + _expr_3,
  _expr_0 + _expr_1
FROM
  table_0

-- Generated by PRQL compiler version:0.9.2 (https://prql-lang.org)

Expected SQL:

WITH table_0 AS (
  SELECT
    COUNT(*) AS c,
    COALESCE(SUM(z), 0) AS d,
    w AS a,
    COALESCE(x, 0) AS b
  FROM
    t
  GROUP BY
    w,
    COALESCE(x, 0)
)
SELECT
  a + b,
  c + d
FROM
  table_0

With trivial changes to the last example, names are not elided:

select {
  a, b,
  c + d,
}

...
  SELECT
    w AS a,
    COALESCE(x, 0) AS b,
...
SELECT
  a,
  b,
  _expr_0 + _expr_1
FROM
  table_0

Same goes for c + d → c, d.

Jul 31 '23 13:07 vthriller

Yes, I think this is valid, and we should keep this open — it would be a nicer result if b were used instead of _expr_0. And it's a very well-written issue — Exhibit A seems very sufficient.

But my guess is that as queries become more complicated, the exact relationship between variables in PRQL & SQL become more difficult to track, and so using names across them consistently becomes less reliable. That's ironically more the case if we become better at https://github.com/PRQL/prql/issues/3170.

And because it's difficult to measure whether we're doing this "correctly", I would worry that work here wouldn't compound / might get removed to make way for functional improvements.

As ever, open to ideas / opinions. and if anyone wants to have a go at improving this, that'd def be an improvement...

Jul 31 '23 16:07 max-sixty

But my guess is that as queries become more complicated, the exact relationship between variables in PRQL & SQL become more difficult to track, and so using names across them consistently becomes less reliable. That's ironically more the case if we become better at https://github.com/PRQL/prql/issues/3170.

My initial impression is that #3170, if implemented through CTEs or nested select, shouldn't affect this issue most of the time (in fact, examples above are already generating CTEs, and even if they didn't, into tmp/from tmp hack would quickly make a test case for future versions where #3170 is not an issue).

Shadowing though is problematic, as it definitely requires name mangling of some sort (although it still might be improved by generating something like _foo_1 for foo = (f foo) instead of _expr_7). But even then #3170 might actually come handy, 'cause it makes possible to use CTE boundaries for name shadowing on SQL level, like:

WITH ...,
table_1 AS (
  SELECT
    something(complicated + foo) as foo,
    ...
)
...

That definitely would not be possible in all cases though. Also I'm not sure whether f(foo) as foo is more readable than f(_foo_1) as foo: former would be closer to the PRQL source, while the latter would make it easier to navigate generated source code and figure out where one particular _foo_N comes from.

Aug 10 '23 23:08 vthriller

prql prql copied to clipboard

Excessive intermediate name elision

What's up?

prql
prql copied to clipboard