`select` in joined pipeline causes invalid expressions to be referenced
What happened?
Adding a select in a joined pipeline after a few columns have been derived results in CTEs referecing invalid columns
PRQL input
from (s"SELECT a,b,c,d FROM my_table")
join side:left (
s"SELECT e,f,g FROM other_table"
derive {
my_concatenated_col = f"{col} {other_col}"
}
group {my_concatenated_col} (aggregate { my_sum = first this.`my_concatenated_col`})
sort {this.my_concatenated_col, my_sum}
derive {new_name = my_sum}
select {this.my_concatenated_col, this.new_name}
) (this.my_concatenated_col == that.my_concatenated_col)
SQL output
WITH table_0 AS (
SELECT
a,
b,
c,
d
FROM
my_table
),
table_2 AS (
SELECT
e,
f,
g
FROM
other_table
),
table_3 AS (
SELECT
CONCAT(col, ' ', other_col) AS my_concatenated_col,
FIRST_VALUE(my_concatenated_col) AS _expr_0
FROM
table_2
GROUP BY
CONCAT(col, ' ', other_col)
),
table_4 AS (
SELECT
my_concatenated_col,
_expr_0 AS new_name,
_expr_0
FROM
table_3
),
table_1 AS (
SELECT
my_concatenated_col,
new_name,
CONCAT(col, ' ', other_col) AS my_concatenated_col,
FIRST_VALUE(my_concatenated_col) AS _expr_0
FROM
table_4
)
SELECT
table_0.*,
table_1.my_concatenated_col,
table_1.new_name
FROM
table_0
LEFT JOIN table_1 ON table_0.my_concatenated_col = table_1.my_concatenated_col
-- Generated by PRQL compiler version:0.13.4 (https://prql-lang.org)
Expected SQL output
MVCE confirmation
- [x] Minimal example
- [x] New issue
Anything else?
table_1 should not exist at all: table_4 already does the selection work, and my_concatenated_col has already been defined.
The following changes result in valid SQL being generate:
- Removing the
derive {new_name = my_sum}line andselectingthis.my_sumbelow - Removing the
selectline altogether - Removing the
groupline
@kgutwin it seems like the issue in table_1 stems from #5098 : in the postprocess stage, this reverts this.my_concatenated_col and my_sum to their original definition in table_3. Commenting the revert_sorts line out fixes this specific issue, but causes the bug in #5098 to appear again.
I'm not really sure what the best way to fix this would be, do you have an idea ?
Many thanks in advance!
I'd like to try to help, especially since I can't be certain of the fix in #5098 :)
However, I'm having a bit of trouble following your original issue report. The definitions of col and other_col seem to be missing? Is this a minimal reproduction of the error?
The other strange thing from the provided SQL output is that you have a sort in the query, but it doesn't seem to result in an ORDER BY in the SQL. #5098 is all about sort columns, since PRQL's opinion on sorting is that it should be the very last step in the pipeline (see #1363 as to why). I wonder if the use of group is part of the trigger as well, see https://github.com/PRQL/prql/issues/1363#issuecomment-1368528636.
How much of this issue is resolved by #5305?
Hope that helps the thought process...
@kgutwin Thanks for the quick response :)
You're right, here's an updated example, with one less derive statement. At the bottom of the comment are the results with the current main branch (3f2e21bd) . Removing any of the sort, derive or select statements in the joined pipeline results in a valid query.
If you check out the debug log, You'll see that the pipeline is valid at the end of the sql-anchor stage: table_1 only selects columns 13 and 14. table_3 maps CIds as 4 -> 11, 7 -> 12 and table_4 maps them as 11 -> 13, 12 -> 14.
If we look at the sql-postprocess stage, we can see that what happens is that when processing table_4, we use the CId redirects from table_3 to revert the sort columns:
[DEBUG prqlc::sql::pq::postprocess] infer_sorts: table-4
[DEBUG prqlc::sql::pq::anchor] redirect sorts [] RIId(3) cid_redirects {column-7: column-12, column-4: column-11}
[DEBUG prqlc::sql::pq::anchor] reverting column-11 back to column-4
[DEBUG prqlc::sql::pq::anchor] reverting column-12 back to column-7
[DEBUG prqlc::sql::pq::postprocess] --- sorting [ColumnSort { direction: Asc, column: column-4 }, ColumnSort { direction: Asc, column: column-7 }]
last_sorting thus becomes [column-4, column-7], as well as ctes_sorting[table_4]
Thus, when processing table_1, in fold_sql_transforms, sorting becomes [column-4, column-7]. Since these columns are not found in table_4's CId mapping, they get inserted into the select statement as-is:
[DEBUG prqlc::sql::pq::postprocess] infer_sorts: table-1
[DEBUG prqlc::sql::pq::anchor] redirect sorts [ColumnSort { direction: Asc, column: column-4 }, ColumnSort { direction: Asc, column: column-7 }] RIId(4) cid_redirects {column-8: column-14, column-12: column-15, column-11: column-13}
[DEBUG prqlc::sql::pq::anchor] mapping column-4 via {column-8: column-14, column-12: column-15, column-11: column-13} to column-4
[DEBUG prqlc::sql::pq::anchor] mapping column-7 via {column-8: column-14, column-12: column-15, column-11: column-13} to column-7
[DEBUG prqlc::sql::pq::postprocess] adding column-4 to [column-13, column-14]
[DEBUG prqlc::sql::pq::postprocess] adding column-7 to [column-13, column-14, column-4]
Finally, since there's not sort in table_1, last_sorting is [],. which probably explains why no ORDER BY clause appears in the generated SQL, since Sort transforms are not emitted in fold_sql_transforms.
Having columns 11 and 12 added to table_1 instead of 4 and 7 would solve this, but I'm not sure about what condition should determine if the sorts should be reverted or not 😕
Reproduction
from (s"SELECT a,b,c,d FROM my_table")
join side:left (
s"SELECT e,f,g FROM other_table"
group {f} (aggregate { first_e = first this.`e`})
sort {this.f, this.first_e}
derive {new_name = this.first_e}
select {this.f, this.new_name}
) (this.a == that.f)
WITH table_0 AS (
SELECT
a,
b,
c,
d
FROM
my_table
),
table_2 AS (
SELECT
e,
f,
g
FROM
other_table
),
table_3 AS (
SELECT
f,
FIRST_VALUE(e) AS _expr_0
FROM
table_2
GROUP BY
f
),
table_4 AS (
SELECT
f,
_expr_0 AS new_name,
_expr_0
FROM
table_3
),
table_1 AS (
SELECT
f,
new_name,
FIRST_VALUE(e) AS _expr_0
FROM
table_4
)
SELECT
table_0.a,
table_0.b,
table_0.c,
table_0.d,
table_1.f,
table_1.new_name
FROM
table_0
LEFT OUTER JOIN table_1 ON table_0.a = table_1.f
-- Generated by PRQL compiler version:0.13.5 (https://prql-lang.org)
debug_log.html.txt (had to rename to .txt or github would refuse it)
Thanks for breaking it down, this logic is super hard to follow... unfortunately it was back in January when I last looked at this, so most of this reasoning has been forgotten, although I am trying to bring it back.
The revert_sorts function was added in #5098 because without it, a chain of sort | filter | join results in the sort columns getting lost when the filter step results in an intermediate CTE. However, I'm sure there's an issue with its approach somehow, because for you, it's causing that extra FIRST_VALUE(e) AS _expr_0 expression in table_1.
I will hopefully come back to this question, but for now, I will leave you with this crazy situation I found. I took your reproduction PRQL and modified it slightly; I flipped the order of the join so that the larger pipeline is first, in order to not lose the ORDER BY. That looks like this:
s"SELECT e,f,g FROM other_table"
group {f} (aggregate { first_e = first this.`e`})
sort {this.f, this.first_e}
derive {new_name = this.first_e}
select {this.f, this.new_name}
join side:left ( s"SELECT a,b,c,d FROM my_table" ) (this.f == that.a)
This compiles into valid SQL regardless of whether revert_sorts() is commented out in postprocess.rs:164.
I then added a filter statement right after the sort:
s"SELECT e,f,g FROM other_table"
group {f} (aggregate { first_e = first this.`e`})
sort {this.f, this.first_e}
filter this.first_e > 25
derive {new_name = this.first_e}
select {this.f, this.new_name}
join side:left ( s"SELECT a,b,c,d FROM my_table" ) (this.f == that.a)
Now, neither having revert_sorts() enabled or disabled will produce valid SQL. With revert_sorts() enabled, this (#5302) bug happens (you get the extra FIRST_VALUE(e) AS _expr_0 expression); with it disabled, you see the #5097 bug. So it's guaranteed that the current revert_sorts() approach is wrong somehow... 😩
@kgutwin Sorry for delay, it took me time to figure out the bug. Thanks for the test cases, I've adapted them to match the playground tables, and added them to the fix in #5338 . I tested the generated queries in postgres and they work as expected 🙂