prql icon indicating copy to clipboard operation
prql copied to clipboard

`select` in joined pipeline causes invalid expressions to be referenced

Open lukapeschke opened this issue 7 months ago • 4 comments

What happened?

Adding a select in a joined pipeline after a few columns have been derived results in CTEs referecing invalid columns

PRQL input

from (s"SELECT a,b,c,d FROM my_table") 
join side:left  (
  s"SELECT e,f,g FROM other_table"
  derive {
    my_concatenated_col = f"{col} {other_col}"
  }
  group {my_concatenated_col} (aggregate { my_sum = first this.`my_concatenated_col`})
  sort {this.my_concatenated_col, my_sum}
  derive {new_name = my_sum}
  select {this.my_concatenated_col, this.new_name}
) (this.my_concatenated_col == that.my_concatenated_col)

SQL output

WITH table_0 AS (
  SELECT
    a,
    b,
    c,
    d
  FROM
    my_table
),
table_2 AS (
  SELECT
    e,
    f,
    g
  FROM
    other_table
),
table_3 AS (
  SELECT
    CONCAT(col, ' ', other_col) AS my_concatenated_col,
    FIRST_VALUE(my_concatenated_col) AS _expr_0
  FROM
    table_2
  GROUP BY
    CONCAT(col, ' ', other_col)
),
table_4 AS (
  SELECT
    my_concatenated_col,
    _expr_0 AS new_name,
    _expr_0
  FROM
    table_3
),
table_1 AS (
  SELECT
    my_concatenated_col,
    new_name,
    CONCAT(col, ' ', other_col) AS my_concatenated_col,
    FIRST_VALUE(my_concatenated_col) AS _expr_0
  FROM
    table_4
)
SELECT
  table_0.*,
  table_1.my_concatenated_col,
  table_1.new_name
FROM
  table_0
  LEFT JOIN table_1 ON table_0.my_concatenated_col = table_1.my_concatenated_col

-- Generated by PRQL compiler version:0.13.4 (https://prql-lang.org)

Expected SQL output


MVCE confirmation

  • [x] Minimal example
  • [x] New issue

Anything else?

table_1 should not exist at all: table_4 already does the selection work, and my_concatenated_col has already been defined.

The following changes result in valid SQL being generate:

  • Removing the derive {new_name = my_sum} line and selecting this.my_sum below
  • Removing the select line altogether
  • Removing the group line

lukapeschke avatar May 21 '25 14:05 lukapeschke

@kgutwin it seems like the issue in table_1 stems from #5098 : in the postprocess stage, this reverts this.my_concatenated_col and my_sum to their original definition in table_3. Commenting the revert_sorts line out fixes this specific issue, but causes the bug in #5098 to appear again.

I'm not really sure what the best way to fix this would be, do you have an idea ?

Many thanks in advance!

lukapeschke avatar Jun 11 '25 16:06 lukapeschke

I'd like to try to help, especially since I can't be certain of the fix in #5098 :)

However, I'm having a bit of trouble following your original issue report. The definitions of col and other_col seem to be missing? Is this a minimal reproduction of the error?

The other strange thing from the provided SQL output is that you have a sort in the query, but it doesn't seem to result in an ORDER BY in the SQL. #5098 is all about sort columns, since PRQL's opinion on sorting is that it should be the very last step in the pipeline (see #1363 as to why). I wonder if the use of group is part of the trigger as well, see https://github.com/PRQL/prql/issues/1363#issuecomment-1368528636.

How much of this issue is resolved by #5305?

Hope that helps the thought process...

kgutwin avatar Jun 11 '25 18:06 kgutwin

@kgutwin Thanks for the quick response :)

You're right, here's an updated example, with one less derive statement. At the bottom of the comment are the results with the current main branch (3f2e21bd) . Removing any of the sort, derive or select statements in the joined pipeline results in a valid query.

If you check out the debug log, You'll see that the pipeline is valid at the end of the sql-anchor stage: table_1 only selects columns 13 and 14. table_3 maps CIds as 4 -> 11, 7 -> 12 and table_4 maps them as 11 -> 13, 12 -> 14.

If we look at the sql-postprocess stage, we can see that what happens is that when processing table_4, we use the CId redirects from table_3 to revert the sort columns:

[DEBUG prqlc::sql::pq::postprocess] infer_sorts: table-4
[DEBUG prqlc::sql::pq::anchor] redirect sorts [] RIId(3) cid_redirects {column-7: column-12, column-4: column-11}
[DEBUG prqlc::sql::pq::anchor] reverting column-11 back to column-4
[DEBUG prqlc::sql::pq::anchor] reverting column-12 back to column-7
[DEBUG prqlc::sql::pq::postprocess] --- sorting [ColumnSort { direction: Asc, column: column-4 }, ColumnSort { direction: Asc, column: column-7 }] 

last_sorting thus becomes [column-4, column-7], as well as ctes_sorting[table_4]

Thus, when processing table_1, in fold_sql_transforms, sorting becomes [column-4, column-7]. Since these columns are not found in table_4's CId mapping, they get inserted into the select statement as-is:

[DEBUG prqlc::sql::pq::postprocess] infer_sorts: table-1
[DEBUG prqlc::sql::pq::anchor] redirect sorts [ColumnSort { direction: Asc, column: column-4 }, ColumnSort { direction: Asc, column: column-7 }] RIId(4) cid_redirects {column-8: column-14, column-12: column-15, column-11: column-13}
[DEBUG prqlc::sql::pq::anchor] mapping column-4 via {column-8: column-14, column-12: column-15, column-11: column-13} to column-4
[DEBUG prqlc::sql::pq::anchor] mapping column-7 via {column-8: column-14, column-12: column-15, column-11: column-13} to column-7
[DEBUG prqlc::sql::pq::postprocess] adding column-4 to [column-13, column-14]
[DEBUG prqlc::sql::pq::postprocess] adding column-7 to [column-13, column-14, column-4] 

Finally, since there's not sort in table_1, last_sorting is [],. which probably explains why no ORDER BY clause appears in the generated SQL, since Sort transforms are not emitted in fold_sql_transforms.

Having columns 11 and 12 added to table_1 instead of 4 and 7 would solve this, but I'm not sure about what condition should determine if the sorts should be reverted or not 😕

Reproduction

from (s"SELECT a,b,c,d FROM my_table")
join side:left  (
  s"SELECT e,f,g FROM other_table"
  group {f} (aggregate { first_e = first this.`e`})
  sort {this.f, this.first_e}
  derive {new_name = this.first_e}
  select {this.f, this.new_name}
) (this.a == that.f)
WITH table_0 AS (
  SELECT
    a,
    b,
    c,
    d
  FROM
    my_table
),
table_2 AS (
  SELECT
    e,
    f,
    g
  FROM
    other_table
),
table_3 AS (
  SELECT
    f,
    FIRST_VALUE(e) AS _expr_0
  FROM
    table_2
  GROUP BY
    f
),
table_4 AS (
  SELECT
    f,
    _expr_0 AS new_name,
    _expr_0
  FROM
    table_3
),
table_1 AS (
  SELECT
    f,
    new_name,
    FIRST_VALUE(e) AS _expr_0
  FROM
    table_4
)
SELECT
  table_0.a,
  table_0.b,
  table_0.c,
  table_0.d,
  table_1.f,
  table_1.new_name
FROM
  table_0
  LEFT OUTER JOIN table_1 ON table_0.a = table_1.f

-- Generated by PRQL compiler version:0.13.5 (https://prql-lang.org)

debug_log.html.txt (had to rename to .txt or github would refuse it)

lukapeschke avatar Jun 12 '25 09:06 lukapeschke

Thanks for breaking it down, this logic is super hard to follow... unfortunately it was back in January when I last looked at this, so most of this reasoning has been forgotten, although I am trying to bring it back.

The revert_sorts function was added in #5098 because without it, a chain of sort | filter | join results in the sort columns getting lost when the filter step results in an intermediate CTE. However, I'm sure there's an issue with its approach somehow, because for you, it's causing that extra FIRST_VALUE(e) AS _expr_0 expression in table_1.


I will hopefully come back to this question, but for now, I will leave you with this crazy situation I found. I took your reproduction PRQL and modified it slightly; I flipped the order of the join so that the larger pipeline is first, in order to not lose the ORDER BY. That looks like this:

s"SELECT e,f,g FROM other_table"
group {f} (aggregate { first_e = first this.`e`})
sort {this.f, this.first_e}
derive {new_name = this.first_e}
select {this.f, this.new_name}
join side:left ( s"SELECT a,b,c,d FROM my_table" ) (this.f == that.a)

This compiles into valid SQL regardless of whether revert_sorts() is commented out in postprocess.rs:164.

I then added a filter statement right after the sort:

s"SELECT e,f,g FROM other_table"
group {f} (aggregate { first_e = first this.`e`})
sort {this.f, this.first_e}
filter this.first_e > 25
derive {new_name = this.first_e}
select {this.f, this.new_name}
join side:left ( s"SELECT a,b,c,d FROM my_table" ) (this.f == that.a)

Now, neither having revert_sorts() enabled or disabled will produce valid SQL. With revert_sorts() enabled, this (#5302) bug happens (you get the extra FIRST_VALUE(e) AS _expr_0 expression); with it disabled, you see the #5097 bug. So it's guaranteed that the current revert_sorts() approach is wrong somehow... 😩

kgutwin avatar Jun 17 '25 14:06 kgutwin

@kgutwin Sorry for delay, it took me time to figure out the bug. Thanks for the test cases, I've adapted them to match the playground tables, and added them to the fix in #5338 . I tested the generated queries in postgres and they work as expected 🙂

lukapeschke avatar Jun 25 '25 14:06 lukapeschke