prql After Join generated extra ORDER BY section with phantom field

What happened?

In non generic targets the extra ORDER BY section is added in the end of the output. In my input I don't specify sorting and I don't need it.

And there is a tracks.name field from the tracks table in this section, which is not presented in FROM section. It produces the error:

Reproduced in the website playground.

PRQL input

prql target:sql.postgres

from tracks
group media_type_id(
  sort name
  take 1
)
join media_types (== media_type_id)
select {
  tracks.track_id,
  media_types.name
}

SQL output

WITH table_0 AS (
  SELECT
    DISTINCT ON (media_type_id) track_id,
    media_type_id,
    name
  FROM
    tracks
  ORDER BY
    media_type_id,
    name
)
SELECT
  table_0.track_id,
  media_types.name
FROM
  table_0
  JOIN media_types ON table_0.media_type_id = media_types.media_type_id
ORDER BY
  table_0.media_type_id,
  tracks.name

Expected SQL output

WITH table_0 AS (
  SELECT
    DISTINCT ON (media_type_id) track_id,
    media_type_id,
    name
  FROM
    tracks
  ORDER BY
    media_type_id,
    name
)
SELECT
  table_0.track_id,
  media_types.name
FROM
  table_0
  JOIN media_types ON table_0.media_type_id = media_types.media_type_id

MVCE confirmation

[X] Minimal example
[X] New issue

Anything else?

No response

Jun 18 '24 14:06 annashmatko

Yes, it seems to retain the sorting even though it's not needed in the DISTINCT ON case... Thanks for the report.

Jun 21 '24 16:06 max-sixty

I did some playing around with this and it seems to be caused by the combination of:

postgres or duckdb dialects
take 1 which specialises to DISTINCT ON for those dialects.

If you remove the dialect or change it to take 2 then it switches to the ROW_NUMBER() window function based approach which is more general and doesn't have that ORDER BY issue.

Is there any benefit to keep that DISTINCT ON specialisation or could we just use the more general algorithm?

Btw, in the general case, the intermediate variable _expr_0 leaks into the result set, e.g.

prql target:sql.postgres

from tracks
group media_type_id(
  sort name
  take 2
)
join media_types (== media_type_id)

Extra columns can often just be ignored but it might cause an issue for some people.

Jul 10 '24 12:07 snth

I use a ClickHouse dialect in my job.

In a clickhouse db I have a table where each row is a new state of some object. And a use case actually is to receive the last state.

So I group by object_id, order by update_date_time desc and take 1. This method gives me the last row in each group. I cannot use take 2 in this example.

Extra columns can often just be ignored but it might cause an issue for some people.

For instance, if I later UNION tables I get 'mismatch column error' as one of a tabe has extra hidden columns.

Jul 10 '24 13:07 annashmatko