prql icon indicating copy to clipboard operation
prql copied to clipboard

`LATERAL` joins (AKA `CROSS APPLY`) - necessary for subqueries in which the rhs can reference the lhs

Open GavinRay97 opened this issue 3 years ago • 2 comments

If anyone isn't familiar with this (somewhat obscure) SQL feature, here's a great overview:

  • https://blog.jooq.org/add-lateral-joins-or-cross-apply-to-your-sql-tool-chain/
A very interesting type of JOIN is the LATERAL JOIN (new in PostgreSQL 9.3+),
which is also known as CROSS APPLY/OUTER APPLY in SQL-Server & Oracle.

The basic idea is that a table-valued function (or inline subquery) gets applied for every row you join.

This makes it possible to, for example, only join the first matching entry in another table.

The difference between a normal and a lateral join lies in the fact that you can use a column
that you previously joined in the subquery that you "CROSS APPLY".

You can think of it as like a relational for each loop.

Why this is so important is that without it, it becomes impossible to express queries like:

  • "Give me artists ... and their first 3 albums"
select a.*, alb.*
from   artists a
       cross join lateral (select *
                           from   albums
                           where  albums.artist_id = a.artist_id
                           limit 3) alb

Since you can't put a LIMIT clause on JOINs

select a.*, alb.*
from artists join albums on album_id limit 3 -- not correct

GavinRay97 avatar Jul 28 '22 23:07 GavinRay97

Thanks for the issue @GavinRay97 . I agree Lateral joins are very useful! They're not supported on every dialect, but that doesn't mean we can't add them.

We approach new features from what's required from the query, and then decide how to implement them. So at the moment, IIUC your query correctly, we can express that quite easily with PRQL:

from artists
group artist_id (
  sort (-date)
  take 3
)

(sort isn't in the SQL above but I think implied by your description)

It uses a window rather than a lateral join:

WITH table_0 AS (
  SELECT
    artists.*,
    ROW_NUMBER() OVER (
      PARTITION BY artist_id
      ORDER BY
        date DESC
    ) AS _rn_82
  FROM
    artists
)
SELECT
  table_0.*
FROM
  table_0
WHERE
  _rn_82 <= 3

Does that make sense? Are there examples you've hit of things you can't express with the current impl of PRQL, but could with lateral joins?

max-sixty avatar Jul 30 '22 19:07 max-sixty

How do you continue down the chain of relationships with that form?

Say I wanted to go from artists -> albums -> tracks, with a limit at each level?

With LATERAL or CROSS APPLY you can do this, but I'm not if PRQL allows this For instance, something like this doesn't work -- and subqueries aren't allowed at all so I don't think that works either

# First two artists where name matches <whatever>
table artist_table = (
  from artists
  filter name = "foo"
  take 2
)

# For each artist, their first 3 albums
table album_table = (
  from albums
  filter artist_id in artist_table.artist_id
  take 3
)

# For each album, their first 5 tracks
table tracks_table = (
  from tracks
  filter album_id in album_table.album_id
  take 5
)

from [album_table, artist_table, tracks_table]
select [*]

I'd do it something like this for instance:

SELECT *
FROM
     LATERAL (SELECT * FROM Artist WHERE ... LIMIT 2) AS Artist,
     LATERAL (SELECT * FROM Album WHERE Artist.ArtistId = Album.ArtistId LIMIT 3) AS Album,
     LATERAL (SELECT * FROM Track WHERE Album.AlbumId = Track.AlbumId LIMIT 4) AS Track

GavinRay97 avatar Jul 30 '22 20:07 GavinRay97

In modern PRQL, your example could be written as:

# First two artists where name matches <whatever>
from artists
filter name = "foo"
take 2

# For each artist, their first 3 albums
join albums side:inner (==artist_id)
group albums.artist_id (
  take 3
)

# For each album, their first 5 tracks
join tracks side:inner (tracks.track_id == albums.track_id)
group albums.album_id (
  take 5
)

This is quite ergonomic, so I don't think we need another way of doing the same thing.

We could however, compile some PRQL queries to LATERAL JOINs.

aljazerzen avatar Jul 10 '23 12:07 aljazerzen

It's worth noting that lateral joins have different performance characteristics than CTEs, which can result in considerable speedups:

https://amandasposito.com/postgresql/performance/2021/01/04/postgres-lateral-join.html

https://www.heap.io/blog/postgresqls-powerful-new-join-type-lateral

IMO it would be worth reopening this issue to consider lateral join support as a compiler flag for PRQL for supported target SQL dialects. PRQL's syntax still has a lot to add even in a world of lateral joins!

bpartridge avatar Jul 25 '23 20:07 bpartridge

I agree with @bpartridge . I have a design proposal for LATERAL JOINs which, similar to what we did with loop for RECURSIVE CTEs, in my opinion could really make LATERAL JOINs more accessible and showcase their power to more people.

I'm reopening for now.

snth avatar Jul 25 '23 20:07 snth

Also, it's worth mentioning that the solution given above does not work for most SQL dialects. This is because in many of the table_X CTE scopes, columns with identical names are selected from multiple tables.

For table_2 and table_1 you get:

  • Column alias required for 'artist_id:int(11)'

For table_0 you get:

  • Column alias required for 'artist_id:int(11), name:varchar(200), album_id:int(11)'

See here:

NOTE: Translations to each dialect were done using https://www.jooq.org/translate

  • MariaDB 10 error: https://dbfiddle.uk/6KTLPF5c (Duplicate column name 'artist_id')
  • MySQL 8 error: https://dbfiddle.uk/bIyCtgI5 (Duplicate column name 'artist_id')
  • Oracle 23c error: https://dbfiddle.uk/KiAhdNxM (ORA-00918: column ambiguously defined)
  • SQL Server 2022 error: https://dbfiddle.uk/RMsMu2Ye (The column 'artist_id' was specified multiple times for 'table_2'.)
  • Firebird 4 error: https://dbfiddle.uk/tdH8nVcM (Invalid command column artist_id was specified multiple times for derived table table_2)

Basically, the query only works on Postgres and SQLite.

The query I used was:

👇 CLICK TO DISPLAY PRQL 👇
# First two artists where name matches <whatever>
from artists
filter name == "foo"
take 2

# For each artist, their first 3 albums
join albums side:inner (==artist_id)
group albums.artist_id (
  take 3
)

# For each album, their first 5 tracks
join tracks side:inner (tracks.album_id == albums.album_id)
group albums.album_id (
  take 5
)

GavinRay97 avatar Jul 26 '23 19:07 GavinRay97

This is quite a problem, thanks for detailed report. Since it's more general, I'll open a new issue with a minimal example.

aljazerzen avatar Jul 27 '23 11:07 aljazerzen