prql
prql copied to clipboard
`LATERAL` joins (AKA `CROSS APPLY`) - necessary for subqueries in which the rhs can reference the lhs
If anyone isn't familiar with this (somewhat obscure) SQL feature, here's a great overview:
- https://blog.jooq.org/add-lateral-joins-or-cross-apply-to-your-sql-tool-chain/
A very interesting type of JOIN is the LATERAL JOIN (new in PostgreSQL 9.3+),
which is also known as CROSS APPLY/OUTER APPLY in SQL-Server & Oracle.
The basic idea is that a table-valued function (or inline subquery) gets applied for every row you join.
This makes it possible to, for example, only join the first matching entry in another table.
The difference between a normal and a lateral join lies in the fact that you can use a column
that you previously joined in the subquery that you "CROSS APPLY".
You can think of it as like a relational for each loop.
Why this is so important is that without it, it becomes impossible to express queries like:
- "Give me artists ... and their first
3albums"
select a.*, alb.*
from artists a
cross join lateral (select *
from albums
where albums.artist_id = a.artist_id
limit 3) alb
Since you can't put a LIMIT clause on JOINs
select a.*, alb.*
from artists join albums on album_id limit 3 -- not correct
Thanks for the issue @GavinRay97 . I agree Lateral joins are very useful! They're not supported on every dialect, but that doesn't mean we can't add them.
We approach new features from what's required from the query, and then decide how to implement them. So at the moment, IIUC your query correctly, we can express that quite easily with PRQL:
from artists
group artist_id (
sort (-date)
take 3
)
(sort isn't in the SQL above but I think implied by your description)
It uses a window rather than a lateral join:
WITH table_0 AS (
SELECT
artists.*,
ROW_NUMBER() OVER (
PARTITION BY artist_id
ORDER BY
date DESC
) AS _rn_82
FROM
artists
)
SELECT
table_0.*
FROM
table_0
WHERE
_rn_82 <= 3
Does that make sense? Are there examples you've hit of things you can't express with the current impl of PRQL, but could with lateral joins?
How do you continue down the chain of relationships with that form?
Say I wanted to go from artists -> albums -> tracks, with a limit at each level?
With LATERAL or CROSS APPLY you can do this, but I'm not if PRQL allows this
For instance, something like this doesn't work -- and subqueries aren't allowed at all so I don't think that works either
# First two artists where name matches <whatever>
table artist_table = (
from artists
filter name = "foo"
take 2
)
# For each artist, their first 3 albums
table album_table = (
from albums
filter artist_id in artist_table.artist_id
take 3
)
# For each album, their first 5 tracks
table tracks_table = (
from tracks
filter album_id in album_table.album_id
take 5
)
from [album_table, artist_table, tracks_table]
select [*]
I'd do it something like this for instance:
SELECT *
FROM
LATERAL (SELECT * FROM Artist WHERE ... LIMIT 2) AS Artist,
LATERAL (SELECT * FROM Album WHERE Artist.ArtistId = Album.ArtistId LIMIT 3) AS Album,
LATERAL (SELECT * FROM Track WHERE Album.AlbumId = Track.AlbumId LIMIT 4) AS Track
In modern PRQL, your example could be written as:
# First two artists where name matches <whatever>
from artists
filter name = "foo"
take 2
# For each artist, their first 3 albums
join albums side:inner (==artist_id)
group albums.artist_id (
take 3
)
# For each album, their first 5 tracks
join tracks side:inner (tracks.track_id == albums.track_id)
group albums.album_id (
take 5
)
This is quite ergonomic, so I don't think we need another way of doing the same thing.
We could however, compile some PRQL queries to LATERAL JOINs.
It's worth noting that lateral joins have different performance characteristics than CTEs, which can result in considerable speedups:
https://amandasposito.com/postgresql/performance/2021/01/04/postgres-lateral-join.html
https://www.heap.io/blog/postgresqls-powerful-new-join-type-lateral
IMO it would be worth reopening this issue to consider lateral join support as a compiler flag for PRQL for supported target SQL dialects. PRQL's syntax still has a lot to add even in a world of lateral joins!
I agree with @bpartridge . I have a design proposal for LATERAL JOINs which, similar to what we did with loop for RECURSIVE CTEs, in my opinion could really make LATERAL JOINs more accessible and showcase their power to more people.
I'm reopening for now.
Also, it's worth mentioning that the solution given above does not work for most SQL dialects.
This is because in many of the table_X CTE scopes, columns with identical names are selected from multiple tables.
For table_2 and table_1 you get:
Column alias required for 'artist_id:int(11)'
For table_0 you get:
Column alias required for 'artist_id:int(11), name:varchar(200), album_id:int(11)'
See here:
NOTE: Translations to each dialect were done using https://www.jooq.org/translate
- MariaDB 10 error: https://dbfiddle.uk/6KTLPF5c (
Duplicate column name 'artist_id') - MySQL 8 error: https://dbfiddle.uk/bIyCtgI5 (
Duplicate column name 'artist_id') - Oracle 23c error: https://dbfiddle.uk/KiAhdNxM (
ORA-00918: column ambiguously defined) - SQL Server 2022 error: https://dbfiddle.uk/RMsMu2Ye (
The column 'artist_id' was specified multiple times for 'table_2'.) - Firebird 4 error: https://dbfiddle.uk/tdH8nVcM (
Invalid command column artist_id was specified multiple times for derived table table_2)
Basically, the query only works on Postgres and SQLite.
The query I used was:
👇 CLICK TO DISPLAY PRQL 👇
# First two artists where name matches <whatever>
from artists
filter name == "foo"
take 2
# For each artist, their first 3 albums
join albums side:inner (==artist_id)
group albums.artist_id (
take 3
)
# For each album, their first 5 tracks
join tracks side:inner (tracks.album_id == albums.album_id)
group albums.album_id (
take 5
)
This is quite a problem, thanks for detailed report. Since it's more general, I'll open a new issue with a minimal example.