prql icon indicating copy to clipboard operation
prql copied to clipboard

Duplicate column names when joining two whole tables

Open aljazerzen opened this issue 1 year ago • 4 comments

What's up?

Follow up for

Imagine we have two tables:

let artists <[{artist_id, name}]>
let albums <[{album_id, name, artist_id}]>

If you do this:

from artists
join albums (albums.artist_id == artists.artist_id)

Our semantics dictate that the following relation should be produced:

[{artists = {artist_id, name}, albums = {album_id, name, artist_id}}]

when this is compiled to SQL, nested tuples are flattened:

SELECT artist_id, name, album_id, name, artist_id
FROM artists
JOIN albums ON albums.artist_id = artists.artist_id

... which is problematic because:

  • you can no longer differentiate between the two artist_ids (which are not necessarily exactly the same, because of NULLs)
  • some dialects will report errors as reported in https://github.com/PRQL/prql/issues/879#issuecomment-1652401282

We can solve this quite easily, by instead compiling to:

SELECT "artists.artist_id", "artists.name", "albums.album_id", "albums.name", "albums.artist_id"
FROM artists
JOIN albums ON albums.artist_id = artists.artist_id

(... which would actually be more correct than what we currently have.)

The problem here is unknown tables. If the compiler has no knowledge of all the table columns, it produces this:

SELECT artists.*, artists.*
FROM artists
JOIN albums ON albums.artist_id = artists.artist_id

And even if SELECT artists.* EXCLUDE (artist_id), there will be clashes on name.

TLDR; when we are joining two tables without complete knowledge of all the columns, it is not possible to produce valid SQL for some dialects

aljazerzen avatar Jul 27 '23 12:07 aljazerzen