prql
prql copied to clipboard
Join three tables on same column name
I just tried to upgrade to 0.3.0 and ran into the following issue:
We have 3 tables which all have the call_id
column and I want to join them all together:
table output = (
from hungup
join side:left beginning [==call_id]
join side:left accepted [==call_id]
[...]
)
This now fails at the line join side:left accepted [==call_id]
with the error Ambiguous reference. Could be from any of _frame.hungup.call_id, _frame.beginning.call_id
.
Is this a bug or do I need to do something differently?
When running this in the playground, I don't see an error (the Playground does use the main
branch rather than a published version, so that might be why):
table output = (
from hungup
join side:left beginning [==call_id]
join side:left accepted [==call_id]
)
from output
However, since 0.3.0, the compiler is indeed more strict in general around expecting you to be explicit when there's potential ambiguity – there's a difference between these two (and the compiler doesn't want to randomly guess which one you mean):
left join accepted on hungup.call_id = accepted.call_id
and
left join accepted on beginning.call_id = accepted.call_id
You can satisfy the compiler by being more explicit:
table output = (
from hungup
join side:left beginning [hungup.call_id == beginning.call_id] # this line could probably technically be left unchanged
join side:left accepted [hungup.call_id == accepted.call_id] # this one needs you to be explicit, as now 2 other tables have call_id
[...]
)
Edit: fixed the 2 SQL examples 🤦
Makes sense and works for me. Thanks for the help.
Yeah, this is a downside of not using USING (foo)
any longer — instead we use ON x.foo = y.foo
— but that means that multiple tables can't be joined without being explicit, since USING
collapses them into one column, but ON
doesn't.
I'll reopen this to see whether others have views. I was a fan of USING
but maybe alone in that!
Given that we decide in favor of option 1 from #1193, this should error out.
As @mklopets pointed out there are 3 possible call_id
columns which are not necessarily identical - because of side:left
, some values in some of the columns may be null. So the distinction between them is important.
That said, maybe we need a better solution for multiple joins. Possibly via referring to the foreign key and using database schema to derive the table name and join filter.
Given that we decide in favor of option 1 from #1193, this should error out.
I'm not sure we've decided this yet! :) (Though nod to Do-ocracy...)
As @mklopets pointed out there are 3 possible
call_id
columns which are not necessarily identical - because ofside:left
, some values in some of the columns may be null. So the distinction between them is important.
Ah, this is a good point in this example given they are left-joins; I don't think there's a way around that. Though in the case that that they are equi-joins; i.e.:
from hungup
join beginning [==call_id]
join accepted [==call_id]
...then we could use the same column, but I think it relies on using USING
, which collapses the columns. The current result for this isn't valid, but the diff here would be:
SELECT
hungup.*,
beginning.*,
accepted.*
FROM
hungup
- JOIN beginning ON hungup.call_id = beginning.call_id
- JOIN accepted ON call_id = accepted.call_id
+ JOIN beginning USING (call_id)
+ JOIN accepted USING (call_id)
That said, maybe we need a better solution for multiple joins. Possibly via referring to the foreign key and using database schema to derive the table name and join filter.
Yes, many of these discussions end up with something like "If we knew the semantic model, we could know how to join the tables, at least in 90% of cases". I don't have a great visualization of what that would look like though, without a big jump in the language. I'm also reminded of https://github.com/prql/prql/issues/723 & https://github.com/prql/prql/issues/716
I meant:
In scenario when we decide in favor of option 1 from https://github.com/prql/prql/issues/1193, this should error out.
I regularly have to join 4 or 5 different tables and it's very important to me that the join syntax is as clean as possible. In join syntax, what's most important to me is that column names are specified only once to reduce noise and mistakes/typos.
I'm happy with the current behaviour as described in #1335 (i.e. not generating USING
), but I'd like to ask for syntax that makes it easy to reference the table that we just joined.
For example, something like this would be nice
from hungup
join beginning [==call_id]
join accepted [=={last,that}.call_id]
In the above example, last
is referring to the table we just joined (accepted
). (Yes last
is a confusing name we can definitely come up with something better).
The benefit is apparent when you have a long chain of one-to-many relationship you need to join
from grandparent
join parent [==grandparent_id]
join child [=={last,that}.parent_id]
join grandchild [=={last,that}.child_id]
Now as a reader is only presented with the important information (table / column names) only once.
Other ideas
We can also make [==col_name]
expand to the name of the table that was most recently joined.
Which means the above example becomes just:
from grandparent
join parent [==grandparent_id]
join child [==parent_id]
join grandchild [==child_id]
Which expands to the SQL
select
...
from grandparent
join parent on (grandparent.grandparent_id = parent.grandparent_id)
join child on (parent.parent_id = child.parent_id)
join grandchild on (child.child_id = grandchild.child_id)
The issue with having concepts such as last
is it makes the pipeline state much more complicated — we have to consider the order of the transforms that came before it, not just the result of the pipeline. The additional complication affects both the compiler and users — e.g. copying & pasting some lines isn't quite as reliable.
I agree something like that would make doing lots of joins easier, though.
Would using USING
achieve that too though? At first glance, it seems that would condense these repeated joins too?
I believe that USING is not very well specified in the standard (e.g. https://community.snowflake.com/s/question/0D50Z00008WRZBBSA5/bug-with-join-using-) and the coalesce behavior removes flexibility from the user.
EDIT: Added missing "not"
I believe that USING is very well specified in the standard
The accepted answer for the question you've linked to says that USING is not part of the ANSI standard and discouraged by Snowflake team. We've had a similar experience when trying to achieve the same behavior on different DBMSs.
In any case, we should be discussing "how PRQL does joins", not "how SQL does joins".
Current behavior is indeed pragmatic and tedious, there should be a better way to do it.
Current behavior is indeed pragmatic and tedious, there should be a better way to do it.
(typo? Or you mean it is pragmatic?)
The accepted answer for the question you've linked to says that USING is not part of the ANSI standard and discouraged by Snowflake team.
Interesting, good point.
I do think USING
has solved so many of my JOIN
issues through time, I've even structured my tables to work well with it (e.g. ensuring columns were named the same across tables). But I recognize our views differ, and I do update on the opinion from snowflake at that link.
I do think this is a difficult issue — we can either:
- make it quite easy iff we know all the columns — i.e. a more complete version of
USING
- ...but if we don't know all the columns, then we can't, since
from a | join b (==x) | join c (==y)
doesn't know where each column is[^1]
A language with them can be confusing, and it's rare we encounter them. So the current approach doesn't use the information that it often-but-not-always-has.
[^1]: this was an advantage of USING
— that the DB knows the columns!
Sorry @aljazerzen yes I meant NOT very well specified 😅 Edited and yes I totally agree with "In any case, we should be discussing "how PRQL does joins", not "how SQL does joins"." but I understand there are various constraints and it isn't easy.