prql icon indicating copy to clipboard operation
prql copied to clipboard

Join three tables on same column name

Open neelance opened this issue 2 years ago • 13 comments

I just tried to upgrade to 0.3.0 and ran into the following issue:

We have 3 tables which all have the call_id column and I want to join them all together:

table output = (
  from hungup
  join side:left beginning [==call_id]
  join side:left accepted [==call_id]
  [...]
)

This now fails at the line join side:left accepted [==call_id] with the error Ambiguous reference. Could be from any of _frame.hungup.call_id, _frame.beginning.call_id.

Is this a bug or do I need to do something differently?

neelance avatar Dec 01 '22 14:12 neelance

When running this in the playground, I don't see an error (the Playground does use the main branch rather than a published version, so that might be why):

table output = (
  from hungup
  join side:left beginning [==call_id]
  join side:left accepted [==call_id]
)

from output

However, since 0.3.0, the compiler is indeed more strict in general around expecting you to be explicit when there's potential ambiguity – there's a difference between these two (and the compiler doesn't want to randomly guess which one you mean):

left join accepted on hungup.call_id = accepted.call_id

and

left join accepted on beginning.call_id = accepted.call_id

You can satisfy the compiler by being more explicit:

table output = (
  from hungup
  join side:left beginning [hungup.call_id == beginning.call_id] # this line could probably technically be left unchanged
  join side:left accepted [hungup.call_id == accepted.call_id] # this one needs you to be explicit, as now 2 other tables have call_id
  [...]
)

Edit: fixed the 2 SQL examples 🤦

mklopets avatar Dec 01 '22 14:12 mklopets

Makes sense and works for me. Thanks for the help.

neelance avatar Dec 01 '22 15:12 neelance

Yeah, this is a downside of not using USING (foo) any longer — instead we use ON x.foo = y.foo — but that means that multiple tables can't be joined without being explicit, since USING collapses them into one column, but ON doesn't.

I'll reopen this to see whether others have views. I was a fan of USING but maybe alone in that!

max-sixty avatar Dec 01 '22 23:12 max-sixty

Given that we decide in favor of option 1 from #1193, this should error out.

As @mklopets pointed out there are 3 possible call_id columns which are not necessarily identical - because of side:left, some values in some of the columns may be null. So the distinction between them is important.

That said, maybe we need a better solution for multiple joins. Possibly via referring to the foreign key and using database schema to derive the table name and join filter.

aljazerzen avatar Dec 02 '22 10:12 aljazerzen

Given that we decide in favor of option 1 from #1193, this should error out.

I'm not sure we've decided this yet! :) (Though nod to Do-ocracy...)

As @mklopets pointed out there are 3 possible call_id columns which are not necessarily identical - because of side:left, some values in some of the columns may be null. So the distinction between them is important.

Ah, this is a good point in this example given they are left-joins; I don't think there's a way around that. Though in the case that that they are equi-joins; i.e.:

from hungup
join beginning [==call_id]
join accepted [==call_id]

...then we could use the same column, but I think it relies on using USING, which collapses the columns. The current result for this isn't valid, but the diff here would be:

SELECT
  hungup.*,
  beginning.*,
  accepted.*
FROM
  hungup
-  JOIN beginning ON hungup.call_id = beginning.call_id
-  JOIN accepted ON call_id = accepted.call_id
+  JOIN beginning USING (call_id)
+  JOIN accepted USING (call_id)

That said, maybe we need a better solution for multiple joins. Possibly via referring to the foreign key and using database schema to derive the table name and join filter.

Yes, many of these discussions end up with something like "If we knew the semantic model, we could know how to join the tables, at least in 90% of cases". I don't have a great visualization of what that would look like though, without a big jump in the language. I'm also reminded of https://github.com/prql/prql/issues/723 & https://github.com/prql/prql/issues/716

max-sixty avatar Dec 02 '22 20:12 max-sixty

I meant:

In scenario when we decide in favor of option 1 from https://github.com/prql/prql/issues/1193, this should error out.

aljazerzen avatar Dec 03 '22 09:12 aljazerzen

I regularly have to join 4 or 5 different tables and it's very important to me that the join syntax is as clean as possible. In join syntax, what's most important to me is that column names are specified only once to reduce noise and mistakes/typos.

I'm happy with the current behaviour as described in #1335 (i.e. not generating USING), but I'd like to ask for syntax that makes it easy to reference the table that we just joined.

For example, something like this would be nice

from hungup
join beginning [==call_id]
join accepted [=={last,that}.call_id]

In the above example, last is referring to the table we just joined (accepted). (Yes last is a confusing name we can definitely come up with something better).

The benefit is apparent when you have a long chain of one-to-many relationship you need to join

from grandparent
join parent [==grandparent_id]
join child [=={last,that}.parent_id]
join grandchild [=={last,that}.child_id]

Now as a reader is only presented with the important information (table / column names) only once.

Other ideas

We can also make [==col_name] expand to the name of the table that was most recently joined. Which means the above example becomes just:

from grandparent
join parent [==grandparent_id]
join child [==parent_id]
join grandchild [==child_id]

Which expands to the SQL

select
  ...
from grandparent
join parent on (grandparent.grandparent_id = parent.grandparent_id)
join child on (parent.parent_id = child.parent_id)
join grandchild on (child.child_id = grandchild.child_id)

jatcwang avatar Nov 23 '23 15:11 jatcwang

The issue with having concepts such as last is it makes the pipeline state much more complicated — we have to consider the order of the transforms that came before it, not just the result of the pipeline. The additional complication affects both the compiler and users — e.g. copying & pasting some lines isn't quite as reliable.

I agree something like that would make doing lots of joins easier, though.

Would using USING achieve that too though? At first glance, it seems that would condense these repeated joins too?

max-sixty avatar Dec 22 '23 20:12 max-sixty

I believe that USING is not very well specified in the standard (e.g. https://community.snowflake.com/s/question/0D50Z00008WRZBBSA5/bug-with-join-using-) and the coalesce behavior removes flexibility from the user.

EDIT: Added missing "not"

jatcwang avatar Jan 01 '24 11:01 jatcwang

I believe that USING is very well specified in the standard

The accepted answer for the question you've linked to says that USING is not part of the ANSI standard and discouraged by Snowflake team. We've had a similar experience when trying to achieve the same behavior on different DBMSs.

In any case, we should be discussing "how PRQL does joins", not "how SQL does joins".

Current behavior is indeed pragmatic and tedious, there should be a better way to do it.

aljazerzen avatar Jan 01 '24 15:01 aljazerzen

Current behavior is indeed pragmatic and tedious, there should be a better way to do it.

(typo? Or you mean it is pragmatic?)

max-sixty avatar Jan 01 '24 16:01 max-sixty

The accepted answer for the question you've linked to says that USING is not part of the ANSI standard and discouraged by Snowflake team.

Interesting, good point.

I do think USING has solved so many of my JOIN issues through time, I've even structured my tables to work well with it (e.g. ensuring columns were named the same across tables). But I recognize our views differ, and I do update on the opinion from snowflake at that link.


I do think this is a difficult issue — we can either:

  • make it quite easy iff we know all the columns — i.e. a more complete version of USING
  • ...but if we don't know all the columns, then we can't, since from a | join b (==x) | join c (==y) doesn't know where each column is[^1]

A language with them can be confusing, and it's rare we encounter them. So the current approach doesn't use the information that it often-but-not-always-has.

[^1]: this was an advantage of USING — that the DB knows the columns!

max-sixty avatar Jan 01 '24 17:01 max-sixty

Sorry @aljazerzen yes I meant NOT very well specified 😅 Edited and yes I totally agree with "In any case, we should be discussing "how PRQL does joins", not "how SQL does joins"." but I understand there are various constraints and it isn't easy.

jatcwang avatar Jan 02 '24 10:01 jatcwang