zed icon indicating copy to clipboard operation
zed copied to clipboard

join: In absence of a field-list, merge all fields from left & right hand sides

Open philrz opened this issue 4 years ago • 4 comments

Repro is with Zed commit 793260a.

This is one we'd discussed internally but hadn't yet added. However, a community beta tester just reported having been confused by its absence, hence putting it on the to-do list.

Start from the "inner" example shown in https://github.com/brimdata/super/blob/793260a1e7b68cfdb4cea33e264c73caba0a0689/proc/join/ztests/kinds.yaml. If we drop the hit:=sb portion, the output consists of only the fields from the left-hand record that had a successful match.

$ zq -version
Version: v0.29.0-391-g793260a1

$ cat A.zson 
{a:10,sa:"a0"}
{a:15,sa:"a8"}
{a:20,sa:"a1"}
{a:25,sa:"a7"}
{a:30,sa:"a2"}
{a:35,sa:"a6"}
{a:40,sa:"a3"}
{a:45,sa:"a5"}
{a:50,sa:"a4"}

$ cat B.zson 
{b:20,sb:"b20.1"}
{b:20,sb:"b20.2"}
{b:40,sb:"b40"}
{b:40,sb:"b40.2"}
{b:60,sb:"b60"}

$  zq -z 'inner join on a=b | sort a' A.zson B.zson
{a:20,sa:"a1"}
{a:20,sa:"a1"}
{a:40,sa:"a3"}
{a:40,sa:"a3"}

However, the user was expecting the totality of all fields from both sides for when the match succeeded, hence they were expecting the inclusion of the sb values in this example. So the proposal is that in the absence of an explicit field-list, this would become the default behavior.

This new behavior comes in handy in at least two ways:

  1. If there's a lot of fields from the right-hand side the user wants to preserve. They'd otherwise be in for a lot of typing, with plenty of opportunities to make small typos and miss data.
  2. If the right-hand record types vary such that the user doesn't even know the complete set of possible field names, but they know they want "all the data" in the output, whatever it might be.

Maintaining the ability to selectively cut a subset of fields is still very useful, though, so I'm not advocating removing what's already there.

FWIW, this does seem close to a SQL equivalent. For instance, if you examine https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_join_inner (which is the top "Try it Yourself" linked from https://www.w3schools.com/sql/sql_join_inner.asp)) the SELECT Orders.OrderID, Customers.CustomerName they start out with seems equivalent to the "cut expression" part of the Zed join, and if you run it, indeed you get only those two columns. However if you change it to SELECT * now you get them all. I'm not sure if we want to use a * in place the the cut expression or something else, though. :man_shrugging:

philrz avatar Jun 21 '21 23:06 philrz

While we're waiting on this, @mccanne reminded me that something like this is possible today if you're ok adding the right-hand-side fields as an embedded record. Following on the example shown above:

$ zq -z 'inner join on a=b rhs:=this | sort a' A.zson B.zson
{a:20,sa:"a1",rhs:{b:20,sb:"b20.1"}}
{a:20,sa:"a1",rhs:{b:20,sb:"b20.2"}}
{a:40,sa:"a3",rhs:{b:40,sb:"b40"}}
{a:40,sa:"a3",rhs:{b:40,sb:"b40.2"}}

@mccanne notes that the challenge on doing the "merged" variant might be in dealing with the corner cases, e.g. what to do if the left-hand table has fields with the same names as the ones in the right-hand table. Do we overwrite? Make a separate field with _1 appended to its name? etc.

philrz avatar Jun 22 '21 20:06 philrz

The Zed docs published in #2836 cited this issue as a current limitation. When this issue is addressed, the docs should be updated.

A join test added in #2901 is also affected by this limitation and should be updated when this gets addressed.

philrz avatar Jul 01 '21 22:07 philrz

The Including the entire opposite record section of the Join Tutorial provides relevant detail for approaching this with the current implementation.

philrz avatar Aug 07 '24 19:08 philrz

It looks like the new join implementation added in #5962 has the flexibility that provides the functionality originally sought here.

Details

Taking a step back, here's the input data and updated syntax for the join originally shown in the ztest linked above, updated to run in super commit f955dc9 that was right before the merge of the new join implementation in #5962.

$ cat A.sup
{a:10,sa:"a0"}
{a:15,sa:"a8"}
{a:20,sa:"a1"}
{a:25,sa:"a7"}
{a:30,sa:"a2"}
{a:35,sa:"a6"}
{a:40,sa:"a3"}
{a:45,sa:"a5"}
{a:50,sa:"a4"}

$ cat B.sup
{b:20,sb:"b20.1"}
{b:20,sb:"b20.2"}
{b:40,sb:"b40"}
{b:40,sb:"b40.2"}
{b:60,sb:"b60"}

$ super -version
Version: f955dc9bf

$ super -c "
from A.sup
| inner join (
  from B.sup
) on a=b hit:=sb
| sort a"

{a:20,sa:"a1",hit:"b20.1"}
{a:20,sa:"a1",hit:"b20.2"}
{a:40,sa:"a3",hit:"b40"}
{a:40,sa:"a3",hit:"b40.2"}

Revisiting the user's original problem, when we'd drop the hit:=sb assignment to copy a value over from the right-hand input into the final assembled output record from join, now we see only the fields from the left-hand record that had a successful match.

$ super -c "
from A.sup
| inner join (
  from B.sup
) on a=b        
| sort a"

{a:20,sa:"a1"}
{a:20,sa:"a1"}
{a:40,sa:"a3"}
{a:40,sa:"a3"}

Now we'll advance to super commit f86de86 that includes the new join implementation from #5962. In the new join implementation, the "assignment" field-list portion no longer exists, and instead the left/right portions of the successful match appear as separate embedded records, e.g.,

$ super -version
Version: f86de86d5

$ super -c "
from A.sup
| inner join (
  from B.sup
) on left.a=right.b"

{left:{a:20,sa:"a1"},right:{b:20,sb:"b20.1"}}
{left:{a:20,sa:"a1"},right:{b:20,sb:"b20.2"}}
{left:{a:40,sa:"a3"},right:{b:40,sb:"b40"}}
{left:{a:40,sa:"a3"},right:{b:40,sb:"b40.2"}}

Like this issue's title states, "all fields from left & right hand sides" are present, so it's up to the user if/how they want to "merge" them into a single top-level record if that's their preference. If they want "all fields" as the issue's title states, a spread does the trick. As you can see, there's no need to know the names of specific fields on the right-hand side and explicitly copy them over: The spread of ...right just gets everything.

$ super -c "
from A.sup
| inner join (
  from B.sup
) on left.a=right.b
| values {...left, ...right}
| sort a"

{a:20,sa:"a1",b:20,sb:"b20.1"}
{a:20,sa:"a1",b:20,sb:"b20.2"}
{a:40,sa:"a3",b:40,sb:"b40"}
{a:40,sa:"a3",b:40,sb:"b40.2"}

Of course, depending on the nature of the data, this simple approach may not be desirable since by the nature of spread if there's fields of the same name in both left and right (not the case here, but could be elsewhere), the ones from right would overwrite the ones in left. If the user knows which side they want to "win" when there's a collision, they could change the spread, e.g., {...right, ...left} here would make it so when there's a collision the ones from inside left would "win". And finally, if they want to add logic to detect possible collisions and cook up unique field names ending in _1 or wrap collisions in error values, that's also possible and shown in separate issue #4760.

I'll hold this one open for some peer review to make sure I'm not missing any other subtleties, but if it passes muster I'll close it out.

philrz avatar Jul 03 '25 00:07 philrz