zed icon indicating copy to clipboard operation
zed copied to clipboard

join: In absence of a field-list, merge all fields from left & right hand sides

Open philrz opened this issue 3 years ago • 3 comments

Repro is with Zed commit 793260a.

This is one we'd discussed internally but hadn't yet added. However, a community beta tester just reported having been confused by its absence, hence putting it on the to-do list.

Start from the "inner" example shown in https://github.com/brimdata/zed/blob/main/proc/join/ztests/kinds.yaml. If we drop the hit:=sb portion, the output consists of only the fields from the left-hand record that had a successful match.

$ zq -version
Version: v0.29.0-391-g793260a1

$ cat A.zson 
{a:10,sa:"a0"}
{a:15,sa:"a8"}
{a:20,sa:"a1"}
{a:25,sa:"a7"}
{a:30,sa:"a2"}
{a:35,sa:"a6"}
{a:40,sa:"a3"}
{a:45,sa:"a5"}
{a:50,sa:"a4"}

$ cat B.zson 
{b:20,sb:"b20.1"}
{b:20,sb:"b20.2"}
{b:40,sb:"b40"}
{b:40,sb:"b40.2"}
{b:60,sb:"b60"}

$  zq -z 'inner join on a=b | sort a' A.zson B.zson
{a:20,sa:"a1"}
{a:20,sa:"a1"}
{a:40,sa:"a3"}
{a:40,sa:"a3"}

However, the user was expecting the totality of all fields from both sides for when the match succeeded, hence they were expecting the inclusion of the sb values in this example. So the proposal is that in the absence of an explicit field-list, this would become the default behavior.

This new behavior comes in handy in at least two ways:

  1. If there's a lot of fields from the right-hand side the user wants to preserve. They'd otherwise be in for a lot of typing, with plenty of opportunities to make small typos and miss data.
  2. If the right-hand record types vary such that the user doesn't even know the complete set of possible field names, but they know they want "all the data" in the output, whatever it might be.

Maintaining the ability to selectively cut a subset of fields is still very useful, though, so I'm not advocating removing what's already there.

FWIW, this does seem close to a SQL equivalent. For instance, if you examine https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_join_inner (which is the top "Try it Yourself" linked from https://www.w3schools.com/sql/sql_join_inner.asp)) the SELECT Orders.OrderID, Customers.CustomerName they start out with seems equivalent to the "cut expression" part of the Zed join, and if you run it, indeed you get only those two columns. However if you change it to SELECT * now you get them all. I'm not sure if we want to use a * in place the the cut expression or something else, though. :man_shrugging:

philrz avatar Jun 21 '21 23:06 philrz