jq Join keys of 2 arrays

Hello, I have a json with 2 arrays with common id's that I'd like to join but I couldn't find a way to do it:

{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
}

Having the above json as input, I want to get a "name" from "descriptions" for each asset in "assets" by joining classid and instanceid from both arrays.

So the desired output would be:

{"assetid":1, "name":"alpha"}
{"assetid":2, "name":"alpha"}
{"assetid":3, "name":"beta"}
{"assetid":4, "name":"gamma"}

The equivalent in SQL would be something like: select assets.assetsid, descriptions.name from descriptions, assets where descriptions.classid=assets.classid and descriptions.instanceid=assets.instanceid;

Sep 11 '17 03:09 tsakistan

This is an interesting question! In the future, please submit any usage questions to StackOverflow's jq tag.

First, we want to go through each asset:

.assets[] as $asset

Then, for each asset, we want to go through the descriptions, and find the matching one(s) for the asset; the argument passed to select here would be akin to the where clause in your SQL statement:

(.descriptions[] | select($asset.classid == .classid and $asset.instanceid == .instanceid)) as $description

Finally, we build our desired objects out of our $asset and $description pair:

{"assetid": $asset.assetid, "name": $description.name}

All together now:

.assets[] as $asset | (
  .descriptions[] | select(
    $asset.classid == .classid and
    $asset.instanceid == .instanceid
  )
) as $description | {
  "assetid": $asset.assetid,
  "name": $description.name
}

Sep 11 '17 12:09 ghost

A way to make it more readable would be to define a join filter that can be invoked like join(.assets.classid == .descriptions.classid and ...) over the original object, which would filter all possible combinations of assets and descriptions by the query and then return objects containing the merged keys of the asset and description pairs that matched the query. Something like:

def join(where): [
  keys[] as $key |
  [{($key): .[$key][]}]
] | combinations | add | select(where) | add;

You can then use this like:

join(
  .assets.classid == .descriptions.classid and
  .assets.instanceid == .descriptions.instanceid
) | {assetid, name}

Sep 11 '17 13:09 ghost

This is exactly what I was looking for. Thank you for the detailed explanation!

Sep 11 '17 15:09 tsakistan

There's a JOIN in master:

# SQL-ish operators here:
def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|
      if type != "string" then tojson
      else .
      end] |= $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def JOIN($idx; idx_expr):
  [.[] | [., $idx[idx_expr]]];
def JOIN($idx; stream; idx_expr):
  stream | [., $idx[idx_expr]];
def JOIN($idx; stream; idx_expr; join_expr):
  stream | [., $idx[idx_expr]] | join_expr;
def IN(s): reduce (first(select(. == s)) | true) as $v (false; if . or $v then true else false end);
def IN(src; s): reduce (src|IN(s)) as $v (false; if . or $v then true else false end);

Dec 11 '17 05:12 nicowilliams

@pkoppstein The JOIN in master is relational.

Dec 11 '17 16:12 nicowilliams

Here are filters for performing relational joins that may be suitable for your "standard library" and/or builtin.jq. Suggestions for improvement are welcome!

These filters emit streams and are therefore named "joins"; this name will hopefully also avoid confusion and/or conflict with the existing filter: join(str).

With the OP data in data.json, the filter:

[.assets, .descriptions] | joins( .instanceid ) | [.assetid, .name]

produces:

[1,"alpha"]
[2,"alpha"]
[3,"beta"]
[4,"gamma"]

The same result is produced by:

joins(.assets[]; .descriptions[]; .instanceid; .instanceid) | [.assetid, .name]

joins/1, joins/2, joins/4, joins/6

Notes:

joins(f) is not simply joins(f;f); rather, it will join arbitrarily many "tables".
if the filter, f, produces null on a row, that row will be ignored.
f can evaluate to any JSON entity (that is, it is not confined to strings and/or integers)
for example, using the OP data, f could be [.instanceid, .classid]
joins/6 provides the core functionality, corresponding closely to the SQL INNER JOIN:

SELECT (Table1|p1), (Table2|p2) FROM Table1 INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)

def distinct(s):
  reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x)
  |.[];

# Relational Join
# joins/6 provides similar functionality to the SQL INNER JOIN statement:
#   SELECT (Table1|p1), (Table2|p2)
#     FROM Table1
#     INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)
# where filter1, filter2, p1 and p2 are filters.

# joins(s1; s2; filter1; filter2; p1; p2)
# s1 and s2 are streams of objects corresponding to rows in Table1 and Table2;
# filter1 and filter2 determine the join criteria;
# p1 and p2 are filters determining the final results.
# Input: ignored
# Output: a stream of distinct pairs [p1, p2]
# Note: items in s1 for which filter1 == null are ignored, otherwise all rows are considered.
#
def joins(s1; s2; filter1; filter2; p1; p2):
  def it: type[0:1] + tostring;
  def ix(s;f):
    reduce s as $x ({};  ($x|f) as $y | if $y == null then . else .[$y|it] += [$x] end);
  # combine two dictionaries using the cartesian product of distinct elements
  def merge:
    .[0] as $d1 | .[1] as $d2
    | ($d1|keys_unsorted[]) as $k
    | if $d2[$k] then distinct($d1[$k][]|p1) as $a | distinct($d2[$k][]|p2) as $b | [$a,$b]
      else empty end;

   [ix(s1; filter1), ix(s2; filter2)] | merge;

def joins(s1; s2; filter1; filter2):
  joins(s1; s2; filter1; filter2; .; .) | add ;

# Input: an array of two arrays of objects
# Output: a stream of the joined objects
def joins(filter1; filter2):
  joins(.[0][]; .[1][]; filter1; filter2);

# Input: an array of arrays of objects.
# Output: a stream of the joined objects where f defines the join criterion.
def joins(f):
  # j/0 is defined so TCO is applicable
  def j:
    if length < 2 then .[][]
    else [[ joins(.[0][]; .[1][]; f; f)]] + .[2:] | j
    end;
   j ;

[EDIT (12/13/2017): the core functionality is now provided by joins/6]

Dec 11 '17 22:12 pkoppstein

These INDEX/JOIN functions help:

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|(type[0:1] + tostring)] |= .+[$row]);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def UNIQUE_INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row |(type[0:1] + tostring)] |=
      if .!= null then "duplicate key error; existing row \(.) conflicts with row \($row)"|error
      else [$row] end);
def UNIQUE_INDEX(idx_expr): UNIQUE_INDEX(.[]; idx_expr);
def LOOKUP($idx; idx_expr): $idx[idx_expr|(type[0:1] + tostring)]|if .==null then null else .[] end;
def JOIN($idx; idx_expr):
  [.[] | LOOKUP($idx; idx_expr) as $row | [., $row]];
def JOIN($idx; stream; idx_expr):
  stream | LOOKUP($idx; idx_expr) as $row | [., $row];
def JOIN($idx; stream; idx_expr; join_expr):
  stream | LOOKUP($idx; idx_expr) as $row | [., $row] | join_expr;

which you can use like this:

./jq -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(INDEX(.assets[]; .assetid); .descriptions[]; .classid; {assetid:(.[1].assetid),name:(.[0].name)})'
{
  "assetid": 1,
  "name": "alpha"
}
{
  "assetid": 2,
  "name": "beta"
}
{
  "assetid": 2,
  "name": "gamma"
}

To break this down:

INDEX(.assets[]; .assetid) produces an index of .assets by .assetid
JOIN/4
- takes an index
- a stream (here .descriptions[])
- an expression (here .classid) to extract a key from that stream that will be compared for equality with the keys from the index
- an expression to merge the two matching rows on the left and and right of the join

Dec 12 '17 05:12 nicowilliams

Incidentally, I've a branch that adds varargs support to jq. That will allow us to have multi-column group- and sort-by functions, and multi-column indices that let INDEX figure out how to encode the keys. And much more too (state machines, a proper switch statement, and so on).

Dec 12 '17 05:12 nicowilliams

@nicowilliams wrote:

JOIN(INDEX(.assets[]; .assetid); .descriptions[]; .classid; {assetid:(.[1].assetid),name:(.[0].name)})' {"assetid": 1, "name": "alpha" } {"assetid": 2, "name": "beta" } {"assetid": 2, "name": "gamma" }

And there's the main problem. As the OP mentioned, the output is supposed to have 4 results: two for "alpha" and one each for "beta" and "gamma".

The other problem, to me at least, is the hairiness of the human interface to JOIN. My joins can be called in the same way as in SQL, i.e. with only direct references to the "tables" in question and to the matching criterion, e.g.

 [.assets, .descriptions] | joins( .instanceid )

Dec 12 '17 11:12 pkoppstein

@pkoppstein If you look at the data, and compute the desired join by hand, you don't get the example result given. The assets object has 5 distinct keys (1..5), while descriptions had two repeated keys (2):

    [
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}
    ]

EDIT: I assume the example desired result was just an example of the form of desired result, not the one that would necessarily result from the given example input data.

Dec 12 '17 15:12 nicowilliams

@nicowilliams - With your JOIN and the following query (which corresponds with the one specified by the OP), the four "expected" results are obtained:

JOIN(INDEX(.assets[]; [.classid,.instanceid]); .descriptions[]; [.classid,.instanceid];
     {assetid:(.[1].assetid),name:(.[0].name)})

$ jq -c -f JOIN.jq  assets-descriptions.json
{"assetid":1,"name":"alpha"}
{"assetid":2,"name":"alpha"}
{"assetid":3,"name":"beta"}
{"assetid":4,"name":"gamma"}

So at least we're on the same page now.

But I would like to see a simpler interface ...

p.s. You could change |= .+ to += in INDEX/2, no?

Dec 12 '17 16:12 pkoppstein

@pkoppstein No, += is kinda broken. There's an issue open for that.

I'll think about simplifying the interface.

Dec 12 '17 16:12 nicowilliams

+= is kinda broken. There's an issue open for that.

Yes, I opened it (#1508). I don't think it's relevant in this particular case.

Dec 12 '17 16:12 pkoppstein

Well, OK, but it hardly makes a difference which I use.

Dec 12 '17 17:12 nicowilliams

a proper switch statement

Can't wait!

FYPI, I've updated the posting that defines the family of joins functions. Core functionality is now provided by joins(s1; s2; filter1; filter2; p1; p2) corresponding very closely to:

SELECT (Table1|p1), (Table2|p2) FROM Table1 INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)

Dec 13 '17 16:12 pkoppstein

@pkoppstein What I think I'll do is remove the SQL-ish functions in master, finish 1.6, then revisit.

One of the issues we have right now is that linking jq programs is quadratic. Every additional builtin significantly slows down jq start up. I've one thing I could commit soon that will help this, but the real thing we must do is make linking linear, and I only have a half-baked attempt at that.

Dec 13 '17 16:12 nicowilliams

remove the SQL-ish functions in master

I agree that JOIN can wait; I would like to see a decent IN (assuming there's no time to fix index/1); and certainly if set/1 makes it in, then INDEX can wait too :-)

One thought on Erik Brinkman's contributions: If you delete all the rand functions in builtin.jq but retain the core functionality, we can have our (random) cake and eat it (without quadratic concerns)!

Dec 13 '17 17:12 pkoppstein

@pkoppstein Everything can wait. What I'd like to do is:

release 1.6
take @erikbrinkman's PR
add your builtins and then some
fix linking
fix bugs not fixed in 1.6
release 1.7

Dec 13 '17 17:12 nicowilliams

I just don't have time to fix linking now, and we really can't be making startup time much worse.

Dec 13 '17 17:12 nicowilliams

@nicowilliams - The thing is that INDEX and especially IN have been quite widely "advertised" and used, so if you drop them both without a replacement, it will hurt. As already mentioned, IN is especially valuable in light of the performance issue with index/1.

Since "master" has been able to bear the weight of the SQLish additions without (to my knowledge) anyone raising performance as an issue, it seems to me that it would be reasonable to:

drop 5 of the SQLish functions (three JOIN and two INDEX)
add set/1, lookup/1, and distinct/1
improve IN
add the rand stuff with NO builtin.jq functions

That way, we'd be down a net of 2 builtins from where we are now in "master".

But, yes, it's time for 1.6 ... so no need to respond to this posting.

Dec 13 '17 18:12 pkoppstein

@pkoppstein - I like the simple interface ! However, I have a need for LEFT JOIN. Could that be supported by your code easily?

Here is what I have come up so far, based on https://stackoverflow.com/questions/39830426/join-two-json-files-based-on-common-key-with-jq-utility-or-alternative-way-from:

def leftJoin(a1; a2; key):
  def akey: key | if type == "string" then . else tojson end;
  # hash phase:
  a1 as $a1 # content of a1 seems to be lost otherwise (memory issue?)
  | a2 | INDEX(key) as $h2
  # join phase:
  | reduce $a1[] as $a
      ( [];
        ($a|akey) as $key
        | . + [ $a + if $h2[$key] then $h2[$key] else {} end ]) ;

leftJoin( $file1; $file2; [.id, .ud, .data])

Jan 19 '18 11:01 caramdache

[EDIT: left_joins/6 was corrected on Jan 21, 2018.]

@caramdache - Here are lightly-tested “LEFT JOIN” filters. I've also revised distinct/1 so that it will handle strings more efficiently.

# Emit a stream of the distinct items in the stream
# (avoid tojson)
def distinct(stream):
   reduce stream as $s ({};
     ($s|type) as $t
     | (if $t == "string" then $s else ($s|tostring) end) as $y
     | .[$t][$y] = [$s] )
   | .[][];

# Left Join
# left_joins/6 provides similar functionality to the SQL LEFT JOIN statement:
#   SELECT (Table1|p1), (Table2|p2)
#     FROM Table1
#     LEFT JOIN Table2 ON (Table1|filter1) = (Table2|filter2)
# where filter1, filter2, p1 and p2 are jq filters.

# left_joins(s1; s2; filter1; filter2; p1; p2)
# s1 and s2 are streams of objects corresponding to rows in Table1 and Table2;
# filter1 and filter2 determine the join criteria;
# p1 and p2 are projections determining the final results.
# Input: ignored
# Output: a stream of distinct pairs [p1, p2]
# Note: items in s1 for which filter1 == null are ignored,
# otherwise all rows are considered.
#
def left_joins(s1; s2; filter1; filter2; p1; p2):
  def it: type[0:1] + tostring;
  def ix(s;f):
    reduce s as $x ({};  ($x|f) as $y | if $y == null then . else .[$y|it] += [$x] end);
  # combine two dictionaries using the cartesian product of distinct elements
  def merge:
    .[0] as $d1 | .[1] as $d2
    | ($d1|keys_unsorted[]) as $k
    | distinct($d1[$k][]|p1) as $a
    | if $d2[$k] then distinct($d2[$k][]|p2) as $b | [$a,$b]
      else [$a, null]
      end ;

   [ix(s1; filter1), ix(s2; filter2)] | merge;

def left_joins(s1; s2; filter1; filter2):
  left_joins(s1; s2; filter1; filter2; .; .) | add ;

# Input: an array of two arrays of objects
# Output: a stream of the joined objects
def left_joins(filter1; filter2):
  left_joins(.[0][]; .[1][]; filter1; filter2);

# Input: an array of arrays of objects.
# Output: a stream of the joined objects where f defines the join criterion.
def left_joins(f):
  # j/0 is defined so TCO is applicable
  def j:
    if length < 2 then .[][]
    else [[ left_joins(.[0][]; .[1][]; f; f)]] + .[2:] | j
    end;
   j ;

Jan 20 '18 04:01 pkoppstein

Thanks @pkoppstein, I'll test this first thing on Monday!

I have some questions in the meantime:

Why the need to use distinct always? In SQL, this is a separate keyword that you can add -or not- to the query and shouldn't it be optional here as well?
I am uncertain as to whether items in s1 for which filter1 == null should always be ignored.
In my use case, I do 2 consecutive left_joins and the objects I am joining to already contain all the fields that I want to obtain as a result. So p1 and p2 are the entire set of fields in my case. How can I specify that using your API? I don't exactly want to specify each field manually because there are over 20 in my case.

Jan 21 '18 07:01 caramdache

@caramdache -

Regarding distinct - first, it seems to me that including distinct is far more useful (not least because otherwise you can easily get counterproductive multiplicative behavior); second, you can omit it if you like :-)

Regarding filter1 == null - same kind of reasoning. Also, a row that is equal tonull (as opposed to a compound entity that contains null) really represents the absence of a "row".

Regarding p1 and p2 - in your case, you will probably want to use left_joins/4. Or you could just specify . for p1 and/or p2. That's the beauty of jq.

Jan 21 '18 08:01 pkoppstein

@pkoppstein, I was thinking distinct and filter1 ==null could be made an option, if we are going to include your code in jq 1.6, which I hope. Regarding p1 and p2, I see. jq feels magical sometimes.

In join/6, isn't there an add missing, which is present in join/4, joing/2 and join/1, or am I missing something obvious?

Jan 21 '18 15:01 caramdache

@caramdache - please note that left_joins/6 has been corrected.

The arity-6 filters are intended to be “low-level building blocks”(*) whereas all the others are intended to be more familiar - that’s why only the arity-4 filters have add explicitly. More documentation and examples are obviously needed regarding this and your other points. Conversely, it would probably help if you could provide some compelling examples to illustrate your concerns.

(*) Notably, the user might wish to define a transformation of the outputs, [$a, $b], that cannot be specified as a function of ($a + $b).

Jan 21 '18 18:01 pkoppstein

@pkoppstein, this seems to work correctly, as far as I can see on a dataset of 1861 items.

In actual usage, the interface feels a little awkward. I have 2 consecutive joins to perform. With my previous leftJoin/3 method, I could simply pipe them through:

leftJoin($t1; $t2; [.f1]) | leftJoin(.; $t3; [.f2, .f3])

Here, it seems that I have to map(add) / slurp the result of the first join, before I can perform the second join. In addition, I have to create the input arrays, which makes the code a little harder to read. But maybe I am missing something obvious.

BTW, I am not quite sure what changes you've made between your 2 edits. Based on what I can remember, the code looks the same.

Jan 22 '18 09:01 caramdache

Thinking more about it, an even more natural syntax seems like:

$t1 | leftJoin($t2; [.f1]) | leftJoin($t3; [.f2, .f3])

Jan 22 '18 11:01 caramdache

@caramdache --

Thanks for testing!
As I mentioned, the arity-6 filters are intended to be "low-level" stream-oriented filters. If any of the others are applicable, it would make sense to use them; likewise, it might make sense to define intermediate-level building blocks on top of them.
jq is fundamentally stream-oriented, so I think that the signatures of the arity-6 filters and the fact that they emit streams are basically appropriate. Please also note that I've included "joins" in their names, first to emphasize their stream-oriented nature, and second to free-up similar names (e.g. left_join).
Assuming the tables $left and $right are presented as arrays, I believe your leftJoin/2 could be implemented as:

def leftJoin($right; filter): [left_joins(.[]; $right[]; filter)];

Jan 22 '18 23:01 pkoppstein

Maybe I've misunderstood, but it seems that the JOIN example above on the original data / original expected output is not correct, or was correct at the time but is not now correct with jq-1.6. That example is (approximately)

jq -c -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(
	INDEX(.assets[]; [.classid,.instanceid]); 
	.descriptions[]; 
	[.classid,.instanceid]|tostring;
 	{assetid:(.[1].assetid),name:(.[0].name)}
)'

which with jq-1.6 produces:

{"assetid":2,"name":"alpha"}
{"assetid":3,"name":"beta"}
{"assetid":4,"name":"gamma"}

but the original post expected:

{"assetid":1,"name":"alpha"}
{"assetid":2,"name":"alpha"}
{"assetid":3,"name":"beta"}
{"assetid":4,"name":"gamma"}

I think the issue is that the INDEX expression 'loses' one of the two rows that have "classid":1 and "isntanceid":1, as discussed here.

I think if the index is built on descriptions rather than on assets we can produce the original expected results:

jq -c -n '{
    "assets":[
            {"assetid":1, "classid":1, "instanceid":5},
            {"assetid":2, "classid":1, "instanceid":5},
            {"assetid":3, "classid":2, "instanceid":3},
            {"assetid":4, "classid":2, "instanceid":1}],
    "descriptions":[
            {"classid":1, "instanceid": 5, "name":"alpha"},
            {"classid":2, "instanceid": 3, "name":"beta"},
            {"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(
	INDEX(.descriptions[]; [.classid, .instanceid]); 
	.assets[];
	[.classid,.instanceid]|tostring;
	{assetid:(.[0].assetid),name:(.[1].name)}
)'

Am I missing something?

Thanks,

Andy

Nov 13 '20 22:11 andrew-schweitzer-analog