Join keys of 2 arrays
Hello, I have a json with 2 arrays with common id's that I'd like to join but I couldn't find a way to do it:
{
"assets":[
{"assetid":1, "classid":1, "instanceid":5},
{"assetid":2, "classid":1, "instanceid":5},
{"assetid":3, "classid":2, "instanceid":3},
{"assetid":4, "classid":2, "instanceid":1}],
"descriptions":[
{"classid":1, "instanceid": 5, "name":"alpha"},
{"classid":2, "instanceid": 3, "name":"beta"},
{"classid":2, "instanceid": 1, "name":"gamma"}]
}
Having the above json as input, I want to get a "name" from "descriptions" for each asset in "assets" by joining classid and instanceid from both arrays.
So the desired output would be:
{"assetid":1, "name":"alpha"}
{"assetid":2, "name":"alpha"}
{"assetid":3, "name":"beta"}
{"assetid":4, "name":"gamma"}
The equivalent in SQL would be something like:
select assets.assetsid, descriptions.name from descriptions, assets where descriptions.classid=assets.classid and descriptions.instanceid=assets.instanceid;
This is an interesting question! In the future, please submit any usage questions to StackOverflow's jq tag.
First, we want to go through each asset:
.assets[] as $asset
Then, for each asset, we want to go through the descriptions, and find the matching one(s) for the asset; the argument passed to select here would be akin to the where clause in your SQL statement:
(.descriptions[] | select($asset.classid == .classid and $asset.instanceid == .instanceid)) as $description
Finally, we build our desired objects out of our $asset and $description pair:
{"assetid": $asset.assetid, "name": $description.name}
All together now:
.assets[] as $asset | (
.descriptions[] | select(
$asset.classid == .classid and
$asset.instanceid == .instanceid
)
) as $description | {
"assetid": $asset.assetid,
"name": $description.name
}
A way to make it more readable would be to define a join filter that can be invoked like join(.assets.classid == .descriptions.classid and ...) over the original object, which would filter all possible combinations of assets and descriptions by the query and then return objects containing the merged keys of the asset and description pairs that matched the query. Something like:
def join(where): [
keys[] as $key |
[{($key): .[$key][]}]
] | combinations | add | select(where) | add;
You can then use this like:
join(
.assets.classid == .descriptions.classid and
.assets.instanceid == .descriptions.instanceid
) | {assetid, name}
This is exactly what I was looking for. Thank you for the detailed explanation!
There's a JOIN in master:
# SQL-ish operators here:
def INDEX(stream; idx_expr):
reduce stream as $row ({};
.[$row|idx_expr|
if type != "string" then tojson
else .
end] |= $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def JOIN($idx; idx_expr):
[.[] | [., $idx[idx_expr]]];
def JOIN($idx; stream; idx_expr):
stream | [., $idx[idx_expr]];
def JOIN($idx; stream; idx_expr; join_expr):
stream | [., $idx[idx_expr]] | join_expr;
def IN(s): reduce (first(select(. == s)) | true) as $v (false; if . or $v then true else false end);
def IN(src; s): reduce (src|IN(s)) as $v (false; if . or $v then true else false end);
@pkoppstein The JOIN in master is relational.
Here are filters for performing relational joins that may be suitable for your "standard library" and/or builtin.jq. Suggestions for improvement are welcome!
These filters emit streams and are therefore named "joins"; this name will hopefully also avoid confusion and/or conflict with the existing filter: join(str).
With the OP data in data.json, the filter:
[.assets, .descriptions] | joins( .instanceid ) | [.assetid, .name]
produces:
[1,"alpha"]
[2,"alpha"]
[3,"beta"]
[4,"gamma"]
The same result is produced by:
joins(.assets[]; .descriptions[]; .instanceid; .instanceid) | [.assetid, .name]
joins/1, joins/2, joins/4, joins/6
Notes:
-
joins(f)is not simplyjoins(f;f); rather, it will join arbitrarily many "tables". -
if the filter, f, produces
nullon arow, that row will be ignored. -
f can evaluate to any JSON entity (that is, it is not confined to strings and/or integers)
-
for example, using the OP data, f could be [.instanceid, .classid]
-
joins/6 provides the core functionality, corresponding closely to the SQL INNER JOIN:
SELECT (Table1|p1), (Table2|p2) FROM Table1 INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)
def distinct(s):
reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x)
|.[];
# Relational Join
# joins/6 provides similar functionality to the SQL INNER JOIN statement:
# SELECT (Table1|p1), (Table2|p2)
# FROM Table1
# INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)
# where filter1, filter2, p1 and p2 are filters.
# joins(s1; s2; filter1; filter2; p1; p2)
# s1 and s2 are streams of objects corresponding to rows in Table1 and Table2;
# filter1 and filter2 determine the join criteria;
# p1 and p2 are filters determining the final results.
# Input: ignored
# Output: a stream of distinct pairs [p1, p2]
# Note: items in s1 for which filter1 == null are ignored, otherwise all rows are considered.
#
def joins(s1; s2; filter1; filter2; p1; p2):
def it: type[0:1] + tostring;
def ix(s;f):
reduce s as $x ({}; ($x|f) as $y | if $y == null then . else .[$y|it] += [$x] end);
# combine two dictionaries using the cartesian product of distinct elements
def merge:
.[0] as $d1 | .[1] as $d2
| ($d1|keys_unsorted[]) as $k
| if $d2[$k] then distinct($d1[$k][]|p1) as $a | distinct($d2[$k][]|p2) as $b | [$a,$b]
else empty end;
[ix(s1; filter1), ix(s2; filter2)] | merge;
def joins(s1; s2; filter1; filter2):
joins(s1; s2; filter1; filter2; .; .) | add ;
# Input: an array of two arrays of objects
# Output: a stream of the joined objects
def joins(filter1; filter2):
joins(.[0][]; .[1][]; filter1; filter2);
# Input: an array of arrays of objects.
# Output: a stream of the joined objects where f defines the join criterion.
def joins(f):
# j/0 is defined so TCO is applicable
def j:
if length < 2 then .[][]
else [[ joins(.[0][]; .[1][]; f; f)]] + .[2:] | j
end;
j ;
[EDIT (12/13/2017): the core functionality is now provided by joins/6]
These INDEX/JOIN functions help:
def INDEX(stream; idx_expr):
reduce stream as $row ({};
.[$row|idx_expr|(type[0:1] + tostring)] |= .+[$row]);
def INDEX(idx_expr): INDEX(.[]; idx_expr);
def UNIQUE_INDEX(stream; idx_expr):
reduce stream as $row ({};
.[$row |(type[0:1] + tostring)] |=
if .!= null then "duplicate key error; existing row \(.) conflicts with row \($row)"|error
else [$row] end);
def UNIQUE_INDEX(idx_expr): UNIQUE_INDEX(.[]; idx_expr);
def LOOKUP($idx; idx_expr): $idx[idx_expr|(type[0:1] + tostring)]|if .==null then null else .[] end;
def JOIN($idx; idx_expr):
[.[] | LOOKUP($idx; idx_expr) as $row | [., $row]];
def JOIN($idx; stream; idx_expr):
stream | LOOKUP($idx; idx_expr) as $row | [., $row];
def JOIN($idx; stream; idx_expr; join_expr):
stream | LOOKUP($idx; idx_expr) as $row | [., $row] | join_expr;
which you can use like this:
./jq -n '{
"assets":[
{"assetid":1, "classid":1, "instanceid":5},
{"assetid":2, "classid":1, "instanceid":5},
{"assetid":3, "classid":2, "instanceid":3},
{"assetid":4, "classid":2, "instanceid":1}],
"descriptions":[
{"classid":1, "instanceid": 5, "name":"alpha"},
{"classid":2, "instanceid": 3, "name":"beta"},
{"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(INDEX(.assets[]; .assetid); .descriptions[]; .classid; {assetid:(.[1].assetid),name:(.[0].name)})'
{
"assetid": 1,
"name": "alpha"
}
{
"assetid": 2,
"name": "beta"
}
{
"assetid": 2,
"name": "gamma"
}
To break this down:
INDEX(.assets[]; .assetid)produces an index of .assets by .assetidJOIN/4- takes an index
- a stream (here
.descriptions[]) - an expression (here
.classid) to extract a key from that stream that will be compared for equality with the keys from the index - an expression to merge the two matching rows on the left and and right of the join
Incidentally, I've a branch that adds varargs support to jq. That will allow us to have multi-column group- and sort-by functions, and multi-column indices that let INDEX figure out how to encode the keys. And much more too (state machines, a proper switch statement, and so on).
@nicowilliams wrote:
JOIN(INDEX(.assets[]; .assetid); .descriptions[]; .classid; {assetid:(.[1].assetid),name:(.[0].name)})' {"assetid": 1, "name": "alpha" } {"assetid": 2, "name": "beta" } {"assetid": 2, "name": "gamma" }
And there's the main problem. As the OP mentioned, the output is supposed to have 4 results: two for "alpha" and one each for "beta" and "gamma".
The other problem, to me at least, is the hairiness of the human interface to JOIN. My joins can be called in the same way as in SQL, i.e. with only direct references to the "tables" in question and to the matching criterion, e.g.
[.assets, .descriptions] | joins( .instanceid )
@pkoppstein If you look at the data, and compute the desired join by hand, you don't get the example result given. The assets object has 5 distinct keys (1..5), while descriptions had two repeated keys (2):
[
{"classid":1, "instanceid": 5, "name":"alpha"},
{"classid":2, "instanceid": 3, "name":"beta"},
{"classid":2, "instanceid": 1, "name":"gamma"}
]
EDIT: I assume the example desired result was just an example of the form of desired result, not the one that would necessarily result from the given example input data.
@nicowilliams - With your JOIN and the following query (which corresponds with the one specified by the OP), the four "expected" results are obtained:
JOIN(INDEX(.assets[]; [.classid,.instanceid]); .descriptions[]; [.classid,.instanceid];
{assetid:(.[1].assetid),name:(.[0].name)})
$ jq -c -f JOIN.jq assets-descriptions.json
{"assetid":1,"name":"alpha"}
{"assetid":2,"name":"alpha"}
{"assetid":3,"name":"beta"}
{"assetid":4,"name":"gamma"}
So at least we're on the same page now.
But I would like to see a simpler interface ...
p.s. You could change |= .+ to += in INDEX/2, no?
@pkoppstein No, += is kinda broken. There's an issue open for that.
I'll think about simplifying the interface.
+= is kinda broken. There's an issue open for that.
Yes, I opened it (#1508). I don't think it's relevant in this particular case.
Well, OK, but it hardly makes a difference which I use.
a proper switch statement
Can't wait!
FYPI, I've updated the posting that defines the family of joins functions. Core functionality is now provided by joins(s1; s2; filter1; filter2; p1; p2) corresponding very closely to:
SELECT (Table1|p1), (Table2|p2) FROM Table1 INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)
@pkoppstein What I think I'll do is remove the SQL-ish functions in master, finish 1.6, then revisit.
One of the issues we have right now is that linking jq programs is quadratic. Every additional builtin significantly slows down jq start up. I've one thing I could commit soon that will help this, but the real thing we must do is make linking linear, and I only have a half-baked attempt at that.
remove the SQL-ish functions in master
I agree that JOIN can wait; I would like to see a decent IN (assuming there's no time to fix index/1); and certainly if set/1 makes it in, then INDEX can wait too :-)
One thought on Erik Brinkman's contributions: If you delete all the rand functions in builtin.jq but retain the core functionality, we can have our (random) cake and eat it (without quadratic concerns)!
@pkoppstein Everything can wait. What I'd like to do is:
- release 1.6
- take @erikbrinkman's PR
- add your builtins and then some
- fix linking
- fix bugs not fixed in 1.6
- release 1.7
I just don't have time to fix linking now, and we really can't be making startup time much worse.
@nicowilliams - The thing is that INDEX and especially IN have been quite widely "advertised" and used, so if you drop them both without a replacement, it will hurt. As already mentioned, IN is especially valuable in light of the performance issue with index/1.
Since "master" has been able to bear the weight of the SQLish additions without (to my knowledge) anyone raising performance as an issue, it seems to me that it would be reasonable to:
- drop 5 of the SQLish functions (three JOIN and two INDEX)
- add set/1, lookup/1, and distinct/1
- improve IN
- add the rand stuff with NO builtin.jq functions
That way, we'd be down a net of 2 builtins from where we are now in "master".
But, yes, it's time for 1.6 ... so no need to respond to this posting.
@pkoppstein - I like the simple interface ! However, I have a need for LEFT JOIN. Could that be supported by your code easily?
Here is what I have come up so far, based on https://stackoverflow.com/questions/39830426/join-two-json-files-based-on-common-key-with-jq-utility-or-alternative-way-from:
def leftJoin(a1; a2; key):
def akey: key | if type == "string" then . else tojson end;
# hash phase:
a1 as $a1 # content of a1 seems to be lost otherwise (memory issue?)
| a2 | INDEX(key) as $h2
# join phase:
| reduce $a1[] as $a
( [];
($a|akey) as $key
| . + [ $a + if $h2[$key] then $h2[$key] else {} end ]) ;
leftJoin( $file1; $file2; [.id, .ud, .data])
[EDIT: left_joins/6 was corrected on Jan 21, 2018.]
@caramdache - Here are lightly-tested “LEFT JOIN” filters. I've also revised distinct/1 so that it will handle strings more efficiently.
# Emit a stream of the distinct items in the stream
# (avoid tojson)
def distinct(stream):
reduce stream as $s ({};
($s|type) as $t
| (if $t == "string" then $s else ($s|tostring) end) as $y
| .[$t][$y] = [$s] )
| .[][];
# Left Join
# left_joins/6 provides similar functionality to the SQL LEFT JOIN statement:
# SELECT (Table1|p1), (Table2|p2)
# FROM Table1
# LEFT JOIN Table2 ON (Table1|filter1) = (Table2|filter2)
# where filter1, filter2, p1 and p2 are jq filters.
# left_joins(s1; s2; filter1; filter2; p1; p2)
# s1 and s2 are streams of objects corresponding to rows in Table1 and Table2;
# filter1 and filter2 determine the join criteria;
# p1 and p2 are projections determining the final results.
# Input: ignored
# Output: a stream of distinct pairs [p1, p2]
# Note: items in s1 for which filter1 == null are ignored,
# otherwise all rows are considered.
#
def left_joins(s1; s2; filter1; filter2; p1; p2):
def it: type[0:1] + tostring;
def ix(s;f):
reduce s as $x ({}; ($x|f) as $y | if $y == null then . else .[$y|it] += [$x] end);
# combine two dictionaries using the cartesian product of distinct elements
def merge:
.[0] as $d1 | .[1] as $d2
| ($d1|keys_unsorted[]) as $k
| distinct($d1[$k][]|p1) as $a
| if $d2[$k] then distinct($d2[$k][]|p2) as $b | [$a,$b]
else [$a, null]
end ;
[ix(s1; filter1), ix(s2; filter2)] | merge;
def left_joins(s1; s2; filter1; filter2):
left_joins(s1; s2; filter1; filter2; .; .) | add ;
# Input: an array of two arrays of objects
# Output: a stream of the joined objects
def left_joins(filter1; filter2):
left_joins(.[0][]; .[1][]; filter1; filter2);
# Input: an array of arrays of objects.
# Output: a stream of the joined objects where f defines the join criterion.
def left_joins(f):
# j/0 is defined so TCO is applicable
def j:
if length < 2 then .[][]
else [[ left_joins(.[0][]; .[1][]; f; f)]] + .[2:] | j
end;
j ;
Thanks @pkoppstein, I'll test this first thing on Monday!
I have some questions in the meantime:
- Why the need to use
distinctalways? In SQL, this is a separate keyword that you can add -or not- to the query and shouldn't it be optional here as well? - I am uncertain as to whether items in
s1for whichfilter1 == nullshould always be ignored. - In my use case, I do 2 consecutive left_joins and the objects I am joining to already contain all the fields that I want to obtain as a result. So
p1andp2are the entire set of fields in my case. How can I specify that using your API? I don't exactly want to specify each field manually because there are over 20 in my case.
@caramdache -
Regarding distinct - first, it seems to me that including distinct is far more useful (not least because otherwise you can easily get counterproductive multiplicative behavior); second, you can omit it if you like :-)
Regarding filter1 == null - same kind of reasoning. Also, a row that is equal tonull (as opposed to a compound entity that contains null) really represents the absence of a "row".
Regarding p1 and p2 - in your case, you will probably want to use left_joins/4. Or you could just specify . for p1 and/or p2. That's the beauty of jq.
@pkoppstein, I was thinking distinct and filter1 ==null could be made an option, if we are going to include your code in jq 1.6, which I hope. Regarding p1 and p2, I see. jq feels magical sometimes.
In join/6, isn't there an add missing, which is present in join/4, joing/2 and join/1, or am I missing something obvious?
@caramdache - please note that left_joins/6 has been corrected.
The arity-6 filters are intended to be “low-level building blocks”(*) whereas all the others are intended to be more familiar - that’s why only the arity-4 filters have add explicitly. More documentation and examples are obviously needed regarding this and your other points. Conversely, it would probably help if you could provide some compelling examples to illustrate your concerns.
(*) Notably, the user might wish to define a transformation of the outputs, [$a, $b], that cannot be specified as a function of ($a + $b).
@pkoppstein, this seems to work correctly, as far as I can see on a dataset of 1861 items.
In actual usage, the interface feels a little awkward. I have 2 consecutive joins to perform. With my previous leftJoin/3 method, I could simply pipe them through:
leftJoin($t1; $t2; [.f1]) | leftJoin(.; $t3; [.f2, .f3])
Here, it seems that I have to map(add) / slurp the result of the first join, before I can perform the second join. In addition, I have to create the input arrays, which makes the code a little harder to read. But maybe I am missing something obvious.
BTW, I am not quite sure what changes you've made between your 2 edits. Based on what I can remember, the code looks the same.
Thinking more about it, an even more natural syntax seems like:
$t1 | leftJoin($t2; [.f1]) | leftJoin($t3; [.f2, .f3])
@caramdache --
-
Thanks for testing!
-
As I mentioned, the arity-6 filters are intended to be "low-level" stream-oriented filters. If any of the others are applicable, it would make sense to use them; likewise, it might make sense to define intermediate-level building blocks on top of them.
-
jq is fundamentally stream-oriented, so I think that the signatures of the arity-6 filters and the fact that they emit streams are basically appropriate. Please also note that I've included "joins" in their names, first to emphasize their stream-oriented nature, and second to free-up similar names (e.g. left_join).
-
Assuming the tables $left and $right are presented as arrays, I believe your leftJoin/2 could be implemented as:
def leftJoin($right; filter): [left_joins(.[]; $right[]; filter)];
Maybe I've misunderstood, but it seems that the JOIN example above on the original data / original expected output is not correct, or was correct at the time but is not now correct with jq-1.6. That example is (approximately)
jq -c -n '{
"assets":[
{"assetid":1, "classid":1, "instanceid":5},
{"assetid":2, "classid":1, "instanceid":5},
{"assetid":3, "classid":2, "instanceid":3},
{"assetid":4, "classid":2, "instanceid":1}],
"descriptions":[
{"classid":1, "instanceid": 5, "name":"alpha"},
{"classid":2, "instanceid": 3, "name":"beta"},
{"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(
INDEX(.assets[]; [.classid,.instanceid]);
.descriptions[];
[.classid,.instanceid]|tostring;
{assetid:(.[1].assetid),name:(.[0].name)}
)'
which with jq-1.6 produces:
{"assetid":2,"name":"alpha"}
{"assetid":3,"name":"beta"}
{"assetid":4,"name":"gamma"}
but the original post expected:
{"assetid":1,"name":"alpha"}
{"assetid":2,"name":"alpha"}
{"assetid":3,"name":"beta"}
{"assetid":4,"name":"gamma"}
I think the issue is that the INDEX expression 'loses' one of the two rows that have "classid":1 and "isntanceid":1, as discussed here.
I think if the index is built on descriptions rather than on assets we can produce the original expected results:
jq -c -n '{
"assets":[
{"assetid":1, "classid":1, "instanceid":5},
{"assetid":2, "classid":1, "instanceid":5},
{"assetid":3, "classid":2, "instanceid":3},
{"assetid":4, "classid":2, "instanceid":1}],
"descriptions":[
{"classid":1, "instanceid": 5, "name":"alpha"},
{"classid":2, "instanceid": 3, "name":"beta"},
{"classid":2, "instanceid": 1, "name":"gamma"}]
} | JOIN(
INDEX(.descriptions[]; [.classid, .instanceid]);
.assets[];
[.classid,.instanceid]|tostring;
{assetid:(.[0].assetid),name:(.[1].name)}
)'
Am I missing something?
Thanks,
Andy