jq icon indicating copy to clipboard operation
jq copied to clipboard

Filter object to specific keys

Open Zirak opened this issue 4 years ago • 20 comments

Hello! Let's imagine the following scenario, where given an object like:

{
    "a": 1,
    "b": 2,
    "c": 3,
    "d": 4
}

We wish to extract only a set of keys, for example only a and c, giving us the following output:

{
    "a": 1,
    "c": 3
}

There are three immediate solutions I see to this:

  1. Explicitly extract these set of keys, e.g.:
{ a: .a, c: .c }

This works but scales less well. In my scenario, there are quite a few of such keys which were quite long and nested inside sub-objects, looking something like:

{ path: { to: { object: { a: .path.to.object.a, c: .path.to.object.c, e: .path.to.object.e, fooblorg: .path.to.object.fooblorg } } } }

Which isn't the nicest thing in the world.

  1. select on the object entries, e.g.:
to_entries | [.[] | select(.key | in(["a", "c"]))] | from_entries

A very interesting alternative, scaling to sub-objects and larger sets of keys. However the reader with the jq interpreter in his mind will immediately shout, just like jq does:

jq: error (at <stdin>:5): Cannot check whether array has a string key

So, we have to supply a "truth map" to in, like so:

to_entries | [.[] | select(.key | in({ a: 1, c: 1 }))] | from_entries

A bit less nice, but still legible. It would however be great to have an operator which searches for a needle in a haystack, the array equivalent for the internal jv_object_has function.

  1. Filter an object given a specific keys There doesn't seem to be such a builtin in jq, which in all fairness makes sense: There's no such builtin in javascript either. In js it would be implemented just like solution (2), i.e. Object.fromEntries(Object.entries(...).filter(...))

Having said that, many toolkits provide such a method:

Anyway, I've rambled long enough. I guess this entire blob could be condensed into

tl;dr

  • It could be useful to have a "value contained in array" filter
  • It could be useful to have a "pick the following keys from an object" filter

What do you think? Thank you for your time.

Zirak avatar Feb 19 '20 08:02 Zirak

Interesting issue. Firstly I wrote a simple implementation using reduce.

def pick($paths):
  . as $root | reduce $paths[] as $path ({}; . + { ($path): $root[$path] });
 % echo '{"a":1,"b":2,"c":3,"d":4}' | jq 'def pick($paths): . as $root | reduce $paths[] as $path ({}; . + { ($path): $root[$path] }); pick(["a", "c"])'
{
  "a": 1,
  "c": 3
}

However this does not allow to pick a value from nested object.

> _.pick({'a':{'x': 1, 'y':2},'b':2, 'c': 3}, [['a', 'x'], 'b'])
{a: {x: 1}, b: 2}

Here's a version with nested keys support ([.]|flatten(1) is dirty, I think there is better solution).

def pick($paths):
  . as $root | reduce ($paths[]|[.]|flatten(1)) as $path ({}; . + setpath($path; $root|getpath($path)));
 % echo '{"a":{"x":1,"y":2},"b":2,"c":3,"d":4}' | jq 'def pick($paths): . as $root | reduce ($paths[]|[.]|flatten(1)) as $path ({}; . + setpath($path; $root|getpath($path))); pick([["a","x"], "b"])'
{
  "a": {
    "x": 1
  },
  "b": 2
}

Renaming problem is that we should skip non-existing paths or not.

itchyny avatar Feb 19 '20 08:02 itchyny

There are at least as many approaches to querying as there are ways of specifying the paths of interest. One generic approach is to use an arbitrary JSON object as a query template:

# Use object obj as a query template.
# Only the all-key paths in obj are considered.
def projection(obj):
  def allpaths:  # include paths to null
    def conditional_recurse(f):  def r: ., (select(.!=null) | f | r); r;
    path(conditional_recurse(.[]?)) | select(length > 0);
  def relevantpaths:
    obj
    | allpaths
    | . as $p
    | select( obj | getpath($p) | type | (. != "array" and . != "object"))
    | select( all(.[]; type == "string"));
  . as $in
  | reduce relevantpaths as $p ({}; setpath($p; $in|getpath($p)));

Example:

def grahame: { "author": { "surname": "Grahame", "firstname":"Kenneth" }, "born": 1859 };

grahame | projection(  { "author": { "surname": null }, "died": 1932 } )

Result:

{
  "author": {
    "surname": "Grahame"
  },
  "died": null
}

pkoppstein avatar Feb 19 '20 20:02 pkoppstein

Considering picking a few paths from a large JSON (e.x. pick only id and name from GitHub API v3 object), I think it's better not to iterate though all the paths.

itchyny avatar Feb 19 '20 21:02 itchyny

@itchyny - If your comment is about def projection, it is misdirected, as allpaths is applied only to the template.

pkoppstein avatar Feb 19 '20 22:02 pkoppstein

  1. Explicitly extract these set of keys, e.g.: { a: .a, c: .c }

You can use shortcut syntax here – just {a, c}. But it won't work for nested structures.

@pkoppstein for template object construction in your example there is also a shortcut syntax for object construction with null values:

{author: {surname}, died}

Funny how both of these cases are the same syntax but different semantics depending on what is .

odnoletkov avatar Feb 20 '20 00:02 odnoletkov

If your comment is about def projection, it is misdirected, as allpaths is applied only to the template.

Ooops, sorry about that.

itchyny avatar Feb 20 '20 00:02 itchyny

@odnoletkov - Yes, but one has to be careful as {author: {surname}, died} is NOT the same as {author: {surname: null}, died: null}. To understand the issue, consider:

def id(x): x;

{author: {surname}, died: {x: 0}} | id({author: {surname}, died})

pkoppstein avatar Feb 20 '20 01:02 pkoppstein

Maybe we can implement haspath builtin as well?

itchyny avatar Feb 20 '20 02:02 itchyny

@itchyny - With the above def of allpaths, haspath can be defined as:

def haspath($p): first(allpaths == $p // empty) // false;

pkoppstein avatar Feb 20 '20 02:02 pkoppstein

@pkoppstein Maybe haspath($p) does not need to list all the paths.

seq 10000 | jq --slurp '
  def allpaths:  # include paths to null
    def conditional_recurse(f):  def r: ., (select(.!=null) | f | r); r;
    path(conditional_recurse(.[]?)) | select(length > 0);
  def haspath($p): first((allpaths|debug) == $p // empty) // false;

haspath([1000])'

itchyny avatar Feb 20 '20 02:02 itchyny

@itchyny wrote:

Maybe haspath($p) does not need to list all the paths.

It doesn't. That's what first/1 is for.

pkoppstein avatar Feb 20 '20 03:02 pkoppstein

I suggest following implementation.

def haspath($path):
  ($path|length) == 0 or
    (type | . == "object" or . == "array") and
    has($path[0]) and (.[$path[0]]|haspath($path[1:]));

def pick($paths):
  . as $root |
    reduce ($paths[]|[.]|flatten(1)) as $path
      ({}; if $root|haspath($path) then . + setpath($path; $root|getpath($path)) else . end);

itchyny avatar Feb 20 '20 04:02 itchyny

@itchyny - Your haspath uses a form of recursion which jq does not support efficiently. Here's an efficient implementation that uses the same insight:

def haspath($path):
  def h:
    . as [$json, $p]
    | (($p|length)==0) or
      ($json | (has($p[0]) and ( [getpath([$p[0]]), $p[1:] ] | h)));
  [., $path] | h;

pkoppstein avatar Feb 20 '20 07:02 pkoppstein

@itchyny - Please read up on TCO on the jq wiki. You could start at https://github.com/stedolan/jq/wiki/Advanced-Topics#tail-call-optimization

pkoppstein avatar Feb 20 '20 07:02 pkoppstein

Please read up on TCO on the jq wiki. You could start at https://github.com/stedolan/jq/wiki/Advanced-Topics#tail-call-optimization

Thanks, mate.

itchyny avatar Feb 20 '20 07:02 itchyny

@itchyny @pkoppstein I was looking for a solution for this problem for a while and found your code very helpful. Thank you! However, I have a use case which doesn't seem to work with the above implementation: If the input object contains an array.

For example:

{
 "a": 1,
 "b": 2,
 "c": [
  {
   "d": 1,
   "e":  2
  },
  {
   "d": 2,
   "e": 3
  }
 ]
}

and we wish to extract only .a and .c[].d, I couldn't an input to pass to pick that would work.

You could select .c[0].d with pick([["c", 0, "d"]]), but you can't pass a wildcard in the form of . or * or something like that instead of 0. I went down the route of enumerating all the possible indices and tried pick(path(.c[].d)) but got jq: error (at test.json:14): Cannot check whether object has a number key. Perhaps you have a suggestion for solving this use case?

Sorry in advance if I'm missing a really obvious way of solving this - I'm a jq newbie.

alon-k avatar Jun 17 '22 21:06 alon-k

@alon-k you mean something like this?

$ echo '{"a":1,"b":2,"c":[{"d":1,"e":2},{"d":2,"e":3}]}' | \
jq 'def pick(p): . as $v | reduce path(p) as $p ({}; setpath($p; $v | getpath($p))); pick(.a, .c[].d)'
{
  "a": 1,
  "c": [
    {
      "d": 1
    },
    {
      "d": 2
    }
  ]
}

But note that with this paths that don't exist in the input will be set to null in the output object. To fix that one probably need to use the haspath thing talked about above.

wader avatar Jun 18 '22 15:06 wader

@wader Yes, exactly. Thank you!

For anyone interested, here is my final code:

def haspath($path):
  def h:
    . as [$json, $p]
    | (($p|length)==0) or
      ($json | (has($p[0]) and ( [getpath([$p[0]]), $p[1:] ] | h)));
  [., $path] | h;

def pick(paths):
  . as $root |
    reduce path(paths) as $path
      ({}; if$root|haspath($path) then . + setpath($path; $root | getpath($path)) else . end);

Result:

$ jq "$(cat better_query.jq)"' pick(.a, .c[].f, .c[].e)' test.json 
{
  "a": 1,
  "c": [
    {
      "e": 2
    },
    {
      "e": 3
    }
  ]
}

alon-k avatar Jun 24 '22 13:06 alon-k

👍 Nice, thanks for sharing

wader avatar Jun 24 '22 16:06 wader

@alon-k -f might be interesting if you want to read a script from a file

wader avatar Jun 24 '22 16:06 wader

The following works using native jq

  to_entries | [.[] | select( .key as $key | any(["a", "b"][] == $key) ))] | from_entries
echo '{"a": 1, "b": 2, "c": 3}' | jq '. | to_entries | [.[] | select( .key as $key | any(["a", "c"][] == $key) )] | from_entries'
{
  "a": 1,
  "c": 3
}

LaurentGoderre avatar Nov 15 '23 21:11 LaurentGoderre

@LaurentGoderre note that ["foo","bar"][] is basically just ("foo", "bar") for constants. You can rewrite that as just:

jq 'del(.[keys[] | select(IN("a", "b", "z") | not)])' <<< '{"a": 1, "b": 2, "c": 3}' 
{
  "a": 1,
  "b": 2
}

or

jq 'with_entries(select(.key | IN("a", "b", "z")))' <<< '{"a": 1, "b": 2, "c": 3}'
{
  "a": 1,
  "b": 2
}

In jq 1.7 with the new pick/1 builtin you can also conveniently use something like

jq 'pick(("a", "b", "z") as $k | select(has($k))[$k])' <<< '{"a": 1, "b": 2, "c": 3}'
{
  "a": 1,
  "b": 2
}

emanuele6 avatar Nov 15 '23 21:11 emanuele6

Could this issue be closed then since there are many documented ways to achieve this?

LaurentGoderre avatar Nov 15 '23 22:11 LaurentGoderre

Well, those "solutions" that were just mentioned would only work for one level.

emanuele6 avatar Nov 15 '23 22:11 emanuele6

Looks resolved by pick/1 in jq 1.7.

itchyny avatar Nov 16 '23 04:11 itchyny