TiddlyWiki5 icon indicating copy to clipboard operation
TiddlyWiki5 copied to clipboard

New filter operators to read arbitrary JSON structures

Open Jermolene opened this issue 2 years ago • 34 comments

This pull request adds three new filter operators that allow data to be read from JSON tiddlers:

  • getjson to retrieve a value
  • typejson to retrieve the type of a value
  • indexesjson to retrieve the indexes of an array or key names of an object

The new operators support arbitrary JSON objects, not just the usual key:value pairs.

Usage

The following examples assume this JSON data is stored in a data tiddler called "foobar":

{
    "a": "one",
    "b": "",
    "c": "three",
    "d": {
        "e": "four",
        "f": [
            "five",
            "six",
            true,
            false,
            null
        ],
        "g": {
            "x": "max",
            "y": "may",
            "z": "maize"
        }
    }
}

Values are identified by a sequence of indexes. For example, the value at [a] is one, and the value at [d][f][0] is five.

The new operators use multiple operands to specify the indexes:

[[foobar]getjson[a]] --> "one"
[[foobar]getjson[d],[e]] --> "four"
[[foobar]getjson[d],[f],[0]] --> "five"

Indexes can be dynamically composed from variables and transclusions:

[[foobar]getjson<variable>,{!!field},[0]]

Boolean values and null are returned as normal strings. The typejson operator returns a string identifying the original type. Thus:

[[foobar]typejson[a]] --> "string"
[[foobar]typejson[d]] --> "object"
[[foobar]typejson[d],[f]] --> "array"
[[foobar]typejson[d],[f],[2]] --> "boolean"

Using the getjson operator to retrieve an object or an array returns a list of all the values. For example:

[[foobar]getjson[d],[f]] --> "five","six","true","false","null"
[[foobar]getjson[d],[g]] --> "max", "may", "maize"

The indexesjson operator retrieves the corresponding indexes:

[[foobar]indexesjson[d],[f]] --> "0", "1", "2", "3", "4"
[[foobar]indexesjson[d],[g]] --> "x", "y", "z"

The behaviour when retrieving an object or array that contains values that are themselves objects or arrays may be controversial. The current implementation recursively retrieves the values. The obvious alternative would be to ignore objects and arrays but that also seems unsatisfactory. The current behaviour looks like this:

[[foobar]getjson[d]] --> "four", "five", "six", "true", "false", "null", "max", "may", "maize"

A further subtlety is that the special case of a single blank operand is used to identify the root object. Thus:

[[foobar]indexesjson[]] --> "a", "b", "c", "d"

UPDATE: To add an example of using a list widget to iterate through the properties of an array within a JSON object:


<$list filter="[[foobar]jsonindexes[d][f]]">
<div>
<$text text=<<currentTiddler>>/>: <$text text={{{ [[foobar]jsonget[d][f]<currentTiddler>]  }}}/>
</div>
</$list>

Prints:

0: five
1: six
2: true
3: false
4: null

Use Cases

There are several motivations for these new operators:

  • To be able to process arbitrary JSON data returned by APIs or imported from other systems
  • To be able to pass large blocks of structured data to action strings (the immediate use case is some work I'm doing on the Dynannotate plugin)

Onward Development

These operators only support reading JSON data. It would also be useful to introduce action widgets for writing or modifying arbitrary JSON data structures.

These operators require the JSON data to be stored in a data tiddler; they don't permit operations on (for example) JSON data stored within a variable. It is done this way so that we can take advantage of the existing mechanisms for caching the parsed JSON data.

To Do

  • [ ] Documentation
  • [ ] Review whether the helper functions in json-ops.js should be in wiki.js

Jermolene avatar Mar 11 '22 10:03 Jermolene

In terms of naming and syntax for operators, I wonder if something along these lines might be more intuitive: [[foobar]json:get[a]] or [[foobar]json-get[a]] instead of [[foobar]getjson[a]]

saqimtiaz avatar Mar 11 '22 10:03 saqimtiaz

What do I get with [[foobar]getjson[]]?

pmario avatar Mar 11 '22 10:03 pmario

In terms of naming and syntax for operators, I wonder if something along these lines might be more intuitive: [[foobar]json:get[a]] or [[foobar]json-get[a]] instead of [[foobar]getjson[a]]

I'm not sold on the current names. Making these operators be a suffix to a single json operator seems like a nice idea, but perhaps inconsistent with other filters.

What do I get with [[foobar]getjson[]]?

It's the same process as the [[foobar]getjson[d]] example; the flattened values are returned.

Jermolene avatar Mar 11 '22 10:03 Jermolene

In terms of naming and syntax for operators, I wonder if something along these lines might be more intuitive: [[foobar]json:get[a]] or [[foobar]json-get[a]] instead of [[foobar]getjson[a]]

At first I also thought like this, but then the operators look very similar, at least for me.

getjson, typejson, indexesjson ... IMO you immediately see what is meant jsonget, jsontype, jsonindex ... is much less visual difference. ...

I personally would like get-j, type-j and index-j ... to get shorter filter strings and less typing. But that would obviously be a different naming convention than the existing operators.

pmario avatar Mar 11 '22 10:03 pmario

My primary concern is that to the uninitiated, getjson[] reads as if it returns JSON. So it is worth trying to think of better alternatives before we settle for the current proposal.

saqimtiaz avatar Mar 11 '22 11:03 saqimtiaz

It's the same process as the [[foobar]getjson[d]] example; the flattened values are returned.

hmm [[foobar]getjson[d]] --> "four", "five", "six", "true", "false", "null", "max", "may", "maize"

OK so a structjson operator could be [[foobar]structjson[d]] and return: "e:'four'" "f[0]:'five'" "f[1]:'six'" "f[2]:true" "f[3]:false" "f[4]:null"

Will need to play with the code at the weekend

Edit: did edit the code snippets to "e:'four'" "f[0]:'five'" "f[1]:'six'" "f[2]:true" "f[3]:false" "f[4]:null"

pmario avatar Mar 11 '22 11:03 pmario

OK so a structjson operator could be [[foobar]structjson[d]] and return: "d.e:'four'" "d.f[0]:'five'" "d.f[1]:'six'" "d.f[2]:true" "d.f[3]:false" "d.f[4]:null"

The trouble with that is that the resulting strings are pretty much useless; it would take a lot of work to parse the results.

Jermolene avatar Mar 11 '22 11:03 Jermolene

My primary concern is that to the uninitiated, getjson[] reads as if it returns JSON. So it is worth trying to think of better alternatives before we settle for the current proposal.

I agree, I do think we need to come up with something better. @pmario has a good point that it's easier to read if "json" is a suffix rather than a prefix.

Jermolene avatar Mar 11 '22 11:03 Jermolene

OK so a structjson operator could be [[foobar]structjson[d]] and return: "d.e:'four'" "d.f[0]:'five'" "d.f[1]:'six'" "d.f[2]:true" "d.f[3]:false" "d.f[4]:null"

The trouble with that is that the resulting strings are pretty much useless; it would take a lot of work to parse the results.

Not really. It's returns key:value pairs in 1 filter run. Whereas "flat" result lists most of the time need 2 or more filters to create useful outputs. .. That was one reason, why I did create the keyvalues plugin.

I did think the same looking at: [[foobar]getjson[d]] --> "four", "five", "six", "true", "false", "null", "max", "may", "maize" .. but both outputs may have a usecase. ...

As I wrote: I need to play with the new code a bit more

pmario avatar Mar 11 '22 11:03 pmario

Hi @pmario

Not really. It's returns key:value pairs in 1 filter run. Whereas "flat" result lists most of the time need 2 or more filters to create useful outputs. .. That was one reason, why I did create the keyvalues plugin.

But my point is that there is brittle and error prone parsing of the resulting strings, which I want to avoid. If we wanted to return structured data as a single list item then the logical choice would be to return a literal JSON string, and have operators to work with it, but I don't want to do that because it would bypass the caching. The mechanism presented here should work with multimegabyte JSON files.

Jermolene avatar Mar 11 '22 12:03 Jermolene

Broadly, I like this.

I'm wondering if term json is redundant? Perhaps:

[[foobar]from[a]] --> "one"
[[foobar]from[d],[e]] --> "four"
[[foobar]from[d],[f],[0]] --> "five"

[[foobar]typeof[a]] --> "string"

[[foobar]indexesof[d],[f]] --> "0", "1", "2", "3", "4"

I'm assuming, of course, the parser knows it's in json-land after parsing "[foobar]".

In future (onward dev) , I think avoid the clash of concerns in $action-setfield (because fieldnames can contain dots) by introducing...

<$action-setindex $tiddler=foobar $index=d.f.0 $value="..." />

Of, if setfield is prefered, a new attribute, $indexed

<$action-setfield $tiddler=foobar $indexed=d.f.0 $value="..." />

CodaCodr avatar Mar 11 '22 13:03 CodaCodr

I'm assuming, of course, the parser knows it's in json-land after parsing "[foobar]".

Sadly, it doesn't know the type of the input at the time the filter expression is parsed.

In future (onward dev) , I think avoid the clash of concerns in $action-setfield (because fieldnames can contain dots) by introducing..

The trouble is that we need to allow for computed indexes, implying something like this:

<$action-setjson $tiddler=foobar $0="d" $1="f" $2="0" $3=<<variable>> $value="..." />

Jermolene avatar Mar 11 '22 15:03 Jermolene

<$action-setjson $tiddler=foobar $0="d" $1="f" $2="0" $3=<> $value="..." />

Ouch. That's painful. But yes, I get it. The $n gives you the index position(s) which you absolutely need.

What about...

<$action-setjson $tiddler=foobar $at="""d.f.0.<<var>>""" ... />

which implies one could even...

<$let var={{{ complex-construct }}}>
<$action-setjson $tiddler=foobar $at=<<var>> ... />

CodaCodr avatar Mar 11 '22 15:03 CodaCodr

What about...

Interesting idea! But it appears to introduce another new syntax that will need parsing and error checking...

Jermolene avatar Mar 11 '22 15:03 Jermolene

<$action-setjson $tiddler=foobar $at="""d.f.0.<<var>>""" ... />

IMO it needs to be $at="""d.f[0]""" $value="something" if you want to write to an array. ... Or we completely "skip" array access and only allow object key:values

pmario avatar Mar 11 '22 16:03 pmario

it appears to introduce another new syntax

I was aware of that but considered we were in new-syntax territory anyway, if...

  1. we're having to deal with json topology (i.e. non-flat data streams)
  2. we agree that $0..$n is laborious to type and read
  3. we agree that a.b.c.<<thing>> maps better cognitively onto json

CodaCodr avatar Mar 11 '22 16:03 CodaCodr

Yes, I think @pmario is correct, there needs to be an array "signal", otherwise a prop named "2" is indistinguishable from array index 2.

CodaCodr avatar Mar 11 '22 16:03 CodaCodr

Yes, I think @pmario is correct, there needs to be an array "signal", otherwise a prop named "2" is indistinguishable from array index 2.

Conveniently, there is no difference. The following is true in JS:

var a=[1,2,3];a[2] === a["2"]

Jermolene avatar Mar 11 '22 16:03 Jermolene

In any case, this PR is about these operators. This discussion about setting JSON properties is outside of the scope of this PR, and should take place elsewhere, please!

Jermolene avatar Mar 11 '22 16:03 Jermolene

I think @joshuafontany should be alerted about this discussion.

twMat avatar Mar 11 '22 22:03 twMat

Yes indeed, thank you @twMat.

@joshuafontany this is much narrower in scope than your JSON mangler plugin, but I'd be interested in your thoughts.

Jermolene avatar Mar 12 '22 09:03 Jermolene

This is a fantastic evolution of the TW json-abilities. Excellent ground work. I am only halfway through the thread and ran across this comment. I will catch up after responding:

My primary concern is that to the uninitiated, getjson[] reads as if it returns JSON. So it is worth trying to think of better alternatives before we settle for the current proposal.

Why not have it return a json-string that can be parsed?, so that [[json:/tiddler1]getjson[a],[b],[c]indexesjson[]] is the equivalent of [[json:/tiddler1]getjson[a]subfilter<myFilter>] where <myFilter> = [parsejson[]getjson<secondLevel>,<thirdLevel>indexesjson[]] & the other vars are b abd c respectively, In this way we can pass json objects to sub-filters operators, past filter-run-prefix gaps, etc, etc. Reading on.

[Edit] Got to the Caching concerns. Aaaaah. Reading on. :)

joshuafontany avatar Mar 12 '22 20:03 joshuafontany

@Alzacon suggests renaming the operators getdata,indexesdata, and typedata to reflect the fact that they work with all data tiddlers, not just JSON tiddlers.

I can see the argument, but I worry that "data" is such a generic word that the usage of the operators is less obvious.

Jermolene avatar Mar 14 '22 14:03 Jermolene

I could see the opportunities there to extend "data" to include CSV tiddlers (character separated values) with a 2 part index-path.

joshuafontany avatar Mar 15 '22 00:03 joshuafontany

I was thinking something simitar to the argument of this comment of @saqimtiaz , I agree with him. (It is more intuitive , at least for me, as a non-native English speaker)

In terms of naming and syntax for operators, I wonder if something along these lines might be more intuitive: [[foobar]json:get[a]] or [[foobar]json-get[a]] instead of [[foobar]getjson[a]]

I was thinking that it is easy to understand as operator for data tiddlers with using data as prefix, for example with something like data-get or dataget would be where you can get the value for specific index/key.

I'm also fine with alternatives based on the ideas suggested by @CodaCodr , for example somethig like: datafrom dataindex dataindexes datatypeof

Alzacon avatar Mar 15 '22 19:03 Alzacon

I think it would be helpful to support JSON path as well. It's an existing standard so documentation might be easier.

One use case I've had in the past is having a bunch of data as JSON in an array and wanting to iterate over it via something like the ListWidget. Maybe that could be done via a new widget? At the very least, it should be possible to write such a widget that cooperates with this macro so it would be helpful for the macro to have a direct option where the input is treated as JSON string(s) rather than a list of tiddlers.

slaymaker1907 avatar Mar 22 '22 22:03 slaymaker1907

Before merging this PR, I think it's important to hammer out the details of the syntax. It's probably best if we reuse an existing syntax, like jsonpath or jq, as much as possible rather than inventing a new syntax that's unique to TW. That may not be possible in all cases (for example, jq syntax uses [] for array access, which would be problematic in filters) but since I haven't seen jq being considered in my (admittedly very quick) skim of the PR discussion so far, I thought I'd bring it up.

Now, there are many parts of jq syntax that would not be suitable for this filter operator (it can do output as well as input, so it has syntax that wouldn't fit into a getjson operator). I am absolutely NOT suggesting that all of jq syntax be used. But things like .a.b.c (note the leading dot) might be a good idea to consider, even if we ultimately settle on a.b.c with no leading dot. (I certainly would prefer getJson[a.b.c] rather than getJson[a],[b],[c], though). So I'm bringing it up as a possibility to discuss; we might be able to take some good ideas from jq syntax and leave its bad ideas behind.

rmunn avatar Mar 23 '22 09:03 rmunn

Hi @joshuafontany

I could see the opportunities there to extend "data" to include CSV tiddlers (character separated values) with a 2 part index-path.

Yes indeed, just as dictionary tiddlers are parsed to JSON we could parse CSV tiddlers to JSON and access the data with these same operators.

I was thinking that it is easy to understand as operator for data tiddlers with using data as prefix, for example with something like data-get or dataget would be where you can get the value for specific index/key.

The "jsontype" operator is specific to the way that JSON data is stored, which makes me think that these are really not generic data operators, they are specific to JSON data. As touched on above, the way to make other data formats accessible is to transform them into JSON (which can be done dynamically as we do for dictionary tiddlers).

I think it would be helpful to support JSON path as well. It's an existing standard so documentation might be easier.

Do you mean the "JSON Path" described here?

That kind of syntax is useful, but the trouble is that it is a single string which means that to construct one of those paths dynamically in wikitext requires some kind of string substitution, and of course will require very careful encoding to handle special characters.

As it happens, the syntax here does somewhat resemble the bracket-based format of JSON Path:

$['store']['book'][0]['title']
vs.
[jsonget[store][book[0][title]]

One use case I've had in the past is having a bunch of data as JSON in an array and wanting to iterate over it via something like the ListWidget. Maybe that could be done via a new widget? At the very least, it should be possible to write such a widget that cooperates with this macro so it would be helpful for the macro to have a direct option where the input is treated as JSON string(s) rather than a list of tiddlers.

That can be done with the "indexesjson" operator and the existing list widget. I'll edit the OP to add an example.

Before merging this PR, I think it's important to hammer out the details of the syntax. It's probably best if we reuse an existing syntax, like jsonpath or jq, as much as possible rather than inventing a new syntax that's unique to TW. That may not be possible in all cases (for example, jq syntax uses [] for array access, which would be problematic in filters) but since I haven't seen jq being considered in my (admittedly very quick) skim of the PR discussion so far, I thought I'd bring it up.

Now, there are many parts of jq syntax that would not be suitable for this filter operator (it can do output as well as input, so it has syntax that wouldn't fit into a getjson operator). I am absolutely NOT suggesting that all of jq syntax be used. But things like .a.b.c (note the leading dot) might be a good idea to consider, even if we ultimately settle on a.b.c with no leading dot. (I certainly would prefer getJson[a.b.c] rather than getJson[a],[b],[c], though). So I'm bringing it up as a possibility to discuss; we might be able to take some good ideas from jq syntax and leave its bad ideas behind.

I have considered using an existing syntax, but as discussed above consider that it is more important to be able to compose paths dynamically in a natural way.

However, I think there are still some usecases for being able to represent paths within a JSON object as a single opaque string. They cover the situations where one wants to store and reuse a path without knowing it's "shape" in advance.

So, we could consider an optional suffix for the json operators that selects between different ways of identifying the target node. For example:

[jsonget:path[store][book][0][title]]
vs.
\define mypath() $.store.book[0].title
[jsonget:direct<mypath>]

We'd also need a way to turn a path into a direct string:

[jsonpath[store][book][0][title]] --> $.store.book[0].title

But, two different ways of achieving the same thing seems unsatisfactory, so I'm not sure it's worth it. In any event, these enhancements would be backwards compatible and so could be retrofitted later.

Jermolene avatar Mar 24 '22 11:03 Jermolene

@Jermolene I was using "JSON Pointer" notation for my JsonMangler plugin, which is a library that allows you to use / as a path separator, and also allows you to "encode/decode" paths if you need a literal / in the path. JSON Pointer only defines how to access to a single value inside a json document. JSON Path has more powerful features like union to extract multiple values, filters and deep search (recursively match a path). We'd need to decide on which way to go, but I can even see a flag that would allow you to specify if you are passing a single Pointer string param, or a Path/multiple params.

joshuafontany avatar Mar 26 '22 00:03 joshuafontany

<$action-setjson $tiddler=foobar $at="""d.f.0.<<var>>""" ... />

IMO it needs to be $at="""d.f[0]""" $value="something" if you want to write to an array. ... Or we completely "skip" array access and only allow object key:values

For JsonMangler, I had to parse any existing data at that index to determine if its an array or object, check if the incoming key is numeric or alphanumeric, and convert the existing data or spawn an appropriate array/object if none yet exist in that path. Its not brittle, but it does take good test coverage. [Edit, just reread the thread & saw the request to set aside write-to-index discussion. Np, will ref later.]

joshuafontany avatar Mar 26 '22 07:03 joshuafontany