
Add option max_levels to decode

Open Kuroneer opened this issue 4 years ago • 10 comments

This PR adds a max_levels option to jiffy:decode. The option is given as {max_levels, non_neg_integer()}. Jiffy then decodes up to N levels of the JSON, and anything deeper than that is returned as {json, binary()} instead.

This PR couples nicely with #139, allowing jiffy:encode(jiffy:decode(Binary, [{max_levels, X}]))

Why is it useful? In my case, I use Erlang plus jiffy to route streamed JSON documents to different endpoints based on a routing key that's near the top of the document. Decoding and re-encoding the whole document every time seems wasteful. A simple test shows a great improvement for me; however, I don't know whether it's worth including for everyone:

2> timer:tc(fun() -> [jiffy:decode(JSON, [return_maps]) || _ <- lists:seq(1, 10000)], ok end).
{2355549,ok}
3> timer:tc(fun() -> [jiffy:decode(JSON, [return_maps, {max_levels, 1}]) || _ <- lists:seq(1, 10000)], ok end).
{564704,ok}
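To make the proposal concrete, here is a sketch of what a max_levels decode might look like. The {json, binary()} shape is taken from the PR description; the document and key names are made up for illustration:

```erlang
%% Illustrative only: assumes the max_levels option from this PR.
%% With {max_levels, 1} the top-level object is decoded, while the
%% nested "payload" value is left as an undecoded {json, binary()}.
JSON = <<"{\"route\":\"users\",\"payload\":{\"id\":1}}">>,
#{<<"route">> := Route,
  <<"payload">> := {json, PayloadBin}} =
    jiffy:decode(JSON, [return_maps, {max_levels, 1}]),
%% Route is the routing key; PayloadBin holds the raw, unparsed slice,
%% which #139 would let us pass straight back to jiffy:encode/1.
```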

Kuroneer avatar Mar 24 '20 16:03 Kuroneer

I'm not immediately in love with this. My first thought is whether https://github.com/talentdeficit/jsx would be a better fit with its SAX-style parsing. Do you need to modify the JSON as you route it? I've been reluctant to merge a PR like #139 because it means that Jiffy's output isn't guaranteed to be correct, since it's not checking the pre-encoded data.

davisp avatar Mar 25 '20 19:03 davisp

Thanks for pointing me to jsx; I didn't know about that project. However, if I read it correctly, it would not suit my needs because, unfortunately, yes, I need to modify the root node. I'd also need to test its performance, because this PR avoids creating binaries and other terms that won't exist in the end result.

I understand your concerns. Would you be more keen if I included some way to validate {json, binary()} tuples in #139 (with an option to trust the binaries)? I guess that since it's a json_value(), it'd suffice to try to jiffy:decode the binary with max_levels set to 0 (which would need to be allowed; right now it's a pos_integer()). Also, I'd need to make sure that all the checks are kept, as some of them are skipped when the term isn't actually needed.

Kuroneer avatar Mar 25 '20 21:03 Kuroneer

Yeah, JSX being an option would have relied on that key being "early" in the JSON doc and on not requiring mutation.

My biggest concern around partially encoded JSON is around Jiffy emitting malformed JSON. If we can come up with an implementation that is not overly complex and can be guaranteed to not produce invalid JSON then I'd be on board.

Previously when I've noodled over this, I was expecting to need almost two parser implementations, which I more or less rejected as an option since it doubles the amount of logic that needs testing. However, this PR makes me think there might be a decent approach to handling things such that normal uses of the API would give enough guarantees for me (I'm ignoring users who reverse engineer internal state to intentionally sidestep guarantees; in other words, it doesn't have to be iron clad, just not allow accidental malformed JSON).

I hadn't really thought about attempting a no-op decoding approach like you did for your max levels. I think that leads to an interesting approach, along the lines of:

jiffy:validate(binary(), Opts) -> true|false.

Implementing a validation feature requires that we tweak the decoder structure to run through all checks while not generating Erlang terms. Throwing a boolean on the Decoder struct and then gating all term-building calls on it would be straightforward, I believe. Note that this is different from your current approach, since yours bypasses some checks.
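A usage sketch of that proposed function (nothing here exists in jiffy yet; names follow the signature above):

```erlang
%% Sketch: jiffy:validate/2 would run the full set of decoder checks
%% without building Erlang terms, returning a plain boolean.
true  = jiffy:validate(<<"{\"ok\": [1, 2, 3]}">>, []),
false = jiffy:validate(<<"{\"ok\": ">>, []).
```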

Once we have that, then I'd add a second function that was something like:

jiffy:prepare_partial(Binary, State, Options) -> partial_json().
jiffy:extract_partial(partial_json()) -> {Binary, State}.

This would use the no-op decoding to ensure that the partial binary is in fact valid JSON. The state is required so we know that we start and end at the appropriate places; it'd be something along the lines of the atoms value | kv_pairs | array_elems. The return value would be a resource object containing the state information and a reference to the binary. This could then be placed within any Erlang JSON representation at an appropriate location.
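A usage sketch of that pair of functions, under the assumptions above (the value state atom is from the list above; the fragment and expected output are illustrative):

```erlang
%% Sketch: validate a fragment as a standalone JSON value, then embed
%% the resulting partial_json() resource in a larger document.
Partial = jiffy:prepare_partial(<<"{\"id\": 1}">>, value, []),
Doc = jiffy:encode({[{<<"payload">>, Partial}]}),
%% The encoder would stitch the fragment in where the resource sits;
%% extract_partial/1 recovers the original binary and its state:
{<<"{\"id\": 1}">>, value} = jiffy:extract_partial(Partial).
```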

When we then go through encoding, we'd extend the partial-encoding loop that currently only handles bignums to stitch the JSON together. That will require adding/checking comma separators and the like, and making assertions on the partial JSON, i.e., that you can't stick KV pairs inside an array and generate something weird like [1.0, "foo": "bar", false].

Then, getting back to your original goal of a max_levels decode, we'd just flip on the partial-encoding flag dynamically during decoding and return the opaque references, which can then be fed back into the encoder without a second pass (which I'm assuming covers your ask for skipping re-checking of already-validated JSON).
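Putting the pieces above together for the routing use case, a sketch might look like this (all option names and the opaque-reference behavior are as proposed in this thread, not existing jiffy API; the route key is made up):

```erlang
%% Sketch: decode only the top level, route on a key, then re-encode.
%% Deeper values would come back as opaque partial_json() resources,
%% so the encoder can stitch them in without re-validating.
route(Binary) ->
    #{<<"route">> := Route} = Doc =
        jiffy:decode(Binary, [return_maps, {max_levels, 1}]),
    Out = jiffy:encode(Doc#{<<"routed">> => true}),
    {Route, Out}.
```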

And then after all of that we write a whole bunch of tests.

davisp avatar Mar 26 '20 20:03 davisp

One thing I'd note is that this approach does not allow something funky that affects multiple levels from multiple places, i.e., having two places where partial JSON is inserted that only form valid JSON together after encoding. Something like:

[
    {json, <<"{\"foo\": {\"bar\":">>},
    true,
    {json, <<"}}">>}
]

Or something. But I can't even figure out what that would look like.

davisp avatar Mar 26 '20 20:03 davisp

Though, it occurs to me that we could add an encoder option that would encode values and return partial_json() resources and then you could do something like:

Partial = jiffy:encode(Term, [partial]),
jiffy:encode({[{<<"foo">>, {[{<<"bar">>, Partial}]}}]}).

davisp avatar Mar 26 '20 20:03 davisp

I'll work on it and give it a try

Kuroneer avatar Mar 29 '20 00:03 Kuroneer

Although I'm not completely sold on resources due to the serialization limitation and the need for a process-independent environment to hold the binary, it works nicely.

Can you check the PR and give me your thoughts?

Kuroneer avatar Apr 07 '20 03:04 Kuroneer

(sorry for the late reply, it's been a really hectic week)

Just to note the changes: in 89ac61b I've updated the validator to check everything. The only thorn left is building the binary for a string when it has_escapes, to make use of the constructed binary; I'll see how it ends up playing with the offset and the buffer. (Objects are OK by construction.)

In dc1fd56 the resource is created and returned, currently only for json_values().

I'll work next on having a validate() (which in my head is only an alias for {max_levels, 0})

Kuroneer avatar Apr 07 '20 03:04 Kuroneer

jiffy:validate/1,2 ready.

Kuroneer avatar Apr 07 '20 14:04 Kuroneer

Looks like you're going in the right direction. I'll try and find time to give this a thorough review.

davisp avatar Apr 07 '20 16:04 davisp