jsone
Streaming decode
A new option stream: Decode the input in multiple chunks. Instead of a result or error, {incomplete, fun()} is returned. The returned fun takes a single argument and should be called to continue the decoding. When all the input has been provided, the fun should be called with end_stream or end_json to signal the end of input, and then the fun returns a result or an error.
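The calling convention described above can be sketched without jsone itself. The module below is a minimal illustration, not the PR's implementation: stream_demo, feed/2 and toy_decoder/0 are hypothetical names, and the toy decoder only buffers chunks instead of parsing JSON. It shows how a caller keeps applying the returned fun until the input is exhausted and then signals end_stream.

```erlang
-module(stream_demo).
-export([feed/2, toy_decoder/0, run/1]).

%% Drive a continuation-style decoder: pass each chunk to the current fun,
%% then signal the end of input with end_stream once no chunks are left.
feed({incomplete, Fun}, [Chunk | Rest]) ->
    feed(Fun(Chunk), Rest);
feed({incomplete, Fun}, []) ->
    Fun(end_stream);
feed(Final, _Unused) ->
    %% The decoder already returned a result or an error.
    Final.

%% Hypothetical stand-in for the real decoder: it only buffers the chunks
%% and returns them joined when end_stream arrives. It does not parse JSON.
toy_decoder() ->
    toy_decoder(<<>>).

toy_decoder(Acc) ->
    {incomplete,
     fun(end_stream) ->
             {ok, Acc};
        (Chunk) when is_binary(Chunk) ->
             toy_decoder(<<Acc/binary, Chunk/binary>>)
     end}.

%% Feed a list of chunks through the toy decoder.
run(Chunks) ->
    feed(toy_decoder(), Chunks).
```

With the real option, the initial {incomplete, Fun} would presumably come from the decode call made with stream in the options; only the driving loop in feed/2 is meant to carry over.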
This is a first working implementation. We have yet to run the benchmark.
I made minimal changes to make it work. Perhaps some refactoring can make it less ugly.
If this makes the core decode implementation slower, we could consider putting the stream decode code in a separate module.
Fixes #73.
Sorry for the delayed response but what is the status of this PR? It's still marked as a draft, so are there any TODOs to make it review-ready? (maybe benchmarking?)
I think it is ready for review. Only benchmarking is missing. I will mark it ready for review.
Is there a way to run the benchmark of a branch and compare it with master?
I see. Thanks. I will start reviewing this PR.
Is there a way to run the benchmark of a branch and compare it with master?
There is a benchmark script I used at https://github.com/sile/jsone/tree/master/benchmark/run.sh. But it's not well maintained, so feel free to use another or your own benchmark if you favor that.
The CI failures could be fixed by running $ rebar3 efmt -w to format the source code.
I came up with an idea: it might be possible to implement this feature without modifying the jsone_decode module at all.
The following code shows the idea.
%% in jsone.erl file
try_decode_stream(Json, Options) ->
    case jsone_decode:decode(Json, Options) of
        {ok, Value, Remainings} ->
            {ok, Value, Remainings};
        %% Add handling of incomplete cases here
        {error, {badarg, [{jsone_decode, array_next, Args = [<<>>, Values, Nexts, Buf, Opt]}]}} ->
            incomplete(fun jsone_decode:array_next/5, Args);
        %% ... other clauses ...
        {error, Reason} ->
            {error, Reason}
    end.
I'm not 100% sure this approach is actually possible, but I think it has the obvious merit of not introducing any performance overhead when this feature isn't used.
That's a very interesting idea. It keeps jsone_decode simple. The case where I'm not sure is when the input is a number split between the digits. In this case, an incomplete input is not an error. E.g. <<"1">>, <<".23">>, <<"e45">>.
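To make the concern concrete, here is a small sketch using only standard Erlang BIFs (the module and function names are hypothetical). The first chunk on its own is already a complete, valid number, so a decoder that relies on an error to detect incomplete input would never see one, even though the three chunks together spell a different value.

```erlang
-module(split_number_demo).
-export([prefix_value/0, whole_value/0]).

%% The first chunk alone parses successfully as the integer 1,
%% so nothing signals that more digits may follow.
prefix_value() ->
    binary_to_integer(<<"1">>).

%% Joined together, the same chunks form the float 1.23e45.
whole_value() ->
    Chunks = [<<"1">>, <<".23">>, <<"e45">>],
    binary_to_float(iolist_to_binary(Chunks)).
```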
I will benchmark with the current implementation first. Then, I might try the badarg-to-incomplete version.
Benchmarks of current version and with this PR, decode only.
jsone = current version, with input Blockchain:

Name | ips | average | deviation | median | 99th % |
---|---|---|---|---|---|
jiffy | 3.61 K | 277.26 μs | ±25.91% | 248.63 μs | 506.51 μs |
Jason | 2.55 K | 392.71 μs | ±10.81% | 388.40 μs | 524.35 μs |
jsone | 1.87 K | 534.87 μs | ±15.29% | 523.80 μs | 852.57 μs |
Tiny | 1.46 K | 683.48 μs | ±10.97% | 668.14 μs | 932.35 μs |
Poison | 1.37 K | 729.87 μs | ±17.99% | 706.65 μs | 1167.41 μs |
JSX | 1.24 K | 809.64 μs | ±11.73% | 796.71 μs | 1091.97 μs |
JSON | 0.46 K | 2161.80 μs | ±9.99% | 2130.79 μs | 2827.24 μs |

jsone = this PR, with input Blockchain:

Name | ips | average | deviation | median | 99th % |
---|---|---|---|---|---|
jiffy | 4.17 K | 240.01 μs | ±23.01% | 220.79 μs | 380.98 μs |
Jason | 2.65 K | 376.88 μs | ±10.31% | 374.17 μs | 488.45 μs |
jsone | 1.74 K | 573.40 μs | ±10.16% | 566.48 μs | 768.11 μs |
Poison | 1.53 K | 655.55 μs | ±12.55% | 647.27 μs | 920.13 μs |
Tiny | 1.48 K | 677.49 μs | ±10.19% | 666.94 μs | 890.66 μs |
JSX | 1.28 K | 784.19 μs | ±19.22% | 756.03 μs | 1599.32 μs |
JSON | 0.62 K | 1618.36 μs | ±6.32% | 1605.90 μs | 2011.57 μs |
This was done on a laptop, which is why there are big differences between the runs. Still, the PR visibly has a slightly negative impact on jsone's performance (1.87 K ips before, 1.74 K ips with this PR).
Note: I did not run rebar3 efmt -w because it causes a great many changes, including to code that I didn't touch. That just makes the PR harder to review. I can do it in a separate commit later.
Thank you for sharing the benchmark result! It's interesting.
The case where I'm not sure is when the input is a number split between the digits. In this case, an incomplete input is not an error. E.g. <<"1">>, <<".23">>, <<"e45">>.
You're right. It could be a difficult point.
I think that the benchmark result is not too bad, but this change certainly seems to have a negative impact on the decoding performance. So, I'd like to consider the possibility of the above approach further.
(It is undecided whether to do it, but I would like to optimize jsone so that it will be faster someday. Therefore, I want to avoid performance degradation as much as possible.)
This is also just an idea, but it might be possible to retry the number decoding as follows:
%% in jsone.erl file (the logic could be complicated, so it feels better to create a new module such as jsone_stream.erl, btw)
try_decode_stream(Json, Options) ->
    case jsone_decode:decode(Json, Options) of
        {ok, Value, Remainings} ->
            {ok, Value, Remainings};
        {error, {badarg, [{jsone_decode, array_next, Args = [<<>>, Values, Nexts, Buf, Opt]}]}} ->
            case Nexts of
                %% If the head element of `Nexts` is a number, retry the number decoding when the next stream input is given.
                [N | Nexts1] when is_number(N) ->
                    incomplete(fun jsone_decode:number_integer_part/5, [jsone:encode(N), Values, Nexts1, Buf, Opt]);
                _ ->
                    incomplete(fun jsone_decode:array_next/5, Args)
            end;
        %% ... other clauses ...
        {error, Reason} ->
            {error, Reason}
    end.