Tape reading without alloc
Hi,
This PR is not finished, just a POC to get the discussion started.
So, the idea is to be able to read the tape without allocation. I created a benchmark with the following results, where f0 just calls `read`, and f1 builds the tape but then uses my non-allocating code.
```julia
julia> @benchmark f0($str)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  995.800 ns … 174.571 μs  ┊ GC (min … max): 0.00% … 99.10%
 Time  (median):       1.025 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.144 μs ±   3.819 μs  ┊ GC (mean ± σ):  7.44% ±  2.21%

  ▅██▇▆▅▄▃▂▂▃▃▃▃▃▂▂▁▁ ▁                                         ▂
  ████████████████████████████▇▇▆▇▆▅▆▇▅▆▄▅▆▅▆▆▆▅▇▇▅▄▆▅▅▆▅▆▅▆▅▅▆ █
  996 ns          Histogram: log(frequency) by time      1.51 μs <

 Memory estimate: 2.58 KiB, allocs estimate: 11.

julia> @benchmark f1($reader, $str)
BenchmarkTools.Trial: 10000 samples with 201 evaluations.
 Range (min … max): 394.900 ns …  13.140 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):    402.363 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):  415.439 ns ± 162.861 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁█
  ██▅▆▄▂▂▃█▆▃▃▂▂▃▅▅▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  395 ns          Histogram: frequency by time          506 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
The implementation is completely immutable and has two main concepts:

- `Cursor`: just contains an `Int` that is an index into the tape,
- `JSONItem`: contains the cursor, the tape, and the original string. There are several variations of `JSONItem` that implement a handy interface for the data they represent:
  - `JSONField`, which represents a field of an object; its key and value can be accessed with `key` and `value`,
  - `JSONObject`, which represents an object and is used as a dictionary (and could be an `AbstractDict`),
  - `JSONArray`, which represents an array and can be iterated over efficiently.
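To make the two concepts concrete, here is a minimal sketch of what the types could look like. The names match the description above, but the field names and layouts are my illustration, not the actual PR implementation:

```julia
# Hypothetical sketch of the Cursor / JSONItem concepts described above.
# Field names and layouts are illustrative, not the PR's actual code.
struct Cursor
    i::Int          # index into the tape
end

struct JSONObject{T<:AbstractVector,S<:AbstractString}
    cursor::Cursor  # where this object starts on the tape
    tape::T         # the parsed tape (e.g. a Vector{UInt64})
    buf::S          # the original JSON string
end

obj = JSONObject(Cursor(1), UInt64[0x1], "{\"a\":1}")
```

Because `Cursor` only wraps an `Int`, it is `isbits` and can live on the stack, which is what makes the low-level escape hatch mentioned below possible.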
A few details about the implementation:

- To be non-allocating, the interface has to be type stable, so the user is expected to provide the expected types. If they do not, `JSONItem`s are returned.
- `JSONItem`s contain the tape and hence should be allocating. In practice, the compiler can generally elide those allocations. In any case, users can drop down to the low-level, isbits `Cursor` to avoid relying on those compiler optimizations.
- Objects' fields are not stored in a dictionary. This means that when looking for a key, the tape is walked through. It is possible to continue walking from a previous cursor, which makes sense when the user knows the order of the fields. I ran a few benchmarks, and for a reasonable number of fields it does not seem to be slower than using a dictionary. Users would also be free to iterate over the fields and store the corresponding cursors in a dictionary themselves, if that is more performant for them.
- Similarly, arrays are not stored in a `Vector`.
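The key-lookup strategy can be illustrated with a toy model, where a flat vector of pairs stands in for the tape and the walk can resume from a previous position when the field order is known:

```julia
# Toy model of the lookup strategy described above: fields are scanned
# linearly, optionally resuming the walk from a previous hit. `fields`
# stands in for the tape; this is not the PR's actual code.
fields = ["id" => 1, "name" => 2, "value" => 3]

function findfield(fields, key, start::Int = 1)
    for i in start:length(fields)
        first(fields[i]) == key && return i
    end
    return nothing
end

i = findfield(fields, "id")            # found at position 1
j = findfield(fields, "value", i + 1)  # resume after "id": position 3
```

When fields are requested in serialization order, each lookup only scans forward from the last hit, so a full pass over an object's fields stays O(n) overall.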
@quinnj Are you interested?
To pursue this PR, I would like to:

- Rename those objects. The JSON prefix is useless, but you already have an `Object` and an `Array`...
- Implement proper bounds checking. I put some `@inbounds` in the code this weekend; they are intended for the benchmark, but they are not acceptable (especially for the `getindex` methods that take a `Cursor` as an argument).
- Raise proper errors when the type in the tape is not compatible with the one requested by the user. More importantly, I have not coded the type check for arrays yet! But that's a complicated one. Typically the eltype in the tape when parsing `"[1.0,1.5]"` will be `Union{Int64,Float64}`, whereas the user would most likely expect a `JSONArray{<:AbstractFloat}`. So there is a bit of logic to write here to handle those kinds of cases.
To go even further:

- Parsing the numbers in the tape could be optional. If we encoded them in the tape as we do for strings, then we could parse them 'on demand' with this interface. This would have several advantages:
  - faster parsing (when not all fields are consumed),
  - the user could specify whether the type is an `AbstractFloat` or an `Integer`, eliminating the need to branch on the result of `modf`,
  - it would solve the difficulty stated above when type checking arrays.
- Provide a way to deactivate the type checks (for performance).
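The `modf` branch mentioned above looks roughly like this: without a type hint from the user, a number parsed as `Float64` has to be inspected at runtime to decide whether to return an `Int64` or a `Float64` (a simplified illustration, not JSON3's actual code):

```julia
# Runtime narrowing of a parsed number, as a simplified sketch of the
# modf-based branch discussed above (not JSON3's actual implementation).
function narrow(x::Float64)
    frac, whole = modf(x)
    # No fractional part and representable as Int64: return an integer.
    return frac == 0.0 && abs(x) <= maxintfloat(Float64) ? Int64(whole) : x
end
```

With on-demand parsing, a user asking for an `Integer` or an `AbstractFloat` would skip this branch entirely, and the return type would be known statically.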
Hey @Poncito! Thanks for opening a PR. Some really interesting ideas here. I've also been thinking here and there about how we can support more lazy/non-allocating queries/workflows. I'm playing with some ideas here. In truth, I'd like to move away from the tape altogether, since that itself requires a big upfront allocation and has to be held onto by lazy objects. Ideally, I'm hopeful we can figure out a purely functional solution where we/users could make a custom `JSONBase.AbstractContext` to overload different parsing events (`objectEnter`, `objectExit`, etc.). The idea is that we could then implement lazy querying via a custom context, in addition to regular parsing to a `Dict`, or even all the StructTypes integration we have in JSON3.jl.
Anyway, I'm not sure if I'll be able to push forward on it much in the near term (depends on a few other projects going on), but I'd be happy to discuss some ideas more.
Hey @quinnj , thanks for your answer. I can dedicate a few hours a week of my free time to develop this kind of stuff. So if you have time to coach me on that, I could propose some code.
My understanding of JSONBase.jl and your explanation is the following:
- we want to extract the data from the serialized text in different ways: it could be a dict, a struct (with the StructTypes integration), a "tape", etc.
- you would like a solution that basically separates the "parsing" from the "storage/organization" of the content. This would remove duplicated code from `structs.jl` and `read.jl`, for instance. The dict/struct/tape would correspond to different implementations of `AbstractContext`. Typically, `objectValueExit` would either push something onto the tape or insert a pair into a dictionary.

Am I understanding correctly? I'm really not sure because of your comment on lazy objects. Do you also mean that you would like to make the parsing lazy? I'm not sure what that would mean or where the performance gain would come from.
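To check my understanding, here is a hedged sketch of that event-driven design. `AbstractContext` and `objectValueExit` are names from this thread, but the concrete signature below is invented purely for illustration:

```julia
# Hypothetical sketch of the context-based parsing events discussed above.
# The signature of objectValueExit! is invented for illustration only.
abstract type AbstractContext end

# One possible storage strategy: accumulate key/value pairs into a Dict.
struct DictContext <: AbstractContext
    out::Dict{String,Any}
end

# Parsing event fired when a key/value pair inside an object is complete.
objectValueExit!(ctx::DictContext, key::AbstractString, value) =
    (ctx.out[String(key)] = value; ctx)

ctx = DictContext(Dict{String,Any}())
objectValueExit!(ctx, "a", 1)
objectValueExit!(ctx, "b", 2.5)
```

A tape-building context would implement the same event by appending to its tape instead of inserting into a `Dict`, which is how the duplication between storage strategies would disappear.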
My guess on how to efficiently parse and read a JSON would be the following dichotomy (with corresponding `AbstractContext`s?):

- ordered JSON: provide some kind of schema to the parsing, to trigger compilation of code specialized for that schema.
  - if the JSON represents a small, serialized, stack-allocated structure: build it directly,
  - otherwise: use a simplified tape (no need to keep the keys, the types, ...).
- unordered JSON: JSON3's tape may be the most efficient solution, combined with the kind of reading I propose in this PR.
- large (enough) unordered JSON: a dict.
From this dichotomy, it appears that not only the "storage" should be abstract, but also the "parsing" (which does not appear in your JSONBase.jl). That is required to exploit the schema, but it can also be used to relax some checks. For instance, if I want to take the risk of assuming that the serialized JSON is well formed, I could decide to replace

```julia
pos + 3 <= len &&
    b == UInt8('t') &&
    buf[pos + 1] == UInt8('r') &&
    buf[pos + 2] == UInt8('u') &&
    buf[pos + 3] == UInt8('e')
```

with

```julia
b == UInt8('t')
```

I could also skip reading the keys entirely, and jump directly ahead based on the size of the expected key (and the next delimiter), ...
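For reference, both checks above run as-is on a buffer positioned at a `true` literal (the variable names `buf`, `pos`, `len`, and `b` follow the snippet above):

```julia
# Both literal checks from above, runnable on a buffer positioned at `true`.
buf = codeunits("true")
pos, len = 1, length(buf)
b = buf[pos]

# Strict check: verifies every byte of the literal, with a bounds guard.
strict = pos + 3 <= len &&
    b == UInt8('t') &&
    buf[pos + 1] == UInt8('r') &&
    buf[pos + 2] == UInt8('u') &&
    buf[pos + 3] == UInt8('e')

# Relaxed check: trusts the input to be well-formed JSON, so the first
# byte alone identifies the literal.
relaxed = b == UInt8('t')
```

On well-formed input the two agree; the relaxed version simply trades validation for fewer comparisons per literal.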
I can find some free time if you want to chat: [email protected]
Any updates on this?
I would really like to help, since I really need it; the problem is that I'm inexperienced, so I might cause some trouble :D
I've been slowly chipping away at some ideas at https://github.com/quinnj/JSONBase.jl. I've gotten pretty far, but haven't been able to do really thorough benchmarking yet to make sure everything is squared away. There's also some polish, package admin, and testing to do, but I think the fundamentals are in pretty good shape. Happy to chat in that repo with anyone who's interested to collaborate and try things out with me.