simd-json
Parsing newline separated JSON is cumbersome
In my experience, most large JSON files where using SIMD decoding would make sense come in a newline-separated form. Oftentimes they are additionally stored in a compressed form and only stream-decompressed for parsing, e.g. using Unix piping such as lz4 -d < big.json | myapp, allowing for decompression to occur on a second CPU core and parsing in a way that is both memory and disk IO efficient.
Unfortunately, this kind of parsing is not at all straightforward to do with simd-json. The usual no-copy BufRead::lines() workflow is killed by the fact that Lines yields immutable &strs while simd-json requires mutable ones. I couldn't find any documentation on why this is the case, but I assume that simd-json temporarily patches bytes for some of the SIMD magic to work. Using BufRead::read_line results in unnecessary copying of the line and manual \n suffix stripping, being both cumbersome and slower than just using serde-json (in my absolutely non-scientific test run).
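For reference, here is a minimal sketch of the read-into-a-buffer workflow described above, using only the standard library. It reuses one Vec<u8> across lines (so there is no per-line allocation, though each line is still copied out of the reader's internal buffer) and strips the line terminator by hand. The function name for_each_line_mut and the handle callback are hypothetical; handle marks the spot where a parser that takes &mut [u8] would be called.

```rust
use std::io::{self, BufRead};

/// Reads newline-delimited records from `reader` into one reusable buffer and
/// passes each non-empty record to `handle` as a mutable byte slice.
/// Returns the number of records handed to `handle`.
///
/// `for_each_line_mut` and `handle` are illustrative names, not part of any
/// library API.
pub fn for_each_line_mut<R, F>(mut reader: R, mut handle: F) -> io::Result<usize>
where
    R: BufRead,
    F: FnMut(&mut [u8]),
{
    let mut buf: Vec<u8> = Vec::new();
    let mut count = 0;
    loop {
        // `clear` keeps the allocation and only drops the previous line.
        buf.clear();
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        // Strip a trailing "\n" or "\r\n" by hand.
        if buf.last() == Some(&b'\n') {
            buf.pop();
            if buf.last() == Some(&b'\r') {
                buf.pop();
            }
        }
        if buf.is_empty() {
            continue; // skip blank lines
        }
        handle(&mut buf[..]);
        count += 1;
    }
    Ok(count)
}
```

Because the buffer is overwritten on every iteration, anything the callback produces must not borrow from the slice beyond the call, which is one reason this pattern feels clumsy for zero-copy parsing.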
I feel like it would be great if this lib could also provide a SIMD-accelerated lines_mut, which would increase this library's usability immensely.
It is also very much possible that there is an obvious way to make this work which I just failed to see.
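As a rough illustration of what a lines_mut could look like when the whole input is already in memory, the sketch below builds on the standard library's slice::split_mut. It is not SIMD-accelerated and does not cover the streaming case from the post; the names lines_mut and trim_cr are hypothetical. Splitting on the raw newline byte relies on the fact that JSON requires line breaks inside strings to be escaped.

```rust
/// Drops one trailing '\r' so that "\r\n" line endings are handled too.
fn trim_cr(line: &mut [u8]) -> &mut [u8] {
    let end = match line.last() {
        Some(b'\r') => line.len() - 1,
        _ => line.len(),
    };
    &mut line[..end]
}

/// Yields every non-empty line of `buf` as a mutable slice, without copying.
///
/// Hypothetical helper for illustration; it is not part of simd-json.
pub fn lines_mut(buf: &mut [u8]) -> impl Iterator<Item = &mut [u8]> {
    buf.split_mut(|byte| *byte == b'\n')
        .map(trim_cr)
        .filter(|line| !line.is_empty())
}
```

Each yielded slice is a disjoint mutable view into the original buffer, so it could be passed to any parser that wants &mut [u8] while the values it produces borrow from that same buffer.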
Hi!
First, a bit of explanation: the reason why we use &mut [u8] or &mut str is that we use a form of in situ parsing. Instead of allocating memory for strings, we just re-use the existing buffer. There are a few ways around this, but none of them are pleasant thanks to Rust's borrow checker.
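To make the in situ idea concrete, here is a small standalone sketch. It is not simd-json's actual code, and the function name unescape_in_place is hypothetical. It decodes the body of a JSON string by writing the unescaped bytes back into the same buffer, which is why the input has to be mutable. Only a handful of single-character escapes are handled; \uXXXX is deliberately left out.

```rust
/// Decodes a JSON string body (without the surrounding quotes) in place and
/// returns the decoded text, borrowed from `buf`.
///
/// Illustrative only: handles \" \\ \/ \n \t \r and returns `None` for any
/// other escape, a dangling backslash, or invalid UTF-8.
pub fn unescape_in_place(buf: &mut [u8]) -> Option<&str> {
    let mut read = 0;
    let mut write = 0;
    while read < buf.len() {
        let byte = buf[read];
        let decoded = if byte == b'\\' {
            read += 1;
            match *buf.get(read)? {
                b'"' => b'"',
                b'\\' => b'\\',
                b'/' => b'/',
                b'n' => b'\n',
                b't' => b'\t',
                b'r' => b'\r',
                _ => return None, // e.g. \uXXXX, not covered by this sketch
            }
        } else {
            byte
        };
        // `write` never runs ahead of `read`, so unread input is not clobbered.
        buf[write] = decoded;
        write += 1;
        read += 1;
    }
    std::str::from_utf8(&buf[..write]).ok()
}
```

The decoded text is never longer than the escaped input, so it always fits in the space the original occupied and no extra allocation is needed. The price is that the returned &str keeps the buffer borrowed, which is where the borrow checker friction mentioned above comes from.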
That said, with 0.3, simdjson (upstream) has implemented a very fast option for parsing newline-separated JSON, but we haven't had a chance to look at it yet :)
@Licenser If I can be reassuring, the JSON stream parser (that's what we call it) is conceptually simple and involves few lines of code. Porting the idea would not be a lot of work. It also lends itself to parallelization, which is cool.
Ja, I'm not worried :) the simdjson code is beautiful so it is always a pleasure to port :D just juggling the usual 10000 things to find the time 😂
Not that this is done, but #194 is a nicer ticket name for this, so I'll combine the two into that.