simd-json

Parsing newline separated JSON is cumbersome

Open athre0z opened this issue 4 years ago • 3 comments

In my experience, most large JSON files where SIMD decoding would make sense come in newline-separated form. Oftentimes they are additionally stored compressed and only stream-decompressed for parsing, e.g. using Unix piping such as lz4 -d < big.json | myapp, which lets decompression run on a second CPU core and keeps parsing efficient in both memory and disk IO.

Unfortunately, this kind of parsing is not at all straightforward to do with simd-json. The usual no-copy BufRead::lines() workflow is killed by the fact that Lines yields immutable &strs while simd-json requires mutable ones. I couldn't find any documentation on why this is the case, but I assume that simd-json temporarily patches bytes for some of the SIMD magic to work. Using BufRead::read_line results in unnecessary copying of the line and manual \n suffix stripping, which is both cumbersome and slower than just using serde-json (in my absolutely non-scientific test run).
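For reference, the read_line-style workaround described above can be sketched with the standard library alone. This is a minimal sketch, not simd-json API: for_each_line_mut and its handle callback are hypothetical names, and handle stands in for whatever parser needs a mutable slice. It reuses one buffer across lines, so there is no allocation per line, but each line is still copied once out of the reader's internal buffer.

```rust
use std::io::{self, BufRead};

/// Calls `handle` once per non-empty line with a mutable view of a reused buffer.
/// `handle` is a hypothetical stand-in for a parser that needs `&mut [u8]`.
fn for_each_line_mut<R, F>(mut reader: R, mut handle: F) -> io::Result<()>
where
    R: BufRead,
    F: FnMut(&mut [u8]),
{
    let mut buf: Vec<u8> = Vec::new();
    loop {
        buf.clear(); // keeps the allocation, drops the previous line
        if reader.read_until(b'\n', &mut buf)? == 0 {
            return Ok(()); // end of input
        }
        // Strip a trailing "\n" or "\r\n" by shrinking the view, not by copying.
        let mut end = buf.len();
        if buf[end - 1] == b'\n' {
            end -= 1;
            if end > 0 && buf[end - 1] == b'\r' {
                end -= 1;
            }
        }
        if end > 0 {
            handle(&mut buf[..end]);
        }
    }
}
```

With a pipe such as the lz4 example, the reader would be a locked std::io::stdin(); the copy into buf is the cost the post is complaining about.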

I feel like it would be great if this lib could also provide a SIMD-accelerated lines_mut, which would increase this library's usability immensely.
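For data that is already in memory, something shaped like the requested lines_mut can be approximated with the standard slice method split_mut. The sketch below is an assumption about what such a helper might look like, not an existing simd-json function, and it is scalar rather than SIMD-accelerated. It yields disjoint mutable slices, one per non-empty line, without copying.

```rust
/// Hypothetical `lines_mut`: yields disjoint mutable slices of `buf`,
/// one per non-empty line, with a trailing '\r' removed from each.
fn lines_mut<'a>(buf: &'a mut [u8]) -> impl Iterator<Item = &'a mut [u8]> + 'a {
    buf.split_mut(|b| *b == b'\n')
        .map(|line| {
            let n = line.len();
            if n > 0 && line[n - 1] == b'\r' {
                &mut line[..n - 1] // drop the carriage return of a "\r\n" ending
            } else {
                line
            }
        })
        .filter(|line| !line.is_empty())
}
```

Each yielded slice could then be handed to a parser that wants to modify its input. This does not help with the streaming case, where the whole input never sits in one buffer.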

It is also very much possible that there is an obvious way to make this work which I just failed to see.

athre0z, Apr 15 '20 22:04

Hi!

First, a bit of explanation: the reason we use &mut [u8] or &mut str is that we do a form of in-situ parsing. Instead of allocating memory for strings, we just reuse the existing buffer. There are a few ways around this, but none of them are pleasant thanks to Rust's borrow checker.
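To illustrate why in-situ parsing needs a mutable buffer, here is a toy sketch of decoding a JSON string's escape sequences directly into the input. It is not simd-json's implementation: unescape_in_place is a hypothetical name, it is scalar, and it only handles a handful of escapes. Because the decoded form is never longer than the escaped form, the result can be written over the input and returned as a borrow of the same buffer.

```rust
/// Toy in-situ unescaper. `buf` holds the bytes between the quotes of a JSON
/// string. The escapes \" \\ \/ \n \t are decoded over the input itself and
/// the decoded prefix of the same buffer is returned. Nothing is allocated,
/// but `buf` is modified, which is why it has to be `&mut`.
fn unescape_in_place(buf: &mut [u8]) -> Option<&str> {
    let mut read = 0;
    let mut write = 0;
    while read < buf.len() {
        let byte = if buf[read] == b'\\' {
            read += 1;
            match *buf.get(read)? {
                b'"' => b'"',
                b'\\' => b'\\',
                b'/' => b'/',
                b'n' => b'\n',
                b't' => b'\t',
                _ => return None, // other escapes (e.g. \uXXXX) are out of scope here
            }
        } else {
            buf[read]
        };
        buf[write] = byte; // write <= read, so unread input is never clobbered
        write += 1;
        read += 1;
    }
    std::str::from_utf8(&buf[..write]).ok()
}
```

The returned &str borrows from buf, so the buffer has to outlive every parsed value. That is where the borrow checker friction with line-by-line readers comes from.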

That said, with 0.3, simdjson (upstream) has implemented a very fast option for parsing newline-separated JSON, but we haven't had a chance to look at it yet :)

Licenser, Apr 15 '20 22:04

@Licenser If I can be reassuring, the JSON stream parser (that's what we call it) is conceptually simple and involves only a few lines of code. Porting the idea would not be a lot of work. It is also amenable to parallelization, which is cool.

lemire, Apr 17 '20 00:04

Yeah, I'm not worried :) The simdjson code is beautiful, so it is always a pleasure to port :D Just juggling the usual 10000 things to find the time 😂

Licenser, Apr 17 '20 08:04

Not that this is done, but #194 is a nicer ticket name for this, so I'll combine the two into that.

Licenser, Oct 21 '22 12:10