fast
parse stream?
Hi Marco.
Small enhancement request. (Apologies if it's implemented already and I didn't see it.)
Quite often one wants to parse a JSON stream (like from Twitter or the Reddit comment dump). It would be nice to have that implemented as part of the library, so it's very easy to use. I have written a small range to do this, but it's quite crude and I haven't paid attention to efficiency. I can make a pull request if you would like (and you can refine it later), but you may prefer to implement it yourself - let me know.
Here is some very simple code to process Reddit comments: https://gist.github.com/Laeeth/bbd08dd576cb7aeff444
The original comments are here: https://archive.org/details/2015_reddit_comments_corpus
On one core it takes 35 minutes to process one month's data (35 Gig).
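For illustration, here is a minimal sketch of what such a range could look like for newline-delimited JSON (the Reddit dump stores one comment object per line). It uses std.json from Phobos rather than fast's own parser, and the name jsonByLine is just a placeholder, not part of either library:

```d
import std.algorithm : filter, map;
import std.json : JSONValue, parseJSON;
import std.stdio : File, stdin, writeln;
import std.string : strip;

// Lazily parse a newline-delimited JSON stream, one document per line.
// Each element of the returned range is a parsed JSONValue.
auto jsonByLine(File f)
{
    return f.byLineCopy
            .map!(l => l.strip)
            .filter!(l => l.length > 0)
            .map!(l => parseJSON(l));
}

void main()
{
    size_t count;
    foreach (JSONValue comment; jsonByLine(stdin))
    {
        // e.g. comment["body"].str for the text of a Reddit comment
        ++count;
    }
    writeln(count, " comments parsed");
}
```

On the Reddit dump you would pass the decompressed file instead of stdin; anything further (filtering, aggregation) composes on top of the range with the usual std.algorithm pipelines.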
Thanks for getting in touch by email. That was about something else - I have had to figure out some other things, but I will respond shortly.
Laeeth.
Sorry for the late answer. The dilemma is that treating the entire JSON text as one memory block is fundamental to the code as it stands. Most data in JSON has no length limit (strings, numbers), so a lot of places would need to become aware of the sliding window that comes with streaming. I do see the use case with huge files and think that maybe RapidJSON can shine here. I'll keep the report open anyway.
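To make the constraint concrete, below is a rough sketch (not fast's internals, purely illustrative) of the kind of sliding-window buffer a streaming tokenizer has to work against: the unconsumed tail is kept at the front of a fixed buffer and the next chunk is read in after it, so a string or number split across two reads can still be scanned contiguously. Every scanner that currently assumes one contiguous block would need to call something like refill() whenever it runs out of bytes mid-token (and a token larger than the buffer would additionally need the buffer to grow, which is omitted here):

```d
import core.stdc.string : memmove;
import std.stdio : File;

// Rough sketch of the sliding-window refill a streaming tokenizer needs.
struct SlidingWindow
{
    File source;
    ubyte[] storage;   // fixed-size backing buffer
    size_t start;      // first byte not yet consumed by the tokenizer
    size_t end;        // one past the last valid byte

    this(File f, size_t capacity = 64 * 1024)
    {
        source = f;
        storage = new ubyte[](capacity);
    }

    // Bytes currently visible to the tokenizer.
    ubyte[] window() { return storage[start .. end]; }

    // Mark n bytes of the window as consumed.
    void consume(size_t n) { start += n; }

    // Slide the unconsumed tail to the front and read the next chunk;
    // returns false once the input is exhausted.
    bool refill()
    {
        immutable tail = end - start;
        memmove(storage.ptr, storage.ptr + start, tail);
        start = 0;
        end = tail;
        const got = source.rawRead(storage[end .. $]);
        end += got.length;
        return got.length > 0;
    }
}
```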