
parse stream?

Open: Laeeth opened this issue 9 years ago • 1 comment

Hi Marco.

Small enhancement request. (Apologies if it's implemented already and I didn't see it.)

Quite often one wants to parse a JSON stream (like the one from Twitter, or the Reddit comment dump). It would be nice to have that implemented as part of the library, so it's very easy to use. I have written a small range to do this, but it's quite crude, and I haven't paid attention to efficiency. I can make a pull request if you would like (and you can refine it later), but you may prefer to implement it yourself. Let me know.
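To make the request concrete, here is a minimal sketch of the kind of record-by-record iteration meant above. It is written in Python with the standard `json` module purely for illustration; it is not the range from the gist and not this library's API. It assumes the stream holds one JSON object per line, and the name `iter_json_lines` and the sample records are hypothetical.

```python
import io
import json
from typing import Iterator, TextIO


def iter_json_lines(stream: TextIO) -> Iterator[object]:
    """Lazily yield one parsed JSON value per non-empty line of `stream`.

    Assumption: the input is newline-delimited JSON (one value per line).
    Only the current line is held in memory, never the whole stream.
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical sample data standing in for a large comment dump.
sample = io.StringIO(
    '{"author": "a", "score": 1}\n'
    "\n"
    '{"author": "b", "score": 2}\n'
)
records = list(iter_json_lines(sample))
```

In real use the `StringIO` would be replaced by an open file or socket wrapper, and the records would be consumed one at a time instead of being collected into a list.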

Here is some very simple code to process Reddit comments: https://gist.github.com/Laeeth/bbd08dd576cb7aeff444

The original comments are here: https://archive.org/details/2015_reddit_comments_corpus

On one core it takes 35 minutes to process one month's data (35 GB).

Thanks for getting in touch by email. That was about something else; I have had to figure out some other things first, but will respond shortly.

Laeeth.

Laeeth, Dec 05 '15 14:12

Sorry for the late answer. The dilemma is that treating the entire JSON text as one memory block is fundamental for the code as it stands. Most data in JSON has no length limit (strings, numbers), so a lot of places need to become aware of the sliding window that comes with streaming. I do see the use case with huge files and think that maybe RapidJSON can shine here. I'll keep the report open anyways.
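The sliding-window problem can be illustrated with a small sketch. It is Python with the standard `json` module, not this library and not RapidJSON, and the function name `iter_json_values` is hypothetical. The text arrives in arbitrary chunks, so a value may be cut anywhere, and the unparsed tail has to be carried over until more data arrives. The sketch assumes every top-level value is an object or array. A bare top-level number split across two chunks would be accepted too early, which is exactly the "no length limit" difficulty described above.

```python
import json
from typing import Iterable, Iterator


def iter_json_values(chunks: Iterable[str]) -> Iterator[object]:
    """Yield complete top-level JSON values from text arriving in chunks.

    Assumption: top-level values are objects or arrays, so a truncated
    value always fails to decode instead of decoding as a shorter value.
    """
    decoder = json.JSONDecoder()
    buffer = ""  # the sliding window: text received but not yet parsed
    for chunk in chunks:
        buffer += chunk
        while True:
            buffer = buffer.lstrip()  # raw_decode rejects leading whitespace
            if not buffer:
                break
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # value is incomplete (or malformed): wait for more data
            yield value
            buffer = buffer[end:]  # drop the consumed prefix
    if buffer.strip():
        raise ValueError("stream ended inside a JSON value")


# Chunk boundaries deliberately fall inside a key, an array and a string.
chunks = ['{"a": 1}{"b"', ": [1, 2", ', 3]}\n{"c": "x', 'y"}']
values = list(iter_json_values(chunks))
```

Note that this version re-parses the buffered prefix every time a chunk arrives and cannot tell malformed input from truncated input until the stream ends. A parser built around one contiguous memory block avoids both costs, which is part of why retrofitting streaming is not trivial.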

mleise, Jun 08 '16 22:06