
JSONL parser and Pick

Open · eliottvincent opened this issue 1 year ago · 1 comment

Hey! First of all, thanks for your library.

For quite some time I have been using your library to process JSON files like the following (example):

[
    {"id":"1111","data":{"foo":"foo"}},
    {"id":"2222","data":{"foo":"foo","bar":"bar"}}
]

And this pipeline:

var stream_chain = require("stream-chain").chain;
var stream_json_parser = require("stream-json").parser;
var stream_json_pick = require("stream-json/filters/Pick").pick;
var stream_json_object = require("stream-json/streamers/StreamObject");

var _regex = new RegExp("data");

let _pipeline = stream_chain([
  _stream, // the source Readable stream (e.g. a file read stream)

  stream_json_parser(),

  stream_json_pick({
    filter : _regex
  }),

  stream_json_object.streamObject()
]);

Given my previous example, that pipeline would output the following result:

{ "key": "foo", "value": "foo" }
{ "key": "foo", "value": "foo" }
{ "key": "bar", "value": "bar" }

I'm now changing my file format to JSONL. So the file is now:

{"id":"1111","data":{"foo":"foo"}}
{"id":"2222","data":{"foo":"foo","bar":"bar"}}

I'd like to adapt my pipeline, but I'm quite stuck. I know about the jsonStreaming option, and it works, but my files contain hundreds of thousands of lines, and I saw that you provide stream-json/jsonl/Parser, which seems to be more performance-focused ("The only reason for its existence is improved performance").

So I tried stream-json/jsonl/Parser, but my pipeline output is now empty. From my understanding, it's because Pick doesn't accept the assembled values that jsonl/Parser streams (Pick expects tokens).

Any insight on how to adapt my existing pipeline? Thanks!

eliottvincent · Sep 15 '23 20:09

The major difference is that JSONL is a huge stream of small objects, where "small" is defined as "fits in memory and can be converted to a JS object". You should still use the main Parser if this assumption doesn't hold.

Parser streams tokens. Pick operates on tokens: using a regular expression or a function, it passes through a part of the token stream. Then we can work on that part by assembling the pieces we need into objects or by doing more token editing.
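
For illustration, an unverified sketch that logs the tokens the main Parser emits (the exact sequence depends on parser options such as packValues):

var stream_chain = require("stream-chain").chain;
var stream_json_parser = require("stream-json").parser;
var Readable = require("stream").Readable;

stream_chain([
  // a tiny in-memory source standing in for a real file stream
  Readable.from(['{"data":{"foo":"foo"}}']),
  stream_json_parser()
]).on("data", function (token) { console.log(token); });

// prints tokens along these lines (abbreviated):
// { name: 'startObject' }
// { name: 'startKey' } ... { name: 'keyValue', value: 'data' }
// { name: 'startObject' } ... { name: 'keyValue', value: 'foo' }
// { name: 'startString' } ... { name: 'stringValue', value: 'foo' }
// { name: 'endObject' }
// { name: 'endObject' }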

jsonl/Parser reads a line and assembles the whole line into a top-level object. Now you are in the JS realm and can pick out the parts you need with a plain function.
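
For reference, an unverified sketch: jsonl/Parser produces one {key, value} item per line, where key is the zero-based line index and value is the assembled object:

var stream_chain = require("stream-chain").chain;
var stream_jsonl_parser = require("stream-json/jsonl/Parser").parser;
var Readable = require("stream").Readable;

stream_chain([
  Readable.from(['{"id":"1111","data":{"foo":"foo"}}\n{"id":"2222","data":{"foo":"foo","bar":"bar"}}\n']),
  stream_jsonl_parser()
]).on("data", function (item) { console.log(item); });

// { key: 0, value: { id: '1111', data: { foo: 'foo' } } }
// { key: 1, value: { id: '2222', data: { foo: 'foo', bar: 'bar' } } }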

So for JSONL the pipeline should be something like this (an unverified sketch):

var stream_chain = require("stream-chain").chain;
var stream_jsonl_parser = require("stream-json/jsonl/Parser").parser;

let _pipeline = stream_chain([
  _stream,

  stream_jsonl_parser(), // JSONL!

  // each item is {key, value}; value is the assembled top-level object
  function* ({value}) {
    if (value.data.foo) yield {key: 'foo', value: value.data.foo};
    if (value.data.bar) yield {key: 'bar', value: value.data.bar};
    // replace the function body with better, more realistic, and more robust code
    // if you want to dump all keys of data, use Object.keys() or Object.entries()
    // instead of a generator function you can use many() of stream-chain (can be more performant)
  }
]);
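
For the generic case mentioned in the comments, a sketch that dumps every key of data with Object.entries():

function* ({value}) {
  // emit one {key, value} pair per property of data, whatever the keys are
  for (const [key, val] of Object.entries(value.data || {})) {
    yield {key, value: val};
  }
}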

If you need to use the main parser, use the jsonStreaming option. The rest of the pipeline will be similar. The only difference will be the _regex you use to pick objects (or a function that checks a path). In your original case, you used paths like "0.data.foo". In the JSONL case, they will be "data.foo": top-level objects are not numbered and are considered separately, while the original case had an array as the top-level wrapping value.
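
An unverified sketch of that variant (note the filter regex is anchored to the new, unnumbered paths):

var stream_chain = require("stream-chain").chain;
var stream_json_parser = require("stream-json").parser;
var stream_json_pick = require("stream-json/filters/Pick").pick;
var stream_json_object = require("stream-json/streamers/StreamObject");

let _pipeline = stream_chain([
  _stream,

  stream_json_parser({jsonStreaming: true}), // parses a stream of concatenated JSON values

  stream_json_pick({
    filter: /^data$/ // paths are now "data", not "0.data", "1.data", ...
  }),

  stream_json_object.streamObject()
]);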

uhop · Jan 18 '24 18:01