stream-json
JSONL parser and Pick
Hey! First of all, thanks for your library.
For quite some time I have been using it to process JSON files like the following (example):
[
{"id":"1111","data":{"foo":"foo"}},
{"id":"2222","data":{"foo":"foo","bar":"bar"}}
]
And this pipeline:
var stream_chain = require("stream-chain").chain;
var stream_json_parser = require("stream-json").parser;
var stream_json_pick = require("stream-json/filters/Pick").pick;
var stream_json_object = require("stream-json/streamers/StreamObject");

var _regex = new RegExp("data");

let _pipeline = stream_chain([
  _stream, // a Readable stream over the JSON file
  stream_json_parser(),
  stream_json_pick({filter: _regex}),
  stream_json_object.streamObject()
]);
Given my previous example, that pipeline would output the following result:
{ "key": "foo", "value": "foo" }
{ "key": "foo", "value": "foo" }
{ "key": "bar", "value": "bar" }
I'm now changing my file format to JSONL. So the file is now:
{"id":"1111","data":{"foo":"foo"}}
{"id":"2222","data":{"foo":"foo","bar":"bar"}}
I'd like to adapt my pipeline, but I'm quite stuck. I know about the jsonStreaming
option, and it works, but my files contain hundreds of thousands of lines, and I saw that you provide stream-json/jsonl/Parser,
which seems to be more performance-focused ("The only reason for its existence is improved performance").
So I tried stream-json/jsonl/Parser,
but my pipeline output is now empty. From my understanding, it's because Pick
doesn't accept what StreamValues emits.
Any insight on how to adapt my existing pipeline? Thanks!
The major difference is that JSONL is a huge stream of small objects, where "small" means "fits in memory and can be converted to a JS object". You should still use the main Parser
if this assumption doesn't hold.
Parser
streams tokens. Pick
operates on tokens, selecting them with a regular expression or a function, and passes through a part of the token stream. Then we can work on that part by assembling the pieces we need into objects or doing more token editing.
jsonl/Parser
reads a line and assembles the whole line into a top-level object. At that point you are in the JS realm and can pick out parts with a plain function.
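To make that shape concrete, here is a minimal stdlib-only sketch (no stream-json required) of what jsonl/Parser emits: one {key, value} pair per input line, where key is the zero-based line number and value is the parsed object. This is a simulation of the emitted shape, not the library's implementation:

```javascript
// Simulate jsonl/Parser output for the two-line JSONL file from the question.
// (A sketch of the emitted shape, not the library's actual parsing code.)
const jsonl = [
  '{"id":"1111","data":{"foo":"foo"}}',
  '{"id":"2222","data":{"foo":"foo","bar":"bar"}}'
].join('\n');

const emitted = jsonl
  .split('\n')
  .filter(line => line.trim() !== '')
  .map((line, index) => ({key: index, value: JSON.parse(line)}));

console.log(emitted[0]);                // {key: 0, value: {id: '1111', data: {foo: 'foo'}}}
console.log(emitted[1].value.data.bar); // 'bar'
```

Each item is already a fully assembled JS object, which is why token-level filters like Pick no longer apply downstream.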
So for JSONL the pipeline should be something like this (an unverified sketch):
var stream_chain = require("stream-chain").chain;
var stream_jsonl_parser = require("stream-json/jsonl/Parser").parser;

let _pipeline = stream_chain([
  _stream, // a Readable stream over the JSONL file
  stream_jsonl_parser(), // JSONL!
  function* ({value}) {
    if (value.data.foo) yield {key: 'foo', value: value.data.foo};
    if (value.data.bar) yield {key: 'bar', value: value.data.bar};
    // replace the function body with better, more realistic, and more robust code
    // if you want to dump all keys of data, use Object.keys() or Object.entries()
    // instead of a generator function you can use many() of stream-chain (can be more performant)
  }
]);
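The Object.entries() variant mentioned in the comments could look like this. dumpData is a hypothetical name, and the {key, value} input shape is the one jsonl/Parser emits; the sketch runs standalone, outside any pipeline:

```javascript
// Hypothetical helper: turn one jsonl/Parser item into {key, value} pairs,
// one per property of its data object, mirroring StreamObject's output shape.
function* dumpData({value}) {
  for (const [key, val] of Object.entries(value.data ?? {})) {
    yield {key, value: val};
  }
}

// Standalone usage with one parsed line:
const pairs = [...dumpData({value: {id: '2222', data: {foo: 'foo', bar: 'bar'}}})];
console.log(pairs); // [{key: 'foo', value: 'foo'}, {key: 'bar', value: 'bar'}]
```

Dropped into the pipeline sketch above in place of the hand-written generator, it would reproduce the original three-object output for the example file.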
If you need to use the main parser, use the jsonStreaming
option. The rest of the pipeline will be similar. The only difference is the _regex
you use to pick objects (or a function that checks a path). In your original case, you used paths like "0.data.foo".
In the JSONL case, they will be "data.foo":
top-level objects are not numbered and are considered separately, while the original case had an array as the top-level wrapping object.
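To illustrate just the path difference in plain JavaScript (no stream-json): an anchored filter regex written for the array-wrapped file will not match the paths produced under jsonStreaming, which is one way such a pipeline silently goes empty. The regexes here are illustrative assumptions, not the library's defaults:

```javascript
// Paths as a Pick-style filter would see them:
const arrayWrappedPath = '0.data'; // top-level array: a numeric index prefixes the path
const jsonlStylePath = 'data';     // jsonStreaming: separate top-level objects, no index

const arrayFilter = /^\d+\.data$/; // matches "0.data", "1.data", ...
const jsonlFilter = /^data$/;      // matches a bare "data" path

console.log(arrayFilter.test(arrayWrappedPath)); // true
console.log(arrayFilter.test(jsonlStylePath));   // false
console.log(jsonlFilter.test(jsonlStylePath));   // true
```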