
parsing new line delimited json

Open renderit opened this issue 7 years ago • 11 comments

Hi! I'm trying to parse multiple TB of log files. Each log file is about 20 GB and contains newline-delimited JSON: one JSON object per line. The problem is that the data is unstructured: any given line may or may not have certain keys. I'm trying to create a SQL warehouse, or use Spark or something (I'm open to any new tool if recommended), to parse these huge log files. Any thoughts? Should I use jq to parse and dump into something like Redshift? Is it even possible with jq?

renderit avatar Sep 03 '16 01:09 renderit

Here's my current workaround (for bash):

while read -r line; do echo "$line" | jq '.filter.goes.here'; done < inputfile

paulmelnikow avatar Nov 09 '17 21:11 paulmelnikow

@renderit - jq handles JSON streams, by design. Thus, even jq 1.4 should easily handle "JSON Lines", without your having to do anything special. (With jq 1.5, you have the additional option of using inputs.)
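A quick sketch of both approaches, using a hypothetical two-line sample (the field names are placeholders, not from the original logs):

```shell
# jq applies the filter to every top-level JSON value in a stream,
# so JSON Lines input needs no special handling:
printf '{"level":"info","msg":"start"}\n{"level":"error","msg":"boom"}\n' |
  jq -c '.msg'
# "start"
# "boom"

# With jq 1.5+, -n together with inputs reads the same stream explicitly,
# e.g. to collect all values into one array:
printf '{"level":"info","msg":"start"}\n{"level":"error","msg":"boom"}\n' |
  jq -cn '[inputs | .msg]'
# ["start","boom"]
```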

I'm not quite sure what your concerns are -- as far as I can tell, you certainly don't want to be doing what @paulmelnikow suggested -- but I was thinking that perhaps they might be addressed in a short introduction to jq that I wrote: "A Stream-Oriented Introduction to jq". It's just a first draft, so I'd appreciate your feedback.

pkoppstein avatar Nov 09 '17 23:11 pkoppstein

Indeed, this works just as well. I'm not dealing with a large file.

jq '.filter.goes.here' < inputfile

> Thus, even jq 1.4 should easily handle "JSON Lines", without your having to do anything special.

Would be great to add this info to the docs!

paulmelnikow avatar Nov 10 '17 05:11 paulmelnikow

Would it not make more sense to represent them as top-level arrays?

That way you could extract subparts by range: jq '.[0:3]' input.ndjson

eadmaster avatar Jul 09 '18 15:07 eadmaster

@eadmaster — Here's an alternative if you need to read a specific range of lines of a file.

sed -n '10,15p;16q' inputfile | jq '.filter.goes.here'

In the example above, sed passes only lines 10 through 15 to jq. The 16q instructs sed to quit as soon as it reaches line 16, so the remainder of inputfile is never scanned.

Though, I do agree, it'd be nice to have a range option in jq or the option to access as a top-level array.
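For what it's worth, a line range can also be expressed in jq alone, without slurping the whole file (a sketch assuming jq 1.5+ for inputs; inputfile stands for the same placeholder as above). limit stops consuming input after the first 15 values, much like sed's 16q, and the [9:] slice then drops the first nine:

```shell
# Values 10 through 15 of a JSON Lines file, read incrementally:
jq -cn '[limit(15; inputs)][9:][]' < inputfile
```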

andyfleming avatar Mar 08 '19 00:03 andyfleming

> Indeed, this works just as well. I'm not dealing with a large file.
>
> jq '.filter.goes.here' < inputfile
>
> Thus, even jq 1.4 should easily handle "JSON Lines", without your having to do anything special.
>
> Would be great to add this info to the docs!

Has this been added to the docs? I didn't find it there, and I think it's a great addition. I was trying the first option, the line-by-line bash while-loop, and I almost missed this one!

108krohan avatar Jul 09 '19 14:07 108krohan

Can you please add a simple check: if the file extension is .ndjson, then parse the file as a stream? Because when I tried to parse it, this happened:

$ jq resp.ndjson
jq: error: resp/0 is not defined at <top-level>, line 1:
resp.ndjson
jq: 1 compile error

But jq < resp.ndjson worked great.

stokito avatar Nov 01 '21 15:11 stokito

@stokito I think you want jq . resp.ndjson. The reason jq < resp.ndjson works, I think, is that jq assumes the program to be "." if no program argument was specified and stdin or stdout is not a tty (https://github.com/stedolan/jq/blob/master/src/main.c#L610) -- handy but a bit confusing.
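A small demonstration of that fallback (assuming a recent jq release with this behavior):

```shell
# With stdin coming from a pipe (not a tty) and no program argument,
# jq falls back to the identity program ".":
printf '{"a":1}\n' | jq -c
# {"a":1}

# Equivalent, with the program spelled out:
printf '{"a":1}\n' | jq -c '.'
# {"a":1}
```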

wader avatar Nov 01 '21 16:11 wader

@wader thank you for the tip, I'll use it. Still, I believe jq could switch to stream mode by itself if the extension is .ndjson.

stokito avatar Nov 02 '21 11:11 stokito

Not sure I follow what you want that differs from the current default behaviour, which is to keep reading JSON values from the inputs as long as there is only whitespace between them.

$ jq . <(echo -e '{"a":1}\n{"b":2}\n')
{
  "a": 1
}
{
  "b": 2
}

Or by stream you mean tostream/fromstream?

wader avatar Nov 02 '21 11:11 wader

This doesn't seem to be an issue at all. Besides, transforming newline-delimited JSON into a proper JSON array is as easy as

sed '1 s/^/[/ ; 2,$ s/^/,/; $ s/$/]/' data.json

which can be piped to jq or whatever at virtually no performance cost, unless perhaps one has absurdly large JSON objects or something.
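For comparison, jq's own --slurp (-s) option performs the same array-wrapping without the sed pass (a sketch on a hypothetical two-line sample):

```shell
# The sed rewrite wraps the lines into a single JSON array...
printf '{"a":1}\n{"b":2}\n' |
  sed '1 s/^/[/ ; 2,$ s/^/,/; $ s/$/]/' | jq -c '.'
# [{"a":1},{"b":2}]

# ...which is what jq's --slurp option does directly:
printf '{"a":1}\n{"b":2}\n' | jq -c -s '.'
# [{"a":1},{"b":2}]
```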

CervEdin avatar Jan 12 '22 12:01 CervEdin

I came across this issue because I was parsing the output of yt-dlp with jq, and it also dumps newline-separated JSON objects. Thanks to the wording in here I was able to discover jq's --slurp option, which resolved my issue. Just to clarify: the OP's problem could not be solved that way, because --slurp actually tries to read in all the data at once, and that is prohibitive with large data?
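A sketch of that distinction (huge.ndjson is a hypothetical placeholder): --slurp materializes the whole input as one array, while folding over inputs processes one value at a time and keeps memory roughly constant:

```shell
# -s builds one in-memory array of all inputs -- costly at 20 GB:
# jq -s 'length' huge.ndjson

# A streaming count instead reduces over inputs one value at a time:
# jq -n 'reduce inputs as $line (0; . + 1)' huge.ndjson

# The same two programs on a tiny inline sample:
seq 1 5 | jq -s 'length'                             # 5
seq 1 5 | jq -n 'reduce inputs as $line (0; . + 1)'  # 5
```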

stefanct avatar Oct 26 '23 07:10 stefanct

That was the alternative explanation :D

stefanct avatar Oct 26 '23 07:10 stefanct