jq
Parsing newline-delimited JSON
Hi! I'm trying to parse multiple TB of log files. Each log file is about 20 GB and contains newline-delimited JSON, with one JSON object per line. The problem is that the data is unstructured: any given line may or may not have some of the keys. I'm trying to build a SQL warehouse, or use Spark or something (I'm open to any new tool if recommended), to parse these huge log files. Any thoughts? Should I use jq to parse them and dump the output into something like Redshift? Is that even possible with jq?
Here's my current workaround (for bash):
while read -r line; do echo "$line" | jq '.filter.goes.here'; done < inputfile
@renderit - jq handles JSON streams by design. Thus, even jq 1.4 should easily handle "JSON Lines" without your having to do anything special. (With jq 1.5, you have the additional option of using inputs.)
I'm not quite sure what your concerns are -- as far as I can tell, you certainly don't want to be doing what @paulmelnikow suggested -- but I was thinking that perhaps they might be addressed in a short introduction to jq that I wrote: "A Stream-Oriented Introduction to jq". It's just a first draft, so I'd appreciate your feedback.
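To make that concrete, here is a minimal sketch of filtering such newline-delimited logs where keys may be missing; the file name logs.ndjson and the fields .user.id and .timestamp are only stand-ins for the OP's actual data:
# jq reads one JSON value per line; the trailing "?" tolerates absent or non-object keys
jq -c 'select(.user.id? != null) | {id: .user.id, ts: .timestamp}' logs.ndjson
# jq 1.5+: the same thing with null input (-n) and the inputs builtin
jq -nc 'inputs | select(.user.id? != null) | {id: .user.id, ts: .timestamp}' logs.ndjson
# tab-separated output, e.g. for loading into Redshift; missing keys become empty strings
jq -r '[.timestamp? // "", .user.id? // ""] | @tsv' logs.ndjson > logs.tsv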
Indeed, this works just as well. I'm not dealing with a large file.
jq '.filter.goes.here' < inputfile
Thus, even jq 1.4 should easily handle "JSON Lines", without your having to do anything special.
Would be great to add this info to the docs!
Does it not make more sense to represent them as top-level arrays? Then you could extract subparts by ranges:
jq .[0:3] input.ndjson
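For what it's worth, something close to that already works on newline-delimited input; a sketch, assuming input.ndjson holds one JSON object per line:
# -s/--slurp collects all values into one top-level array first, so ranges work (whole file in memory)
jq -s '.[0:3]' input.ndjson
# jq 1.5+: take the first three values and stop, without reading the rest of the file
jq -n '[limit(3; inputs)]' input.ndjson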
@eadmaster — Here's an alternative if you need to read a specific range of lines of a file.
sed -n '10,15p;16q' inputfile | jq '.filter.goes.here'
In the above example, it only reads lines 10 through 15. The 16q instructs sed to exit on the next line, to prevent scanning the remainder of inputfile.
Though, I do agree, it'd be nice to have a range option in jq or the option to access as a top-level array.
Indeed, this works just as well. I'm not dealing with a large file.
jq '.filter.goes.here' < inputfile
Thus, even jq 1.4 should easily handle "JSON Lines", without your having to do anything special.
Would be great to add this info to the docs!
Has this been added to the docs? I didn't find it there, and I think it's a great addition. I was trying the first option, the do-while bash loop, and almost missed this one!
Can you please add a simple check: if the file extension is .ndjson, then parse the file as a stream?
Because now when I tried to parse:
$ jq resp.ndjson
jq: error: resp/0 is not defined at <top-level>, line 1:
resp.ndjson
jq: 1 compile error
But jq < resp.ndjson worked great.
@stokito I think you want jq . resp.ndjson. The reason jq < resp.ndjson works, I think, is that jq assumes the program to be "." if no program argument was specified and stdin or stdout is not a tty (https://github.com/stedolan/jq/blob/master/src/main.c#L610); handy, but a bit confusing.
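Putting the three invocations from above side by side, with resp.ndjson as the input file:
# explicit program argument: "." is the filter, resp.ndjson is the input file
jq . resp.ndjson
# no program argument, but stdin is redirected, so jq falls back to the "." filter
jq < resp.ndjson
# no redirection: the first argument is taken to be the jq program, hence the compile error
jq resp.ndjson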
@wader thank you for the tip, I'll use it. Still, I believe that jq could switch to stream mode by itself if the extension is .ndjson.
Not sure I follow what you want that's different from the current default behaviour, which is to keep reading JSON values from the inputs as long as there is only whitespace between them.
$ jq . <(echo -e '{"a":1}\n{"b":2}\n')
{
"a": 1
}
{
"b": 2
}
Or by stream do you mean tostream/fromstream?
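For reference, tostream turns a value into a stream of [path, value] events and fromstream reverses it; a quick illustration:
$ echo '{"a":{"b":1}}' | jq -c 'tostream'
[["a","b"],1]
[["a"]]
$ echo '{"a":{"b":1}}' | jq -c 'fromstream(tostream)'
{"a":{"b":1}}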
This doesn't seem to be an issue at all. Besides, transforming newline-delimited JSON into proper JSON is as easy as
sed '1 s/^/[/ ; 2,$ s/^/,/; $ s/$/]/' data.json
which can be piped to jq or whatever at virtually no performance hit, unless perhaps one has absurdly large JSON objects or something.
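For instance, combined with the earlier range idea (data.json being the newline-delimited file):
# wrap the lines into a single JSON array on the fly, then slice it
sed '1 s/^/[/ ; 2,$ s/^/,/; $ s/$/]/' data.json | jq '.[0:3]'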
I came across this issue because I was parsing the output of yt-dlp with jq, and it also dumps newline-separated JSON objects. Thanks to the wording in here I was able to discover jq's --slurp option and could resolve my issues. Just to clarify: the OP's problem could not be resolved by that, because the implementation would actually try to read in all the data at once, and this is prohibitive with large data?
That was the alternative explanation :D
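To make the memory trade-off concrete, here is a sketch with logs.ndjson standing in for any newline-delimited input; both commands just count the objects:
# -s/--slurp reads every input value into a single array before the filter runs; memory grows with file size
jq -s 'length' logs.ndjson
# -n with inputs consumes one value at a time, so memory stays roughly constant
jq -n 'reduce inputs as $obj (0; . + 1)' logs.ndjson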