Read from connection
It would be nice to be able to run a jq pipe from a connection (such as a URL, file or fifo), so that one can avoid loading all the JSON into memory.
I note that jq 1.5 has additional streaming options, including one that allows output in ndjson format, which means one would be able to run jq on large JSON and pipe it to `jsonlite::stream_in`.
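For example, something along these lines might work today with the command-line tool (a sketch, assuming a hypothetical local `big.json` and a jq binary on the PATH):

```r
# run command-line jq with -c so each result is emitted on its own line
# (i.e. ndjson), then stream the results into R without ever loading the
# raw JSON into memory
con <- pipe("jq -c '.stuff[]' big.json")
df <- jsonlite::stream_in(con)
```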
Not sure what the status is with jq 1.5. Is it easy enough to upgrade to, @jeroenooms? I remember something about it being problematic.
Seems like `stream_in` always returns a data.frame though, yes?
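(Though from the docs, `stream_in` takes a `handler` argument that is called on each page of records instead of binding everything into one data.frame. A minimal sketch, assuming ndjson input in a hypothetical `big.ndjson`:)

```r
# process each page of records as it arrives; nothing accumulates in memory
jsonlite::stream_in(file("big.ndjson"), handler = function(df) {
  # e.g. aggregate or write each chunk somewhere
  print(nrow(df))
}, pagesize = 1000)
```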
I have updated master to jq-1.5... not sure yet how we would go about implementing a connection filter.
@noamross is your JSON simple enough that you could do something like this?
```r
x <- '{
  "stuff": [
    {"a":1, "b":2},
    {"a":3, "b":4},
    {"a":5, "b":6},
    {"a":7, "b":8},
    {"a":9, "b":10}
  ]
}'
file <- tempfile()
writeLines(jqr::jq(x, ".stuff[]"), con = file)
jsonlite::stream_in(file(file))
#>   a  b
#> 1 1  2
#> 2 3  4
#> 3 5  6
#> 4 7  8
#> 5 9 10
```
@sckott I think his problem is that `x` is too large to hold in memory...
yeah
Thanks. Yes, the problem is the JSON is biggish (~1GB), and the goal of the package is to get data into the hands of people who may not have the best computers or much memory. But ATM I'm having some success getting our data providers to switch to ndjson, so no urgency; gonna stream in the data, reduce, and store locally.
It would be nice to see if this can be generalized to streaming data, but that will probably require rewriting some of the internals. Not sure how @richfitz feels about that.
I have zero objections, but am on holiday through mid-January. Feel free to try whatever; I think the C++ bits are pretty simple/minimal at present.
OK I'll mess around and we can discuss it IRL next month when we're both super jetlagged.
👍 for the ability for `jqr` to operate on an external file without having to load the whole thing into memory.
From a purely user-interface / convenience standpoint though, it would be nice for the `jq` function to be able to take a file path or URL as an argument, just so that its interface is consistent with the interface of things like `fromJSON`, the `jsonld_*` functions, etc.
thanks @cboettig

we are on jq 1.5 now:

```r
jqr::jq_version()
#> [1] "1.5rc2-174-g597c1f6"
```

so we can start to play with its streaming ability and see if that helps.
@jeroen any thoughts on this? Not sure how to add an option for a file or URL connection; right now jqr is set up to only allow character input. It doesn't make sense to e.g. read in JSON from a file on the R side; rather, we should have jq do that. Seems like it can with the `-f` flag, but we don't have that flag implemented.
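In the meantime one could shell out to the jq binary so that jq itself reads the file. A sketch, assuming jq is installed and a hypothetical local `big.json`:

```r
# let the jq binary read the file directly so R never sees the raw JSON;
# -c emits one record per line (ndjson)
system2("jq", args = c("-c", shQuote(".stuff[]"), "big.json"),
        stdout = "out.ndjson")
df <- jsonlite::stream_in(file("out.ndjson"))
```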
I've got the same problem: https://stackoverflow.com/q/48509869/199217 (ultimately solved by converting to netCDF, but now I have to go back to the source for QA/QC).
thanks for the bump @dlebauer
still hoping to get this sorted out soon; it seems like a pretty big use case
I just pushed this to master.
@noamross @dlebauer @cboettig can you reinstall from master and give it a shot?
Thank you! It took me a bit to figure out how to use this, but here's what I did:
```r
z <- character()
p <- jqr_new('map(.some_field)')
j <- curl::curl_fetch_stream("https://my.big.json/service", function(x) {
  z <<- paste0(z, jqr_feed(p, rawToChar(x)))
})
jqr_feed(p, '', finalize = TRUE)
output <- jsonlite::fromJSON(z)
```
The output worked as expected, but my R memory usage still spiked (the big remote JSON I tried is ~600MB, and RAM usage spiked by ~2GB). I'm not sure if I'm doing something wrong here. I did try putting a `gc()` call into the callback function. That doesn't seem to do the trick (and makes it very slow).
I'm sorry for the lack of explanation :-) You should use `base::url()` or `curl::curl()` and use that as you would with regular data. Here is an example:
```r
con <- base::url("http://jeroen.github.io/data/diamonds.json")
results <- jq(con, 'select(.price > 15000)')
df <- jsonlite::stream_in(textConnection(results))
```
This will buffer the jq output in memory. If your data is really big you may want to write it to disk first:
```r
con <- base::url("http://jeroen.github.io/data/diamonds.json")
jq(con, 'select(.price > 15000)', out = file("output.json"))
df <- jsonlite::stream_in(file("output.json"))
unlink("output.json")
```
Thanks, that is much easier! The memory usage still seems to spike to >2X the size of the whole file, though, when I'm using `out = file("output.json")`. One thing I note is that as memory usage grows, output isn't streaming to the file; everything is written to the file at once after memory has reached its peak and is dropping.
Hmmm, are you using a jq command that operates on all the data at once (such as min/max)? Try setting `out = stdout()` to see when output gets generated.
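For example, reusing the diamonds data from above:

```r
# if records print as jq processes them, output really is streaming;
# if everything appears only at the end, the filter buffers all input first
con <- base::url("http://jeroen.github.io/data/diamonds.json")
jq(con, 'select(.price > 15000)', out = stdout())
```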
@jeroen I'm not familiar with the query syntax; I've been trying to follow https://stedolan.github.io/jq/tutorial/ and am using this file: https://www.dropbox.com/s/qzzn13e77bz5k8y/2018-01-05_16-40-48_environmentlogger.json?dl=0 from which I want to extract, say, every 20th element named 'spectrum'.
But when I try this I get errors, and the errors can be reproduced in this MWE:
```r
file <- tempfile()
writeLines(jsonlite::toJSON(list(a = 1, b = list(c = 3, d = 4))), con = file)
jq(file, '.')
#> Error: Invalid numeric literal at EOF at line 1, column 30
```
(I would expect this simple query to return the entire contents of the file)
@dlebauer You need to wrap the path in `file()`, otherwise it is interpreted as literal JSON:
```r
tmp <- tempfile()
writeLines(jsonlite::toJSON(list(a = 1, b = list(c = 3, d = 4))), con = tmp)
jq(file(tmp), '.')
```
Thanks, that worked. Though the performance seems much slower than `jsonlite::fromJSON()`:
```r
> system.time(z <- jq(file(metfile), "."))
   user  system elapsed
 77.771   0.162  78.984
> system.time(z <- jsonlite::fromJSON(metfile))
   user  system elapsed
  1.238   0.067   1.395
```
Queries like `jq(file(metfile), ".spectrum[1]")` (is that the correct way to get the first element named 'spectrum'?) take even longer.
What is metfile? Note that jq is for streaming ndjson data, whereas fromJSON reads a single JSON object, so they do completely different things.
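To illustrate the difference with a hypothetical two-record dataset:

```r
# ndjson: one complete JSON value per line; jq/stream_in can consume it
# record by record
ndjson <- '{"a":1,"b":2}
{"a":3,"b":4}'

# a single JSON document: one big value that has to be parsed as a whole;
# this is what fromJSON expects
single <- '[{"a":1,"b":2},{"a":3,"b":4}]'
```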
@jeroen sorry, the metfile is the sample from a previous comment... here is the query again, this time with the download step needed to generate the metfile:

```r
metfile <- 'met.json'
download.file('https://www.dropbox.com/s/qzzn13e77bz5k8y/2018-01-05_16-40-48_environmentlogger.json',
              destfile = metfile)
jq(file(metfile), ".spectrum[1]")
```
but ... now I see that this is not ndjson, which is the source of my confusion. Sorry!
I'm using `'.map()'`, but I realized from your comment above that jq is expecting ndjson, while I'm passing a large array. On the command line, I use `--stream` to convert the single large array to ndjson before applying a filter, like so:

```sh
curl https://dl.dropboxusercontent.com/s/b6cvwndezgpyiyi/test.json | jq --stream 'fromstream(1|truncate_stream(inputs)) | .mpg'
```

I'm not sure whether using the connection in your case is supposed to be equivalent to this. If I use `'fromstream(1|truncate_stream(inputs)) | .mpg'` in `jq()` in R I get no output.
Interesting, let's see if I can pass that flag somewhere. One sec.
I added a `stream` parameter to `jq_flags()`. I think this is similar to `--stream`. Can you try this?
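Perhaps something like this (a guess at the usage, assuming the new `stream` argument and the existing `flags` parameter of `jq()`):

```r
# hypothetical usage of the new stream flag described above
jq(file(metfile), "fromstream(1|truncate_stream(inputs)) | .mpg",
   flags = jq_flags(stream = TRUE))
```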
Hmm, I think there is another problem: I am internally using `readLines()`, but your JSON doesn't have any line breaks. I'll have to look into this more closely.