
Read from connection

Open • noamross opened this issue on Dec 21, 2016 • 32 comments

It would be nice to be able to run a jq pipe from a connection (such as a URL, file or fifo), so that one can avoid loading all the JSON into memory.

I note that jq 1.5 has additional streaming options, including the ability to output ndjson, which means that one could run jq on large JSON and pipe the result to jsonlite::stream_in.
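
A sketch of that workflow, assuming a command-line jq on the PATH and a hypothetical big.json with records under .stuff:

# let the system jq do the filtering and emit ndjson (-c prints one compact value per line),
# then stream the result into R without holding the full JSON in memory
con <- pipe("jq -c '.stuff[]' big.json")
df <- jsonlite::stream_in(con)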

noamross avatar Dec 21 '16 15:12 noamross

Not sure what the status is with jq 1.5 - is it easy enough to upgrade to that, @jeroenooms? I remember something about it being problematic.

Seems like stream_in always returns a data.frame though, yes?

sckott avatar Dec 21 '16 17:12 sckott

I have updated master to jq-1.5... not sure yet how we would go about implementing a connection filter.

jeroen avatar Dec 22 '16 13:12 jeroen

@noamross is your JSON simple enough that you could do something like

x <- '{
  "stuff": [
    {"a":1, "b":2},
    {"a":3, "b":4},
    {"a":5, "b":6},
    {"a":7, "b":8},
    {"a":9, "b":10}
  ]
}'
file <- tempfile()
writeLines(jqr::jq(x, ".stuff[]"), con = file)
jsonlite::stream_in(file(file))
#>   a  b
#> 1 1  2
#> 2 3  4
#> 3 5  6
#> 4 7  8
#> 5 9 10

sckott avatar Dec 22 '16 17:12 sckott

@sckott I think his problem is that x is too large to hold in memory...

jeroen avatar Dec 22 '16 18:12 jeroen

yeah

sckott avatar Dec 22 '16 18:12 sckott

Thanks. Yes, the problem is the JSON is biggish (~1GB), and the goal of the package is to get data into the hands of people who may not have the best computers with much memory. But ATM I'm having some success getting our data providers to switch to ndjson, so no urgency - gonna stream in the data, reduce and store locally.

noamross avatar Dec 23 '16 04:12 noamross

It would be nice to see if this can be generalized to streaming data, but that will probably require rewriting some of the internals. Not sure how @richfitz feels about that.

jeroen avatar Dec 28 '16 19:12 jeroen

I have zero objections, but am on holiday through mid-January. Feel free to try whatever; I think the C++ bits are pretty simple/minimal at present.

richfitz avatar Dec 28 '16 19:12 richfitz

OK I'll mess around and we can discuss it IRL next month when we're both super jetlagged.

jeroen avatar Dec 28 '16 19:12 jeroen

👍 for the ability for jqr to operate on an external file without having to load the whole thing into memory.

From a purely user-interface / convenience standpoint though, it would be nice for the jq function to be able to take a file path or URL as an argument, just so that the interface is consistent with things like fromJSON, the jsonld_* fns, etc.
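
For comparison, something like the second call below would mirror those interfaces (the jq call is the proposed behavior, not something that works today):

# works today:
df <- jsonlite::fromJSON("https://example.com/data.json")
# proposed, for consistency:
out <- jqr::jq("https://example.com/data.json", ".a")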

cboettig avatar Aug 10 '17 21:08 cboettig

thanks @cboettig

we are on jq 1.5 now

jqr::jq_version()
#> [1] "1.5rc2-174-g597c1f6"

so can start to play with its streaming ability and see if that helps.

sckott avatar Aug 10 '17 21:08 sckott

@jeroen any thoughts on this? Not sure how to add an option for a file or URL connection - right now jqr is set up to only allow character input.

It doesn't make sense to, e.g., read in JSON from a file on the R side - rather, jq should do that. Seems like it can with the -f flag, but we don't have that flag implemented.
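
To illustrate the limitation, the only route today is to pull the whole document into R as a string first (big.json is hypothetical):

# current workaround: read the entire file into R, which defeats the point for large files
json <- paste(readLines("big.json"), collapse = "\n")
jqr::jq(json, ".stuff[]")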

sckott avatar Aug 11 '17 23:08 sckott

I've got the same problem https://stackoverflow.com/q/48509869/199217

(Ultimately solved by converting to netcdf but now have to go back to the source for QAQC)

dlebauer avatar Jan 31 '18 06:01 dlebauer

thanks for the bump @dlebauer

still hoping to get this sorted out soon, seems like a pretty big use case

sckott avatar Jan 31 '18 17:01 sckott

I just pushed this to master.

jeroen avatar Mar 07 '18 22:03 jeroen

@noamross @dlebauer @cboettig can you reinstall from master and give it a shot?

sckott avatar Mar 08 '18 03:03 sckott

Thank you! It took me a bit to figure out how to use this, but here's what I did:

z <- character()
p <- jqr_new('map(.some_field)')
j <- curl::curl_fetch_stream("https://my.big.json/service", function(x) {
  z <<- paste0(z, jqr_feed(p, rawToChar(x)))
})
jqr_feed(p, '', finalize = TRUE)
output <- jsonlite::fromJSON(z)

The output worked as expected, but my R memory usage still spiked (the big remote JSON I tried is ~600MB, and RAM usage spiked by ~2GB). I'm not sure if I'm doing something wrong here. I did try putting a gc() call into the callback function, but that doesn't seem to do the trick (and it makes things very slow).
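
One pattern that might help (untested; it mirrors the calls above but writes each chunk to disk instead of growing z, since paste0 copies the whole accumulated string on every callback):

out <- file("output.json", open = "w")
p <- jqr_new('map(.some_field)')
curl::curl_fetch_stream("https://my.big.json/service", function(x) {
  writeLines(jqr_feed(p, rawToChar(x)), out)
})
writeLines(jqr_feed(p, '', finalize = TRUE), out)
close(out)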

noamross avatar Mar 08 '18 20:03 noamross

I'm sorry for the lack of explanation :-) You should use base::url() or curl::curl() and use that as you would with regular data. Here is an example:

con <- base::url("http://jeroen.github.io/data/diamonds.json")
results <- jq(con, 'select(.price > 15000)')
df <- jsonlite::stream_in(textConnection(results))

This will buffer the jq output in memory. If your data is really big you may want to write it to disk first:

con <- base::url("http://jeroen.github.io/data/diamonds.json")
jq(con, 'select(.price > 15000)', out = file("output.json"))
df <- jsonlite::stream_in(file("output.json"))
unlink("output.json")

jeroen avatar Mar 08 '18 20:03 jeroen

Thanks, that is much easier! The memory usage still seems to spike to >2X the size of the whole file, though, when I'm using out = file("output.json"). One thing I notice is that as memory usage grows, output isn't streaming to the file; everything is written at once after memory has reached its peak and started dropping.

noamross avatar Mar 08 '18 20:03 noamross

Hmmm are you using a jq command that operates on all data at once (such as min/max)? Try setting out = stdout() to see when output gets generated.
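
For example, reusing the diamonds example from above:

con <- base::url("http://jeroen.github.io/data/diamonds.json")
jq(con, 'select(.price > 15000)', out = stdout())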

jeroen avatar Mar 08 '18 20:03 jeroen

@jeroen I'm not familiar with the query syntax - I've been trying to follow https://stedolan.github.io/jq/tutorial/ and am using this file: https://www.dropbox.com/s/qzzn13e77bz5k8y/2018-01-05_16-40-48_environmentlogger.json?dl=0 from which I want to extract, say, every 20th element named 'spectrum'.

But when I try this I get errors, and the errors can be reproduced in this MWE:

file <- tempfile()
writeLines(jsonlite::toJSON(list(a=1, b = list(c=3, d=4))), con = file)
jq(file, '.')

Error: Invalid numeric literal at EOF at line 1, column 30

(I would expect this simple query to return the entire contents of the file)

dlebauer avatar Mar 08 '18 21:03 dlebauer

@dlebauer You need to wrap the path in a file(), otherwise it is interpreted as literal JSON:

tmp <- tempfile()
writeLines(jsonlite::toJSON(list(a=1, b = list(c=3, d=4))), con = tmp)
jq(file(tmp), '.')

jeroen avatar Mar 08 '18 21:03 jeroen

Thanks - that worked. Though the performance seems much slower than jsonlite::fromJSON()

> system.time(z <- jq(file(metfile), "."))
   user  system elapsed 
 77.771   0.162  78.984 
> system.time(z <- jsonlite::fromJSON(metfile))
   user  system elapsed 
  1.238   0.067   1.395 

Queries like jq(file(metfile), ".spectrum[1]") (is that the correct way to get the first element named 'spectrum'?) take even longer.

dlebauer avatar Mar 08 '18 22:03 dlebauer

What is metfile? Note that jq is for streaming ndjson data, whereas fromJSON reads a single JSON object, so they do completely different things.
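
For illustration: ndjson is one complete JSON value per line,

{"price": 326, "carat": 0.23}
{"price": 327, "carat": 0.21}

whereas a plain JSON file is one single document:

[{"price": 326, "carat": 0.23}, {"price": 327, "carat": 0.21}]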

jeroen avatar Mar 08 '18 22:03 jeroen

@jeroen sorry, the metfile is the sample from a previous comment ... here is the failing query again, now with the previously missing download step required to generate metfile:

metfile <- 'met.json'
download.file('https://www.dropbox.com/s/qzzn13e77bz5k8y/2018-01-05_16-40-48_environmentlogger.json',
              destfile = metfile)
jq(file(metfile), ".spectrum[1]")

dlebauer avatar Mar 08 '18 22:03 dlebauer

but ... now I see that this is not ndjson, which is the source of my confusion. Sorry!

dlebauer avatar Mar 08 '18 22:03 dlebauer

I'm using 'map()', but I realized from your comment above that jq is expecting ndjson, while I'm passing a large array. On the command line, I use --stream to convert the single large array to ndjson before applying a filter, like so:

curl https://dl.dropboxusercontent.com/s/b6cvwndezgpyiyi/test.json | jq --stream 'fromstream(1|truncate_stream(inputs)) | .mpg'

I'm not sure if using the connection in your case is supposed to be equivalent to this. If I use 'fromstream(1|truncate_stream(inputs)) | .mpg' in jq() in R, I get no output.

noamross avatar Mar 09 '18 03:03 noamross

Interesting, let's see if I can pass that flag somewhere. One sec.

jeroen avatar Mar 09 '18 03:03 jeroen

I added a stream parameter to jq_flags(). I think this is similar to --stream. Can you try this?
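
Presumably the invocation would look something like this (a guess; assuming jq() takes a flags argument built by jq_flags()):

jq(file("met.json"), 'fromstream(1|truncate_stream(inputs))', flags = jq_flags(stream = TRUE))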

jeroen avatar Mar 09 '18 04:03 jeroen

Hmm, I think there is another problem: I am internally using readLines(), but your JSON doesn't have any linebreaks. I'll have to look into this more closely.
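
For reference, a chunked read that doesn't depend on linebreaks might look like this (just a sketch of the general pattern, not jqr's actual internals):

con <- file("met.json", open = "rb")
repeat {
  chunk <- readChar(con, 65536L, useBytes = TRUE)
  if (length(chunk) == 0 || nchar(chunk) == 0) break
  # hand the chunk to an incremental parser here, e.g. jqr_feed(p, chunk)
}
close(con)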

jeroen avatar Mar 09 '18 04:03 jeroen