
Allow `file` to be a URL or connection?

Open mrcaseb opened this issue 4 years ago • 3 comments

I have a somewhat special use case where I need to download R objects from GitHub.

Currently the workflow is readRDS(url("some_github_url")), which works because readRDS accepts a connection for its file argument. I realized that qs::qread() could speed this up heavily, as reading some example files from disk is more than 4 times faster.

However, I wasn't able to do this without downloading the file to a temp directory like in the below function

load_qs <- function(url) {
  # download to a temp file, then read from disk
  tmp <- tempfile(fileext = ".qs")
  download.file(url, tmp, quiet = TRUE)
  qs::qread(tmp)
}

I am pretty sure this isn't the most efficient solution, and I would like to ask for ideas, or even for an implementation analogous to readRDS. (I don't know if this is an awful idea, so please let me know!)

mrcaseb avatar Jan 29 '21 11:01 mrcaseb

update: I was able to speed up my function in case anyone is interested

load_qs <- function(url) qs::qdeserialize(curl::curl_fetch_memory(url)$content)

mrcaseb avatar Jan 29 '21 13:01 mrcaseb

It is a good idea, but CRAN doesn't allow using R-connections directly within C code. Glad you found a workaround!

traversc avatar Jan 29 '21 18:01 traversc

Ah dang, CRAN. Before you replied, I found what readRDS actually does; it should be the code block linked below:

https://github.com/microsoft/microsoft-r-open/blob/d72636113ede1c1d28959c8b8723c15c694957f4/source/src/main/serialize.c#L2236-L2282

I assume it's a CRAN exception for base R

mrcaseb avatar Jan 29 '21 20:01 mrcaseb

Is there any update to allow qs::qread to read URLs? Wrapping load_qs inside qs::qread would help a lot.

zecojls avatar Dec 21 '22 14:12 zecojls

@zecojls Sure, it could go into a future update; I'd just like to think about what it should look like.

Could you help me prototype this? Here are my thoughts:

I'd prefer not to have curl as a strict dependency (just to keep requirements at an absolute minimum). Is there a base-R option that's just as performant?

I'm thinking it should be a separate function such as qread_url, because qread is auto-generated by Rcpp (linking to the C++ code).

traversc avatar Dec 21 '22 18:12 traversc

I was just googling about it and found this qs_from_url function in the nflverse package. I agree that avoiding dependencies is good, but I think curl is pretty active and well-maintained.

zecojls avatar Dec 21 '22 19:12 zecojls

curl is great, but it has a system libcurl-dev requirement, which presents a challenge, e.g. if you're on a Linux workstation where you don't have admin privileges.

So I'm considering two options. One is to use curl and add it as a suggested dependency:

qread_url <- function(url, ...) {
  if (requireNamespace("curl", quietly = TRUE)) {
    qs::qdeserialize(curl::curl_fetch_memory(url)$content, ...)
  } else {
    stop("qread_url requires the curl package to be installed")
  }
}

Or some base R solution such as:

qread_url <- function(url, ...) {
  con <- url(url, open = "rb")
  on.exit(close(con))
  buffer_size <- 10000L
  data <- raw(0)
  repeat {
    # read the next chunk; an empty result means end of stream
    x <- readBin(con, what = "raw", n = buffer_size)
    if (length(x) == 0L) break
    data <- c(data, x)
  }
  qs::qdeserialize(data, ...)
}
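As a minimal sketch of the buffered-read loop above, the following runs offline by reading from a local temp file instead of a url() connection; the helper name read_all_chunks is hypothetical, and the same loop applies unchanged to any binary-mode connection:

```r
# Hypothetical helper: repeatedly read fixed-size raw chunks from a
# connection until exhausted, concatenating them into one raw vector.
read_all_chunks <- function(con, buffer_size = 10000L) {
  data <- raw(0)
  repeat {
    x <- readBin(con, what = "raw", n = buffer_size)
    if (length(x) == 0L) break     # empty read signals end of stream
    data <- c(data, x)             # note: repeated c() is O(n^2); fine for a sketch
  }
  data
}

# Demonstrate on a local file of 255 known bytes.
tmp <- tempfile()
writeBin(as.raw(1:255), tmp)
con <- file(tmp, open = "rb")
bytes <- read_all_chunks(con, buffer_size = 64L)
close(con)
print(identical(bytes, as.raw(1:255)))
```

With a buffer of 64 bytes this takes four reads (64 + 64 + 64 + 63) plus one empty read to terminate, and the reassembled vector matches the original file exactly.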

traversc avatar Dec 21 '22 20:12 traversc

Well, they are pretty much the same, I think (it depends on the internet connection). Reading a 13 MB file from Google Cloud Storage took me around 3 sec in both modes. I think sticking to base R is great, but I'm not sure how it deals with larger files that exceed the chunk size. Unfortunately, I have no idea how to iteratively download the chunks and append them.

library("qs")
library("curl")
library("tictoc")

options(timeout=240)

qread_url_curl <- function(url, ...) {
  if(!require("curl")) {
    stop("qread_url requires curl installed")
  } else {
    qs::qdeserialize(curl::curl_fetch_memory(url)$content, ...)
  }
}

qread_url_base <- function(url, ...) {
  con <- file(url, "rb", raw = TRUE)
  buffer_size <- 2^31 - 1 # limit from the readBin help page
  x <- readBin(con, what = "raw", n = buffer_size)
  close(con)
  qs::qdeserialize(x, ...)
}

target.url <- "https://storage.googleapis.com/soilspec4gg-test/test.qs"

# 2.993 sec
tic()
test1 <- qread_url_curl(target.url)
toc()

# 2.991
tic()
test2 <- qread_url_base(target.url)
toc()

zecojls avatar Dec 21 '22 21:12 zecojls

The new version on CRAN has this function.

traversc avatar Feb 27 '23 07:02 traversc