Allow `file` to be a URL or connection?
I have a somewhat special use case where I need to download R objects from GitHub. Currently the workflow is `readRDS(url("some_github_url"))`, which works because `readRDS` accepts a connection for its `file` argument. I realized that `qs::qread()` could speed this up considerably, as reading some example files from disk is more than 4 times faster. However, I wasn't able to do this without downloading the file to a temp directory, as in the function below:
load_qs <- function(url) {
  tmp <- tempfile(fileext = "qs")
  download.file(url, tmp, quiet = TRUE)
  qs::qread(tmp)
}
I am pretty sure this isn't the most efficient solution and would like to ask for ideas, or even for an implementation analogous to `readRDS`. (I don't know if this is an awful idea, so please let me know!)
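If the temp-file route is kept, a variant of the same idea that downloads in binary mode (which matters on Windows) and removes the temporary file afterwards could look roughly like this:

load_qs <- function(url) {
  tmp <- tempfile(fileext = ".qs")
  on.exit(unlink(tmp))                                  # clean up the temp file on exit
  download.file(url, tmp, mode = "wb", quiet = TRUE)    # binary mode to avoid corrupting the file on Windows
  qs::qread(tmp)
}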
Update: I was able to speed up my function, in case anyone is interested:
load_qs <- function(url) qs::qdeserialize(curl::curl_fetch_memory(url)$content)
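For context: `curl::curl_fetch_memory()` returns the response body as a raw vector in `$content`, and `qs::qdeserialize()` reads directly from a raw vector, so no temporary file is involved. Roughly, the one-liner breaks down as (hypothetical URL):

resp <- curl::curl_fetch_memory("https://example.com/some_object.qs")  # download into memory
is.raw(resp$content)                   # TRUE: the body is a raw vector
obj <- qs::qdeserialize(resp$content)  # deserialize straight from the raw bytes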
It is a good idea, but CRAN doesn't allow using R-connections directly within C code. Glad you found a workaround!
Ah dang, CRAN. Before you replied, I found out what `readRDS` actually does; it's in the code block linked below:
https://github.com/microsoft/microsoft-r-open/blob/d72636113ede1c1d28959c8b8723c15c694957f4/source/src/main/serialize.c#L2236-L2282
I assume it's a CRAN exception for base R
Is there any update to allow `qs::qread` to read URLs? Wrapping `load_qs` inside `qs::qread` would help a lot.
@zecojls Sure, it could be put in for the next update; I'd just like to think about how it should look.
Could you help me prototype this? Here are my thoughts:

I'd prefer not to have `curl` as a strict dependency (just to keep requirements at an absolute minimum). Is there a base-R option that's just as performant?

I'm also thinking it should live in a separate function such as `qread_url`, because `qread` is auto-generated by Rcpp (linking to the C++ code).
I was just googling this and found the `qs_from_url` function in the nflverse package. I agree that avoiding dependencies is good, but I think `curl` is pretty active and well maintained.
`curl` is great, but it has a system `libcurl-dev` requirement, which presents a challenge, e.g. if you're on a Linux workstation where you don't have admin privileges.
So I'm considering two options. One is to use `curl` and add it as a suggested dependency:
qread_url <- function(url, ...) {
  if (requireNamespace("curl", quietly = TRUE)) {
    qs::qdeserialize(curl::curl_fetch_memory(url)$content, ...)
  } else {
    stop("qread_url requires the curl package to be installed")
  }
}
Or some base R solution such as:
qread_url <- function(url, ...) {
  con <- url(url, open = "rb")
  buffer_size <- 10000
  data <- raw(0)
  repeat {
    x <- readBin(con, what = "raw", n = buffer_size)
    if (length(x) == 0) break    # nothing left to read
    data <- c(data, x)           # append this chunk to the buffer
  }
  close(con)
  qs::qdeserialize(data, ...)
}
Well, they are pretty much the same, I think (it depends on the internet connection). Reading a 13 MB file from Google Cloud Storage took me around 3 seconds with both approaches. I think that sticking to base R is great, but I'm not sure how it deals with larger files that exceed the chunk size. Unfortunately, I have no idea how to download the chunks iteratively and append them (one way to do this is sketched right after the benchmark below).
library("qs")
library("curl")
library("tictoc")
options(timeout=240)
qread_url_curl <- function(url, ...) {
  if (!require("curl")) {
    stop("qread_url requires curl installed")
  } else {
    qs::qdeserialize(curl::curl_fetch_memory(url)$content, ...)
  }
}
qread_url_base <- function(url, ...) {
  con <- file(url, "rb", raw = TRUE)
  buffer_size <- 2^31 - 1  # limit from readBin help
  x <- readBin(con, what = "raw", n = buffer_size)
  close(con)
  qs::qdeserialize(x)
}
target.url <- "https://storage.googleapis.com/soilspec4gg-test/test.qs"
# 2.993 sec
tic()
test1 <- qread_url_curl(target.url)
toc()
# 2.991 sec
tic()
test2 <- qread_url_base(target.url)
toc()
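On the chunk-appending question above: growing a raw vector with `c()` inside a loop copies the whole buffer on every iteration, so for larger files one option (a sketch using only base R plus `qs`) is to collect the chunks in a list and concatenate them once at the end:

qread_url_chunked <- function(url, chunk_size = 2^20, ...) {
  con <- url(url, open = "rb")                # binary connection to the remote file
  on.exit(close(con))
  chunks <- list()
  repeat {
    x <- readBin(con, what = "raw", n = chunk_size)
    if (length(x) == 0) break                 # empty read signals end of stream
    chunks[[length(chunks) + 1L]] <- x
  }
  qs::qdeserialize(do.call(c, chunks), ...)   # concatenate once, then deserialize
}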
New version on CRAN has this function.
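For reference, assuming the released function keeps the `qread_url` name discussed above, usage would be along these lines:

library(qs)
# read a .qs object directly from a URL (same test file as in the benchmark above)
test <- qread_url("https://storage.googleapis.com/soilspec4gg-test/test.qs")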