Explore board based on arrow's S3 support
https://arrow.apache.org/docs/r/articles/fs.html#file-systems-that-emulate-s3
Via @GShotwell, this much is already possible:
library(pins)
board <- board_connect(server = "https://colorado.posit.co/rsc/",
                       account = "[email protected]",
                       key = Sys.getenv("COLORADO_KEY"))
pin(mtcars, board = board)
library(duckdb)
library(DBI)
con <- DBI::dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs")  # fetch the httpfs extension by name
dbExecute(con, "LOAD httpfs")     # load it so https:// paths can be queried
dbGetQuery(con, "SELECT mpg FROM 'https://colorado.posit.co/rsc/content/519521d1-a6a1-45e6-a5ec-01046686f85f/data.csv'")
This is what Hugging Face does for their flat files. The way they do it is:
- Convert everything to parquet
- Shard files at 500MB
I think this would be a very good Connect feature because it really reduces the memory footprint of Connect assets without sacrificing much speed.
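The two steps listed above (convert to Parquet, then shard) could be sketched in R with arrow roughly as follows. This is only an illustration, not how Hugging Face or Connect do it: `out_dir` is a hypothetical location, and `write_dataset()` splits by row count rather than by file size, so the 10-row cap is a stand-in for a real size limit.

```r
library(arrow)

# Hypothetical output directory; a real board would choose its own location.
out_dir <- file.path(tempdir(), "mtcars_parquet")

# Write the data frame as Parquet, capping the rows per file so that the
# table is split into several shards. The cap of 10 rows is illustrative only.
write_dataset(
  mtcars,
  out_dir,
  format = "parquet",
  max_rows_per_file = 10L,
  max_rows_per_group = 10L  # keep row groups no larger than a file
)

# The resulting shard files
shards <- list.files(out_dir, pattern = "\\.parquet$", full.names = TRUE)
```

The shards can then be read back as one table with `open_dataset(out_dir)`, so consumers do not need to know how many files there are.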
Isn't the example above working only because that file is publicly readable? DuckDB needs some kind of R filesystem abstraction it can use to authenticate: either an arrow filesystem, something similar to fsspec in Python, or DuckDB's own httpfs for non-Connect cases.
I'm guessing you can use httpfs right now, but it won't support Connect, since Connect is not S3-compatible (only S3, GCS, etc.).
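One possible workaround for the authentication question, sketched under assumptions: let pins do the authenticated download and hand DuckDB the cached local file instead of a URL. Here `board_temp()` stands in for an authenticated `board_connect()` so the example runs offline, and the pin name `"mtcars"` is made up for illustration. This sidesteps httpfs entirely, so it does not give DuckDB remote range reads.

```r
library(pins)
library(DBI)
library(duckdb)

# Stand-in for board_connect(...); a temporary board needs no credentials.
board <- board_temp()
pin_write(board, mtcars, name = "mtcars", type = "csv")

# pins takes care of auth and caching, and returns the local file path(s).
path <- pin_download(board, "mtcars")

# DuckDB queries the local copy directly, with no httpfs involved.
con <- dbConnect(duckdb())
res <- dbGetQuery(con, sprintf("SELECT mpg FROM read_csv_auto('%s')", path))
dbDisconnect(con, shutdown = TRUE)
```

The trade-off is that the whole file is downloaded before querying, which is fine for small pins but loses the memory savings that motivated reading remotely.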