
Explore board based on arrow's S3 support

Open hadley opened this issue 4 years ago • 3 comments

https://arrow.apache.org/docs/r/articles/fs.html#file-systems-that-emulate-s3

hadley avatar Oct 06 '21 20:10 hadley

Via @GShotwell, this much is already possible:

library(pins)

board <- board_connect(server = "https://colorado.posit.co/rsc/",
                         account = "[email protected]",
                         key = Sys.getenv("COLORADO_KEY"))

pin(mtcars, board = board)

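library(duckdb)
library(DBI)

# In-memory DuckDB connection
con <- DBI::dbConnect(duckdb())

# httpfs lets DuckDB read files over HTTP(S)
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")

# Query the pinned CSV directly by URL
dbGetQuery(con, "SELECT mpg FROM 'https://colorado.posit.co/rsc/content/519521d1-a6a1-45e6-a5ec-01046686f85f/data.csv'")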

juliasilge avatar Jun 08 '23 16:06 juliasilge

This is what Hugging Face does for their flat files. The way they do it is:

  • Convert everything to parquet
  • Shard files at 500 MB

I think this would be a very good Connect feature because it really reduces the memory footprint of Connect assets without sacrificing much speed.
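
As a rough illustration of the convert-and-shard idea (not Hugging Face's actual pipeline), the sketch below uses DuckDB to write `mtcars` as several small Parquet files and then reads a single column back across all of them. The shard size of 10 rows, the output directory, and the file names are arbitrary assumptions for the demo; real shards would be cut by file size rather than row count.

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# Hypothetical output location and shard size, chosen only for this demo
shard_dir <- file.path(normalizePath(tempdir(), winslash = "/"), "mtcars_shards")
dir.create(shard_dir, showWarnings = FALSE)
rows_per_shard <- 10

# Assign each row to a shard, then write one Parquet file per shard
shard_id <- ceiling(seq_len(nrow(mtcars)) / rows_per_shard)
for (id in unique(shard_id)) {
  dbWriteTable(con, "chunk", mtcars[shard_id == id, ], overwrite = TRUE)
  out <- file.path(shard_dir, sprintf("part-%03d.parquet", id))
  dbExecute(con, sprintf("COPY chunk TO '%s' (FORMAT PARQUET)", out))
}

# Query one column across all shards with a glob pattern
shards <- file.path(shard_dir, "*.parquet")
res <- dbGetQuery(con, sprintf("SELECT mpg FROM read_parquet('%s')", shards))

dbDisconnect(con, shutdown = TRUE)
```

Because Parquet is columnar, a query like this only needs the `mpg` column from each shard, which is where the memory savings would come from.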

gshotwell avatar Jun 08 '23 16:06 gshotwell

Doesn't the example above work only because that file is publicly readable? There needs to be some kind of R filesystem abstraction that DuckDB can use to authenticate (either arrow's filesystem, something similar to fsspec in Python, or DuckDB's httpfs for non-Connect cases).

I'm guessing you can use httpfs right now, but it won't support Connect, since Connect is not S3-compatible (httpfs only handles S3, GCS, etc.).
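
Until such a filesystem abstraction exists, one possible workaround is to do the authenticated download in R and point DuckDB at the local copy. The sketch below assumes the server accepts an API key in an `Authorization: Key ...` header; `query_private_csv()` is a hypothetical helper, not part of pins or duckdb, and it gives up remote partial reads because the whole file is fetched first.

```r
library(DBI)
library(duckdb)

# Hypothetical helper: download a CSV with an API key, then query it locally.
# `sql` must contain one %s placeholder where the local file path is substituted.
query_private_csv <- function(url, key, sql) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)

  # Assumption: the server authenticates via an "Authorization: Key <key>" header
  download.file(url, tmp, mode = "wb", quiet = TRUE,
                headers = c(Authorization = paste("Key", key)))

  con <- dbConnect(duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE), add = TRUE)
  dbGetQuery(con, sprintf(sql, normalizePath(tmp, winslash = "/")))
}

# Example call (not run here; needs a valid key and network access):
# query_private_csv(
#   "https://colorado.posit.co/rsc/content/519521d1-a6a1-45e6-a5ec-01046686f85f/data.csv",
#   Sys.getenv("COLORADO_KEY"),
#   "SELECT mpg FROM read_csv_auto('%s')"
# )
```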

machow avatar Jun 08 '23 17:06 machow