
duckdbfs - duckplyr comparing notes

Open cboettig opened this issue 1 year ago • 7 comments

Howdy friends! Just saw this (from the Posit Conf schedule!), looks amazing (though still wrapping my head around scope etc).

I've been playing around with some possibly similar ideas in a very small wrapper package, duckdbfs, because I didn't know about the efforts here. If it makes sense, I'd be happy to merge features into here instead and archive duckdbfs. Alternatively, I'd welcome your feedback on duckdbfs.

My core goal with duckdbfs was to have open_dataset() / write_dataset() functions that operate like they do in arrow (i.e. supporting local and S3 URIs), while also supporting arbitrary https URLs. (Yes, I know we can do things like arrow::open_dataset() |> to_duckdb(), but that doesn't support https URLs and adds the overhead of the arrow parser, which we found could be substantially slower than duckdb's native httpfs mechanism.)
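For anyone comparing, here is a minimal sketch of the arrow route mentioned above. It uses a local copy of mtcars written to a temporary directory, so the data and path are just stand-ins for illustration, and it assumes arrow, duckdb, dplyr and dbplyr are installed.

```r
library(dplyr)

# Hypothetical local dataset: write mtcars out as parquet files in a temp dir
path <- file.path(tempdir(), "mtcars_parquet")
arrow::write_dataset(mtcars, path)

# arrow scans the files, then hands the data to duckdb as a lazy table
four_cyl <- arrow::open_dataset(path) |>
  arrow::to_duckdb() |>
  filter(cyl == 4) |>
  collect()
```

The filter runs in duckdb, but the file scanning is still done by arrow, which is the overhead described above.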

e.g. S3 access, with necessary config (as per #39):

# Public bucket: anonymous access, but the region still has to be set
parquet <- "s3://gbif-open-data-us-east-1/occurrence/2023-06-01/occurrence.parquet"
gbif <- duckdbfs::open_dataset(parquet, anonymous = TRUE, s3_region = "us-east-1")

https URIs work the same way, of course. duckdbfs handles installing the httpfs extension when necessary. (Yes, it's tragic that the httpfs extension still doesn't work on Windows, owing to how duckdb is building those binaries!) duckdbfs seeks to make the spatial extension immediately visible to R users in the same way, e.g.

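library(dplyr)

# Open a remote CSV over https as a lazy duckdb table
spatial_ex <- paste0("https://raw.githubusercontent.com/cboettig/duckdbfs/",
                     "main/inst/extdata/spatial-test.csv") |>
  duckdbfs::open_dataset(format = "csv")

# ST_Point() and ST_Distance() are evaluated by duckdb's spatial extension, not by R
spatial_ex |>
  mutate(geometry = ST_Point(longitude, latitude)) |>
  mutate(dist = ST_Distance(geometry, ST_Point(0, 0))) |>
  duckdbfs::to_sf()  # namespaced, since only dplyr is attached here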

Note we use dplyr / dbplyr to do lazy spatial ops, and parse the result into R as an sf object.
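
To see why this stays lazy, here is a rough sketch that renders the SQL dbplyr would send, using a simulated table rather than a real connection. The `pts` table and its values are made up for illustration, and the exact SQL text will depend on the dbplyr version and backend.

```r
library(dplyr)

# Simulated remote table: no database connection is needed to inspect the SQL
pts <- dbplyr::lazy_frame(longitude = 1, latitude = 2)

query <- pts |>
  mutate(geometry = ST_Point(longitude, latitude)) |>
  mutate(dist = ST_Distance(geometry, ST_Point(0, 0)))

# dbplyr has no translation for ST_Point / ST_Distance, so the calls are
# passed through to the database untouched
sql <- dbplyr::sql_render(query)
print(sql)
```

Nothing is computed in R until the result is collected, which is the point where to_sf() comes in above.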

cboettig avatar Sep 19 '23 17:09 cboettig