duckdbfs - duckplyr comparing notes
Howdy friends! Just saw this (from the Posit Conf schedule!), looks amazing (though still wrapping my head around scope etc).
I've been playing around with some possibly similar ideas in a very small wrapper package, duckdbfs, because I didn't know about the efforts here. If it makes sense, I'd be happy to merge features into duckplyr instead and archive duckdbfs. Alternatively, I'd welcome your feedback on duckdbfs.
My core goal with duckdbfs was to have open_dataset() / write_dataset() functions that operate like they do in arrow (i.e. supporting local and S3 URIs), while also supporting arbitrary https URLs. (Yes, I know we can do things like arrow::open_dataset() |> to_duckdb(), but that doesn't support https URLs and adds the overhead of the arrow parser, which we found could be substantially slower than the native duckdb httpfs mechanism.)
e.g. S3 access, with necessary config (as per #39):
parquet <- "s3://gbif-open-data-us-east-1/occurrence/2023-06-01/occurrence.parquet"
gbif <- duckdbfs::open_dataset(parquet, anonymous = TRUE, s3_region="us-east-1")
https URIs work the same way of course. duckdbfs handles installing the httpfs extension when necessary. (Yes, it's tragic that the httpfs extension still doesn't work on Windows owing to how duckdb is building those binaries!) duckdbfs seeks to make the spatial extension immediately visible to R users in the same way, e.g.
library(dplyr)
spatial_ex <- paste0("https://raw.githubusercontent.com/cboettig/duckdbfs/",
                     "main/inst/extdata/spatial-test.csv") |>
  duckdbfs::open_dataset(format = "csv")
spatial_ex |>
  mutate(geometry = ST_Point(longitude, latitude)) |>
  mutate(dist = ST_Distance(geometry, ST_Point(0,0))) |>
  duckdbfs::to_sf()
Note we use dplyr / dbplyr to do lazy spatial ops, and parse the result into R as an sf object.
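For anyone who wants to try the open_dataset() / write_dataset() pairing without S3 or network access, here is a minimal local sketch. It assumes duckdbfs and dplyr are installed; the temp-file path and the use of the built-in mtcars data are just illustrative choices, not anything specific to the package.

```r
library(dplyr)

# Illustrative path only: a parquet file in the session's temp directory
path <- file.path(tempdir(), "mtcars.parquet")

# Write a plain data.frame out as parquet via duckdb
duckdbfs::write_dataset(mtcars, path)

# Re-open it lazily; nothing is read into R until collect()
cars <- duckdbfs::open_dataset(path)

six <- cars |>
  filter(cyl == 6) |>
  collect()
```

The same calls should accept an s3:// URI (plus the config shown earlier) in place of the local path, which is the main point of mirroring the arrow function names.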