duckdb-r
duckdb-r copied to clipboard
Interop with spatial extension
From https://github.com/duckdb/duckdb-r/issues/55#issuecomment-2012742409:
Perhaps not a complete summary, but:
- Right now when a database result contains a column that comes from the DuckDB spatial extension, it shows up in Arrow output as a
list(raw())
where each element contains an opaque internal binary representation. This often leads to confusion because the format can't be read anywhere except DuckDB. I think the same is true of a database result that does not go through Arrow (i.e., R/DBI) but I haven't checked. - For a database result accessed via R/DBI, I imagine that you might have enough information available to you at "convert to R" time to at least give it a class. That class could implement
st::as_sfc()
and give an error along the lines of "usest_as_ewkb()
beforecollect()
, which is the workaround. - For a database result accessed via Arrow, the solution is complex because DuckDB does not currently have a way for an extension type to customize its arrow output ( https://github.com/duckdb/duckdb_spatial/issues/153 ).
CC @paleolimbot @eitsupi @cboettig .
Assuming my problems are related to this issue (and are not distinct):
The examples below show some attempts at reading/writing from a spatial enabled duckdb. I can get data in through sql but not dbWriteTable
or sf::st_write
. Nor have I found a way to translate what duckdb returns for the geom column into something R (sf
or wk
can understand.
library(duckdb)
#> Warning: package 'duckdb' was built under R version 4.3.3
#> Loading required package: DBI
library('sf')
#> Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
library('glue')
# Init DB
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "install spatial; load spatial;")
#> [1] 0
# USE ST read
dbExecute(con, glue("create table spat as select *
from ST_Read('{system.file('shape/nc.shp', package='sf')}')"))
#> [1] 99
# Try to get the stuff
a1 = dbGetQuery(con, 'select * from spat')
# I've not found a way to convert format of geom into something usable
a2 = st_read(con, 'spat') #breaks on the binary
#> Warning in st_read.DBIObject(con, "spat"): Could not find a simple features
#> geometry column. Will return a `data.frame`.
# Try to write it
# Gets errors
nc = sf::read_sf(system.file("shape/nc.shp", package="sf"))
dbWriteTable(con, name = 'nc', nc)
#> Note: method with signature 'DBIObject#sf' chosen for function 'dbDataType',
#> target signature 'duckdb_connection#sf'.
#> "duckdb_connection#ANY" would also be valid
#> Error: rapi_execute: Failed to run query
#> Error: Conversion Error: ParseException: Unknown type: '0106000020AB100000010000000103000000010000001B000000000000A0415E54C000000060FF1D4240000000209D6254C000000080E122424000000080F76354C0000000200523424000000020846854C0000000A09B2B4240000000C06D6F54C00000000026324240000000A0B06C54C000000040633C4240000000A0FA6C54C0000000C07942424000000040E16A54C0000000A0794B424000000060195654C0000000A053494240000000203E5654C000000060DA44424000000020C95454C000000040C0414240000000800D5454C000000080873D4240000000000A5154C000000060F637424000000060D25054C000000060D833424000000080674F54C0000000C090304240000000605A4F54C000000040C42E424000000060E95054C0000000E01B2D4240000000400E5554C000000040872E4240000000C0205754C000000060342D424000000080675754C000000000662B424000000020AA5654C0000000205D26424000000060845754C000000060AC23424000000040025A54C0000000A07C244240000000A0635A54C0000000A03622424000000020965B54C0000000405F21424000000020FC5C54C0000000C0AA1E4240000000A0415E54C000000060FF1D4240'
sf::st_write(nc, dsn = con, layer = 'pts')
#> Error: rapi_execute: Failed to run query
#> Error: Conversion Error: ParseException: Unknown type: '0106000020AB100000010000000103000000010000001B000000000000A0415E54C000000060FF1D4240000000209D6254C000000080E122424000000080F76354C0000000200523424000000020846854C0000000A09B2B4240000000C06D6F54C00000000026324240000000A0B06C54C000000040633C4240000000A0FA6C54C0000000C07942424000000040E16A54C0000000A0794B424000000060195654C0000000A053494240000000203E5654C000000060DA44424000000020C95454C000000040C0414240000000800D5454C000000080873D4240000000000A5154C000000060F637424000000060D25054C000000060D833424000000080674F54C0000000C090304240000000605A4F54C000000040C42E424000000060E95054C0000000E01B2D4240000000400E5554C000000040872E4240000000C0205754C000000060342D424000000080675754C000000000662B424000000020AA5654C0000000205D26424000000060845754C000000060AC23424000000040025A54C0000000A07C244240000000A0635A54C0000000A03622424000000020965B54C0000000405F21424000000020FC5C54C0000000C0AA1E4240000000A0415E54C000000060FF1D4240'
dbWriteTable(con, 'nc', nc, append = T, field.types = c('id' = 'INTEGER', 'geometry' = 'GEOMETRY'))
#> Error: rapi_execute: Failed to run query
#> Error: Conversion Error: ParseException: Unknown type: '0106000020AB100000010000000103000000010000001B000000000000A0415E54C000000060FF1D4240000000209D6254C000000080E122424000000080F76354C0000000200523424000000020846854C0000000A09B2B4240000000C06D6F54C00000000026324240000000A0B06C54C000000040633C4240000000A0FA6C54C0000000C07942424000000040E16A54C0000000A0794B424000000060195654C0000000A053494240000000203E5654C000000060DA44424000000020C95454C000000040C0414240000000800D5454C000000080873D4240000000000A5154C000000060F637424000000060D25054C000000060D833424000000080674F54C0000000C090304240000000605A4F54C000000040C42E424000000060E95054C0000000E01B2D4240000000400E5554C000000040872E4240000000C0205754C000000060342D424000000080675754C000000000662B424000000020AA5654C0000000205D26424000000060845754C000000060AC23424000000040025A54C0000000A07C244240000000A0635A54C0000000A03622424000000020965B54C0000000405F21424000000020FC5C54C0000000C0AA1E4240000000A0415E54C000000060FF1D4240'
Created on 2024-03-28 with reprex v2.1.0
@dcaseykc You are really close, just missing two things:
First, you must remember that duckdb spatial has it's own native encoding of a geometry column that is not WKB. So it is up to you to manually call st_AsWKB
to turn the geom column into WKB so other tools like sf
can read it. Second, you should tell sf
where to find the geometry column. Try this:
library(duckdb)
library(sf)
library(glue)
library(dplyr)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "install spatial; load spatial;")
dbExecute(con, glue("create table spat as select *
from ST_Read('{system.file('shape/nc.shp', package='sf')}')"))
a1 = dbGetQuery(con, 'select * from spat')
q <- tbl(con, "spat") |> mutate(geometry = st_aswkb(geom)) |> dbplyr::sql_render()
a2 = st_read(con, query = q, geometry_column = "geometry")
However, I find all that really verbose and hard to read. duckdbfs
is a small wrapper around this syntax with helper functions. The above is the same as:
library(duckdbfs)
nc <- open_dataset("https://github.com/r-spatial/sf/raw/main/inst/shape/nc.shp")
nc |> to_sf()
HTH
@cboettig thanks for your response to the original post. It is most helpful. Can you please give an example of how open_dataset can be used to open a spatial dataset from duckdb in memory. open_dateset requires a "sources" parameter.
Based on your feedback above the following successfully returns an sf object from the in memory database.
duckDBMemory <- duckdb::dbConnect(duckdb())
query <-
tbl(duckDBMemory, "cadastre_tbl") |>
mutate(geometry = st_aswkb(geom)) |>
dbplyr::sql_render()
result <-
st_read(
duckDBMemory,
query = query,
geometry_column = "geometry")
My expectation is that the following should do the same:
result <- open_dataset(
conn = duckDBMemory,
tblname = "cadastre_tbl"
)
However, this returns Error in
open_dataset(): ! argument "sources" is missing, with no default
Your advice will be appreciated.
@jacodevries Not entirely sure I follow your question. Your example doesn't run -- it refers to a tbl(duckDBMemory, "cadastre_tbl")
, but this table doesn't yet exist -- your code has provided no indication where this table is coming from, and yet you are creating an new in-memory connection on the line before, so it can't already be in memory on, say, a on-disk table. This spatial data has to come from somewhere, right? in duckdbfs
that is the sources
argument, as in the examples above (a shapefile, csv, geojson, parquet, etc etc).
If you already have a connection to a duckdb dataset that already has a the desired table materialized in it, you can just do tbl(duckDBMemory, "cadastre_tbl")
to establish a connection to it, as normal. Note that duckbfs
functions should be able to work with this table, e.g.
library(duckdb)
library(dplyr)
library(duckdbfs)
library(sf)
duckDBMemory <- duckdb::dbConnect(duckdb())
ex = open_dataset("https://github.com/r-spatial/sf/raw/main/inst/shape/nc.shp",
tblname = "cadastre_tbl", conn = duckDBMemory)
# If the table already exists, just use `tbl` like always!
also_ex <- tbl(duckDBMemory, "cadastre_tbl")
# this works just fine with other duckdbfs functions, like to_sf() shown above:
also_ex <- tbl(duckDBMemory, "cadastre_tbl")
# Note that these are in fact identical
identical(ex, also_ex)
so we can use open_dataset()
to read from a serialized file, or we can use tbl()
as always to connect to an existing table or view already imported to duckdb. Does that make sense?
@cboettig My apology - I realised that I left out a few critical details - let me rephrase my question:
-
I have been struggling with the same issue, in that duckdb returns the geometries of spatial data in raw format. i.e. using
also_ex <- tbl(duckDBMemory, "cadastre_tbl")
returns the data, but the geometry is in raw format. -
Your post solved the problem. Using the following suggestion which you posted, the spatial dataset is returned in an sf dataframe.
library(duckdb)
library(sf)
library(glue)
library(dplyr)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "install spatial; load spatial;")
dbExecute(con, glue("create table spat as select *
from ST_Read('{system.file('shape/nc.shp', package='sf')}')"))
a1 = dbGetQuery(con, 'select * from spat')
q <- tbl(con, "spat") |> mutate(geometry = st_aswkb(geom)) |> dbplyr::sql_render()
a2 = st_read(con, query = q, geometry_column = "geometry")
- However, I noted your comment that the above is verbose, and that the following
duckdbfs
helper functions can achieved the same.
library(duckdbfs)
nc <- open_dataset("https://github.com/r-spatial/sf/raw/main/inst/shape/nc.shp")
nc |> to_sf()
I've therefore tried to figure out how to use your suggested wrapper function to read the duckdb data directly in order to make the code less verbose.
Therefore using your suggested code snippets:
The spatial database is created using:
library(duckdb)
library(sf)
library(glue)
library(dplyr)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "install spatial; load spatial;")
dbExecute(con, glue("create table spat as select *
from ST_Read('{system.file('shape/nc.shp', package='sf')}')"))
Then using the following code snippet, the spatial data is returned, but in raw format:
a1 = dbGetQuery(con, 'select * from spat')
The following verbose snippet returns the data in sf format:
q <- tbl(con, "spat") |> mutate(geometry = st_aswkb(geom)) |> dbplyr::sql_render()
a2 = st_read(con, query = q, geometry_column = "geometry")
My question: how can I use the duckdbfs
helper function to make the request less verbose, but return the data in sf format:
library(duckdbfs)
nc <- open_dataset(conn = con, tblname = "spat")
However, this returns Error in open_dataset(): ! argument "sources" is missing, with no default
Is it possible to use open_dataset()
to read from a duckdb database in memory - what should the value of the sources
parameter be?
@jacodevries yes, I think I understand what you are asking, but please use the function dplyr::tbl()
when you have a connection and the name of an existing table. I do not understand why you are trying to use open_dataset()
on that. Specifically, in your code above, just run:
library(dplyr); library(duckdbfs)
tbl(con, "spat") |> to_sf()
That is the same as the more verbose code you show:
q <- tbl(con, "spat") |> mutate(geometry = st_aswkb(geom)) |> dbplyr::sql_render()
a2 = st_read(con, query = q, geometry_column = "geometry")
these two blocks of code are essentially identical.
@cboettig Thanks for your response to my query.
My attempt to use open_dataset is because of a misunderstanding on my part of when to use of the following code chunk:
library(duckdbfs)
nc <- open_dataset("https://github.com/r-spatial/sf/raw/main/inst/shape/nc.shp")
nc |> to_sf()
Thanks very much for your assistance.