terra icon indicating copy to clipboard operation
terra copied to clipboard

New Feature: Parquet vector support

Open peetmate opened this issue 1 year ago • 9 comments

Terra support for reading in .parquet vector files would be most welcome. Currently I am using sfarrow package to read into R then converting to spatvect.

peetmate avatar Nov 16 '23 12:11 peetmate

@paleolimbot, do you have any tips on how to load a .parquet file with geometry? I once tried to load using {arrow}, but I couldn't find a way to convert from arrow binary to WKT.

kadyb avatar Nov 16 '23 23:11 kadyb

The code below works without {sfarrow}, but requires {jsonlite} to parse JSON with metadata and {wk} to convert WKB to WKT.

library("wk")
library("terra")
library("arrow")
library("jsonlite")

# sample data
url = "https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet"
tmp = tempfile(fileext = ".parquet")
download.file(url, tmp, mode = "wb")

x = arrow::read_parquet(tmp, as_data_frame = FALSE)
metadata = jsonlite::fromJSON(x$metadata$geo)
crs = paste(metadata$columns$geometry$crs$id, collapse = ":")
x = as.data.frame(x)
geom = vector("list", length = length(x$geometry))
for (i in seq_len(length(x$geometry))) geom[[i]] = x$geometry[[i]]
wkt = as.character(wk::as_wkt(wk::wkb(geom)))
v = vect(wkt, crs = crs)
idx = which(colnames(x) == "geometry")
v = cbind(v, x[, -idx])
v
# class       : SpatVector 
# geometry    : polygons 
# dimensions  : 5, 5  (geometries, attributes)
# extent      : -180, 180, -18.28799, 83.23324  (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84 
# names       :   pop_est continent      name iso_a3 gdp_md_est
# type        :     <num>     <chr>     <chr>  <chr>      <int>
# values      :   8.9e+05   Oceania      Fiji    FJI       5496
#               5.801e+07    Africa  Tanzania    TZA      63177

kadyb avatar Nov 18 '23 14:11 kadyb

Sorry for the delay here!

I imagine this will only have reasonable levels of performance if a SpatVect can be created from something with a WKB column (basically what you have above except without the WKT conversion).

The rewrite of the geoarrow package (at https://github.com/geoarrow/geoarrow-c ) will have a GeoParquet reader as well and should make it more straightforward to create geometry directly from ArrowArrays (i.e., skip the intermediary list() of raw(), which is sometimes a bottleneck for performance). That probably won't be released to CRAN until early next year, though.

paleolimbot avatar Nov 23 '23 16:11 paleolimbot

doesn't it already work? I use vect() routinely to read from non-geom Parquet urls and pretty sure also with geometry.

Seems fine, maybe it's about the GDAL version you have available. Parquet is since version 3.5, and so you might be slammed - even ubuntu 22.04 is still GDAL 3.4.

library(terra)
 vect("/vsicurl/https://github.com/OSGeo/gdal/raw/master/autotest/ogr/data/parquet/poly_wkb_large_binary.parquet")
 class       : SpatVector 
 geometry    : polygons 
 dimensions  : 10, 3  (geometries, attributes)
 extent      : 478315.5, 481645.3, 4762880, 4765610  (xmin, xmax, ymin, ymax)
 source      : poly_wkb_large_binary.parquet
 coord. ref. : OSGB36 / British National Grid (EPSG:27700) 
 names       :      AREA EAS_ID  PRFEDEA
 type        :     <num>  <int>    <chr>
 values      : 2.152e+05    168 35043411
               2.473e+05    179 35043423
               2.618e+05    171 35043414
vect("/vsicurl/https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet")

class       : SpatVector 
 geometry    : polygons 
 dimensions  : 5, 5  (geometries, attributes)
 extent      : -180, 180, -18.28799, 83.23324  (xmin, xmax, ymin, ymax)
 source      : example.parquet
 coord. ref. : lon/lat WGS 84 (CRS84) (OGC:CRS84) 
 names       :   pop_est continent      name iso_a3 gdp_md_est
 type        :     <num>     <chr>     <chr>  <chr>      <int>
 values      :   8.9e+05   Oceania      Fiji    FJI       5496
               5.801e+07    Africa  Tanzania    TZA      63177
               6.033e+05    Africa W. Sahara    ESH        907

check your version and drivers with

terra::gdal()


terra::gdal(drivers = TRUE)

https://gdal.org/drivers/vector/parquet.html

On ubuntu use the unstable ppa to get 3.6 (though still that might not have the reqs)

https://launchpad.net/~ubuntugis/+archive/ubuntu/ubuntugis-unstable

mdsumner avatar Nov 25 '23 07:11 mdsumner

On Windows, R 4.3.2 includes GDAL 3.7.2, but it doesn't seem to support parquet (driver not compiled).

Edit: You are right, it's 3.7.2 now, not 3.6.2.

kadyb avatar Nov 25 '23 11:11 kadyb

let's get onto that then, Roger has retired 🙏

(my windows has 3.7.2 I think, but I don't pay any attention to that os anymore except to check cran versions )

mdsumner avatar Nov 25 '23 11:11 mdsumner

@kalibera, could you please consider adding this driver to GDAL in R? But maybe this won't be necessary if @paleolimbot completes the {geoarrow} package?

kadyb avatar Nov 25 '23 13:11 kadyb

I asked Tomas a week ago and he wrote:

Apache Arrow does not have a build configuration in MXE, so someone would have to create that. I can try, but if you could get someone else to do that, it might happen sooner. The same applies to Apache Parquet if that is needed separately.

rhijmans avatar Nov 25 '23 17:11 rhijmans

A build configuration for MXE is on our radar for the Arrow R package, too, but because the Arrow R package requires so many of Arrow's optional features (far beyond just Parquet), it's likely to be quite a bit of work. I think using the Parquet reader in the Arrow R package is probably the only reasonable approach for the foreseeable future.

From terra's side, the way to maximize performance in both cases would be to implement conversion from a WKB represented by nanoarrow_array into whatever representation is used for geometry by SpatVect.

paleolimbot avatar Nov 26 '23 00:11 paleolimbot