`st_as_sf()` only works after `open_dataset()` but not after `read_parquet()`
Hey Dewey,
Thanks so much for your work on this, really excited to start using it!
I noticed that reading and converting a parquet file written with geoarrow to an sf object only works if you open the file with open_dataset() first, but not if you read it with read_parquet() (reprex below).
Maybe this is by design? But I thought I'd flag in case it's not!
Thanks, Paul
library(geoarrow)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#> The repository you retrieved Arrow from did not include all of Arrow's features.
#> You can install a fully-featured version by running:
#> `install.packages('arrow', repos = 'https://apache.r-universe.dev')`.
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.9.0, PROJ 9.4.0; sf_use_s2() is TRUE
nc <- read_sf(system.file("gpkg/nc.gpkg", package = "sf"))
tf <- tempfile(fileext = ".parquet")
nc |>
tibble::as_tibble() |>
write_parquet(tf)
# this works
open_dataset(tf) |> st_as_sf()
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS: NAD27
#> First 10 features:
#> AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
#> 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
#> 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
#> 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
#> 4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
#> 5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
#> 6 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
#> 7 0.062 1.547 1834 1834 Camden 37029 37029 15 286 0
#> 8 0.091 1.284 1835 1835 Gates 37073 37073 37 420 0
#> 9 0.118 1.421 1836 1836 Warren 37185 37185 93 968 4
#> 10 0.124 1.428 1837 1837 Stokes 37169 37169 85 1612 1
#> NWBIR74 BIR79 SID79 NWBIR79 geom
#> 1 10 1364 0 19 MULTIPOLYGON (((-81.47276 3...
#> 2 10 542 3 12 MULTIPOLYGON (((-81.23989 3...
#> 3 208 3616 6 260 MULTIPOLYGON (((-80.45634 3...
#> 4 123 830 2 145 MULTIPOLYGON (((-76.00897 3...
#> 5 1066 1606 3 1197 MULTIPOLYGON (((-77.21767 3...
#> 6 954 1838 5 1237 MULTIPOLYGON (((-76.74506 3...
#> 7 115 350 2 139 MULTIPOLYGON (((-76.00897 3...
#> 8 254 594 2 371 MULTIPOLYGON (((-76.56251 3...
#> 9 748 1190 2 844 MULTIPOLYGON (((-78.30876 3...
#> 10 160 2038 5 176 MULTIPOLYGON (((-80.02567 3...
# this doesn't
read_parquet(tf) |> st_as_sf()
#> Error in st_sf(x, ..., agr = agr, sf_column_name = sf_column_name): no simple features geometry column present
# but it does if you convert the geom col first
read_parquet(tf) |> dplyr::mutate(geom = st_as_sfc(geom)) |> st_as_sf()
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS: NAD27
#> # A tibble: 100 × 15
#> AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74
#> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
#> 2 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0 10
#> 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
#> 4 0.07 2.97 1831 1831 Curr… 37053 37053 27 508 1 123
#> 5 0.153 2.21 1832 1832 Nort… 37131 37131 66 1421 9 1066
#> 6 0.097 1.67 1833 1833 Hert… 37091 37091 46 1452 7 954
#> 7 0.062 1.55 1834 1834 Camd… 37029 37029 15 286 0 115
#> 8 0.091 1.28 1835 1835 Gates 37073 37073 37 420 0 254
#> 9 0.118 1.42 1836 1836 Warr… 37185 37185 93 968 4 748
#> 10 0.124 1.43 1837 1837 Stok… 37169 37169 85 1612 1 160
#> # ℹ 90 more rows
#> # ℹ 4 more variables: BIR79 <dbl>, SID79 <dbl>, NWBIR79 <dbl>,
#> # geom <MULTIPOLYGON [°]>
Created on 2024-07-15 with reprex v2.0.2
Thanks for opening!
This is not by design, per se, but I think it happens because st_as_sf() will only recognize columns that inherit from sfc. In order for this to work, it would have to apply some heuristics to recognize other geometry-like columns within an existing data.frame. For things that haven't yet been converted to a data.frame, we own the st_as_sf implementation, which is why the conversion works 🙂 .
I should probably open an issue in sf for this!
-dewey
Makes sense! Thanks for clarifying 👍