Failing to speed up `st_union` / find access to `rgeos::gUnaryUnion` via `sf`
I am trying to speed up a 'dissolve' operation on the following moderately-sized spatial object (link to RDS file, 950 kB), which has an attribute called DCOMIRIS to identify the geometries:
library(sp)
library(rgeos)
library(sf)
packageVersion("sp") # 2.2.0
packageVersion("rgeos") # 0.6.4
packageVersion("sf") # 1.0.23
# example data
# https://f.briatte.org/temp/example_spatial_object.rds
xx <- readRDS("example_spatial_object.rds")
class(xx) # SpatialPolygonsDataFrame
length(xx@data$DCOMIRIS) # 992 polygons
sum(duplicated(xx@data$DCOMIRIS)) # 176 polygons with duplicate IDs
# de-duplicated result should have 992 - 176 = 816 polygons
Using rgeos::gUnaryUnion on it gets the job done fast enough:
# [1] merge with {rgeos}
with_rgeos <- rgeos::gUnaryUnion(xx, id = xx@data$DCOMIRIS)
length(with_rgeos) # okay
However, the same operation with sf::st_union is much slower:
# [2] merge with {sf}
with_sf <- sf::st_as_sf(xx) %>%
group_by(DCOMIRIS) %>%
summarise(geometry = sf::st_union(geometry))
nrow(with_sf) # okay, but very much slower
I have resorted to splitting the data in two parts, in order to apply sf::st_union only to those polygons that I know share the same ID.
My first attempt to do so is below. It fails for some reason that I cannot understand:
# [3] merge with {sf}, trying to skip single-id polygons
with_sf_fail <- sf::st_as_sf(xx) %>%
group_by(DCOMIRIS) %>%
add_count(name = "n_geom") %>%
summarise(geometry = if_else(n_geom == 1, geometry,
sf::st_union(geometry)))
nrow(with_sf_fail) # failed to merge -- why?
# `with_sf_fail` also lost its grouping variable DCOMIRIS -- why?
Splitting the data in two different datasets actually works, but it is still much slower than the rgeos method listed above:
# [4] merge with {sf}, separating the data in two slices
# slice 1, duplicate-ID polygons
z1 <- sf::st_as_sf(xx) %>%
group_by(DCOMIRIS) %>%
filter(n() > 1) %>%
summarise()
# slice 2, single-ID polygons (dropping non-grouping attributes)
z2 <- sf::st_as_sf(xx) %>%
# line below is slightly faster than having `group_by` separately
filter(n() == 1, .by = DCOMIRIS) %>%
select(DCOMIRIS, geometry)
with_sf_split <- sf:::rbind.sf(z1, z2)
nrow(with_sf_split) # okay, but... still slow
Questions
- I am wondering whether there is any way to either call
rgeos::gUnaryUnionfromsf, or to speed up the last 'method' shown above. - As an aside, I am also unsure about where the error lies in the code of my third 'method' (although I am conscious that solving the error would probably not provide me with a solution to my speed issue).
I have read related issue, e.g. #661, but did not get the impression that they have led to a solution. All apologies if they have, or if I missed anything else that should be obvious.
Context: trying to update an old but valuable package that went into forced retirement when rgeos and maptools were discontinued.
Could you please provide example data that can be opened? The file is ASCII (text), and:
> xx <- infoRDS("tmp/example_spatial_object.rds")
Error in infoRDS("tmp/example_spatial_object.rds") : unknown input format
Further, which package and where is its public repo?
Dear @rsbivand
Apologies if the file did not work on first try. It should now, and here's what I get on my end after downloading it:
> infoRDS("~/Downloads/example_spatial_object2.rds")
$version
[1] 2
$writer_version
[1] "4.5.2"
$min_reader_version
[1] "2.3.0"
$format
[1] "xdr"
> class(readRDS(file = "~/Downloads/example_spatial_object2.rds"))
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
I am attaching the file to this comment just in case.
Further, which package and where is its public repo?
I am trying to update this package, which uses maptools (and rgeos through it) there.
Looking at package spReapportion, have you tried using sf::st_interpolate_aw() as an alternative? See here.
For changes to migrate to sf or terra, please see https://github.com/r-spatial/evolution; there are many documents listed there. The package in question was not on CRAN or BioConductor, so was not found in 2022-2023 when we were retiring outdated implementations of interfaces to external software.
To your first question:
> infoRDS("example_spatial_object2.rds")
$version
[1] 2
$writer_version
[1] "4.5.2"
$min_reader_version
[1] "2.3.0"
$format
[1] "xdr"
> library(sp)
> library(sf)
Linking to GEOS 3.14.1, GDAL 3.12.0, PROJ 9.7.1; sf_use_s2() is TRUE
> xx <- readRDS("example_spatial_object2.rds")
> class(xx) # SpatialPolygonsDataFrame
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
> library(dplyr)
> system.time(xx |> sf::st_as_sf() |> group_by(DCOMIRIS) |> summarise(geometry = sf::st_union(geometry)) -> with_sf)
user system elapsed
8.562 0.611 9.327
> system.time(xx[, "DCOMIRIS"] |> sf::st_as_sf() |> aggregate(list(xx$DCOMIRIS), head, n=1) -> with_sf1)
user system elapsed
0.339 0.004 0.345
> length(xx$DCOMIRIS)
[1] 992
> length(unique(xx$DCOMIRIS))
[1] 979
> sum(duplicated(xx$DCOMIRIS))
[1] 13
> length(unique(xx$DCOMIRIS)) + sum(duplicated(xx$DCOMIRIS))
[1] 992
sf::aggregate is the correct match for rgeos::unaryUnion.
As @edzer wrote, you may well find that sf::st_interpolate_aw offers what your package provided.
If you need to re-create output from your package, you'll find MacOS binaries of retired packages for R 4.2 and 4.3, and Windows binaries for 4.2 - 4.4 at https://github.com/r-spatial/evolution/tree/main/backstore (on Linux you would install rgeos from source on a version of GEOS matching the R package). I would try to see how st_interpolate_aw matches your desired output, based on https://r-spatial.org/book/05-Attributes.html#sec-area-weighted and references therein. If it does, simply use that. If not, establish a baseline output from installing your package on an older version of R with matching rgeos (easiest as binary), and re-implement the functionality you need possibly using materials in https://github.com/r-spatial/evolution with sf.
A side note - handling of coordinate reference systems has also changed, where I think the EPSG code for what you had in the example data set is 2154.
Dear @rsbivand
This works perfectly, thank you very much!
@edzer -- I will make sure to take a look at sf::st_interpolate_aw() too, thanks for this.