s2
s2 copied to clipboard
`s2_union_agg()` is slow
The strategy of accumulating a union is correct but slow! I assume there is a faster way to go about this that I haven't found in the documentation yet.
library(s2)
library(sf)
#> Warning: package 'sf' was built under R version 4.0.5
#> Linking to GEOS 3.8.1, GDAL 3.2.0, PROJ 7.2.0
countries <- s2_data_countries()
countries_sf <- st_as_sf(countries)
bench::mark(
s2_coverage_union_agg(countries),
s2_union_agg(countries),
st_union(countries_sf),
check = F
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 s2_coverage_union_agg(countries) 100.89ms 101.7ms 9.59 61.7KB
#> 2 s2_union_agg(countries) 2.82s 2.82s 0.355 43.7KB
#> 3 st_union(countries_sf) 141.69ms 144.13ms 6.90 666.1KB
#> # … with 1 more variable: gc/sec <dbl>
Created on 2021-04-28 by the reprex package (v0.3.0)
This is not a solution for the speed of s2_union_agg. I am posting a wrapper function that might be useful to someone landing on this issue, such as I did while looking for faster processing using s2_union_agg
.
The function below s2_union_split_agg
is a wrapper function to speed s2 union transformation of large data sets. For small data sets that are highly intersecting (e.g. countries) this workaround does not have much effect and might even be slower. However, on large data sets with sparse features, s2_union_split_agg
is extremely fast because it splits the features into intersecting groups of features and applies s2_union_agg
only to these groups independently (potentially in parallel). It solves internal boundaries first for each intersecting group then combines all results into one object.
s2_union_split_agg
needed about 70 seconds to perform the union on a personal global data set with 47,069 polygon features, including about 4,000 intersecting polygons. I did not measure the time using s2_union_agg
on this data set because I had to kill the process after about 1.5h. Below is an example using the countries
data set. Hope this is useful to someone.
library(s2)
library(sf)
#> Linking to GEOS 3.9.0, GDAL 3.2.2, PROJ 7.2.1
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
s2_union_split_agg <- function(x, options = s2_options()){
# check for intersects
idx_intersects <- s2::s2_intersects_matrix(x, x, options = options)
# prapare groups of intersects
idx_list <- lapply(seq_along(idx_intersects)[lengths(idx_intersects) > 1], function(i) c(i, idx_intersects[[i]]))
group_list <- tibble::tibble(fid = integer(), gid = integer())
for(i in idx_list){
# check if intersecting fids already exist in the groups
gids <- dplyr::filter(group_list, fid %in% unique(i))
if(nrow(gids) > 0){
# get for new fids
new_ids <- unique(i)[!unique(i) %in% group_list$fid]
# get existing groups
new_gid <- unique(gids$gid)[1]
# add new fids to existing groups
group_list <- tibble::add_row(group_list, fid = new_ids, gid = new_gid)
# merge groups if new fids overlap several groups
group_list <- dplyr::mutate(group_list, gid = ifelse(gid %in% gids$gid, new_gid, gid))
} else {
# create a new group id
gid <- ifelse(nrow(group_list) == 0, 0, max(group_list$gid)) + 1
# add all new fids to the new group
group_list <- dplyr::bind_rows(tibble::tibble(fid = unique(i), gid = gid), group_list)
}
}
# split features based on groups of intersects
x_split <- split(x[group_list$fid], group_list$gid)
# spare features that do not intersect other features
x <- x[-group_list$fid]
# apply s2_union_agg to groups
x_union <- lapply(x_split, s2_union_agg, options = options)
# gather results into single object
result <- s2_rebuild_agg(c(x, do.call("c", x_union)), options = options)
return(result)
}
countries <- s2_data_countries()
system.time(res1 <- s2_union_split_agg(countries, options = s2_options(model = "closed")))
#> user system elapsed
#> 4.067 0.028 4.105
system.time(res2 <- s2_union_agg(countries, options = s2_options(model = "closed")))
#> user system elapsed
#> 4.527 0.002 4.530
# the output geometries have no difference
s2_difference(res1, res2)
#> <s2_geography[1]>
#> [1] <GEOMETRYCOLLECTION EMPTY>
# but the objects are not identical
identical(res1, res2)
#> [1] FALSE
plot(st_as_sf(res1))
plot(st_as_sf(res2))
Created on 2021-09-17 by the reprex package (v2.0.1)
I never got around to thanking you for this example! I can and will implement something like this at the C++ level to speed up this type of union...hopefully for the next release!
I still can and will make better use of this idea of intersecting groups, but I'm including a slight improvement in #165 that will make this a bit faster (although it's still O(n)).
# remotes::install_github("r-spatial/s2#165")
library(s2)
library(ggplot2)
res <- bench::press(
n = c(1:5),
bench::mark(s2_union_agg(rep(s2_data_countries(), n)))
)
#> Running with:
#> n
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
ggplot(res, aes(n, median)) +
geom_point()
# install.packages("s2")
library(s2)
library(ggplot2)
res <- bench::press(
n = c(1:5),
bench::mark(s2_union_agg(rep(s2_data_countries(), n)))
)
#> Running with:
#> n
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
ggplot(res, aes(n, median)) +
geom_point()
Created on 2022-03-05 by the reprex package (v2.0.1)