Benchmark
Two years ago I created the benchmark testset comparing packages for processing vector data (sf, terra, geopandas, geos, s2), and now I'm naturally interested in how rsgeo looks like (and maybe @robinlovelace too 😃). rsgeo is a low-level interface, so the number of operations is limited -- I only tested the calculation of Euclidean distance between points and geometry intersects (it's pity that buffering is not originally implemented). My conclusions:
- In your comparison in README, the calculation of Euclidean distance is more than 10 times faster than in
sf. However, for a larger dataset, the difference is about 30% (or another reason, the results can be outdated). It seems to me that here the best solution is to transform the geometry to a data frame and use C++ functions (such asRfast::Dist()). - I see that
rustgeodoesn't support prepared geometries unlikegeos, so there are big differences here.
Benchmark
library("sf")
library("geos")
library("rsgeo")
library("Rfast") # for fast Euclidean distance
set.seed(1)
poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 300000
pts_sf = st_sample(poly_sf, n)
pts_rs = as_rsgeom(pts_sf)
pts_geos = as_geos_geometry(pts_sf)
poly_rs = as_rsgeom(poly_sf)
poly_geos = as_geos_geometry(poly_sf)
### distance between points ----------------------------------------------------
system.time(st_distance(pts_sf[1:5000], which = "Euclidean"))
#> user system elapsed
#> 1.817 0.519 2.338
system.time(euclidean_distance_matrix(pts_rs[1:5000], pts_rs[1:5000]))
#> user system elapsed
#> 1.526 0.320 1.848
system.time({
crds = st_coordinates(pts_sf[1:5000])
Rfast::Dist(crds, method = "euclidean")
})
#> user system elapsed
#> 0.589 0.120 0.707
### geometry intersects --------------------------------------------------------
system.time(intersects_sparse(pts_rs, poly_rs))
#> user system elapsed
#> 1.415 0.072 1.486
system.time(geos_intersects_matrix(pts_geos, poly_geos))
#> user system elapsed
#> 0.105 0.000 0.105
You're right! Thank you! I'm going to embark on a complete rewrite of rsgeo in the coming weeks. One thing I can do is make sure that all point distance and binary predicates first build an R* tree first.
Although that may be a bit unnecessary for when there are not enough geometries to justify it.
What do you think? Always build the tree or optionally or have separate functions for it?
Note that while there aren't "prepared" geometries we can do the work manually on the Rust side by creating the tree and inserting into it.
suggestion to make the Rfast one even faster: replace
crds = st_coordinates(pts_sf[1:5000])
with an sfheaders alternative...
What do you think? Always build the tree or optionally or have separate functions for it?
Maybe optionally? In {sf} there is the prepared argument. In {terra} it works automatically, but you can use a C++ reference to do it manually (like x@ptr$relate_between()).
BTW1: Will tree building help anything in calculating the distance matrix? I think it is brute calculating the distance between all points?
BTW2: I am not expert, but I understand that you are proposing to implement spatial index and not prepared geometry? I understand that these are two separate things? I rely on this post from Edzer: https://r-spatial.org/r/2017/06/22/spatial-index.html
Re slowness of euclidean distance matrix I think this is a result of two things:
- excessive use of cloning https://github.com/JosiahParry/rsgeo/blob/565e7ae7a4483c28860dc28a810d58dcf98c18fe/src/rust/src/distance.rs#L27
- excessive matching https://github.com/JosiahParry/rsgeo/blob/565e7ae7a4483c28860dc28a810d58dcf98c18fe/src/rust/src/distance.rs#L36
There are two matches per iteration which is quite a lot. I think this can be helped when https://github.com/georust/geo/pull/1029/files/2709d7f841af05551e1d065403a853dbe5fdde70 is merged.
Re prepared geometries (it's suprising that there is like, 0 documentation or academic papers easily found and its all source code and probably github reviews), my understanding prepared geometries comes largely from geo-rust community and discussion https://github.com/georust/geo/issues/803 is that you create a spatial index upfront (this is the preparation). Then the index can be used in subsequent calls etc to make narrowing down candidates far faster.
@Robinlovelace, little offtopic, but I tried {sfheaders} to extract coordinates from geometry and it is slower than {sf}. However, it doesn't really matter because this operation has little overhead. The second thing, maybe {sfheaders} shouldn't return two columns (i.e. sfg_id, point_id) -- they are rather unnecessary for points.
Benchmark
library("sf")
library("sfheaders")
poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 500000
pts_sf = st_sample(poly_sf, n)
t = bench::mark(
check = FALSE, iterations = 10,
sf = st_coordinates(pts_sf),
sfheaders = sfc_to_df(pts_sf)
)
t[, 1:5]
#> expression min median `itr/sec` mem_alloc
#> 1 sf 8.69ms 11.2ms 45.1 15.3MB
#> 2 sfheaders 145.88ms 152.1ms 6.61 21MB
Very interesting! I've found sfheaders v. fast for some things but clearly not all. Thanks for all these benchmarks, really helpful if not essential for working out what to use when speed's an issue!
This should be vastly improved in the new refactor branch. I just pushed a commit to parallelize the distance matrix.
It's not complete yet but the calculation portion is there. Just need to move from a list to a matrix.
https://github.com/JosiahParry/rsgeo/commit/a2ab18a61ed0433bcc039bdb73f3d8d90835729a#diff-ba040067b83f26349efd96faedb319914955849e8949c3a157330a1c0d895dae
The one to compare against is geos, faster than rsgeo for union in mini benchmark I put together: https://ogh23.robinlovelace.net/tidy#rsgeo
It's definitely going to remain faster for unioning. I think there will be a minor speed up compared to your existing benchmark but geos is very good here 😛
BTW @Robinlovelace, isn't the world object CRS WGS84? Then you are comparing the s2 spherical engine in {sf} with planar in {geos} and {rsgeo}. From my observations, in general GEOS is faster than s2. In this case, you can also use the argument is_coverage = TRUE and it will be even faster.
BTW @Robinlovelace, isn't the
worldobject CRS WGS84?
Correct, although as we're doing a union the results shouldn't differ too much, right (although I did notice several differences...)? Benchmark adaptations incoming!
A note on the distance matrix calculations. Rfast calculates a distance matrix between a vector and itself. sf can do one or the other. When you calculate a distance matrix where y = x you only need to calculate one of the triangles and the diagonal is always 0. When the other vector is not identical you do not have symmetry and cannot shortcut the calculation. When we pass in a value of y for st_distance() the computation is so much slower. And to compare to Rfast we have to use dista().
As of right now rsgeo always computes the full dense matrix. I can and should add an enhancement to check if x is equal to y and short cut it.
Edit: adding bench mark for Rfast and new distance matrix. When comparing apples to apples they are on par with eachother.
library("sf")
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
set.seed(1)
poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 5000
pts_sf = st_sample(poly_sf, n)
system.time(st_distance(pts_sf))
#> user system elapsed
#> 0.671 0.094 0.770
system.time(st_distance(pts_sf, rev(pts_sf)))
#> user system elapsed
#> 6.197 0.064 6.278
library(rsgeo)
x <- rsgeo:::from_sfc(pts_sf)
system.time(distance_euclidean_matrix(x, rev(x)))
#> user system elapsed
#> 0.348 0.168 0.321
crds <- coords(x)
c2 <- coords(rev(x))
system.time(Rfast::dista(crds, c2))
#> user system elapsed
#> 0.228 0.034 0.266
bench::mark(
Rfast = Rfast::dista(crds, c2),
rsgeo = distance_euclidean_matrix(x, rev(x)),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 Rfast 263ms 266ms 3.76 382MB 7.51
#> 2 rsgeo 215ms 252ms 3.97 191MB 1.99