rsgeo icon indicating copy to clipboard operation
rsgeo copied to clipboard

Benchmark

Open kadyb opened this issue 2 years ago • 13 comments

Two years ago I created the benchmark testset comparing packages for processing vector data (sf, terra, geopandas, geos, s2), and now I'm naturally interested in how rsgeo looks like (and maybe @robinlovelace too 😃). rsgeo is a low-level interface, so the number of operations is limited -- I only tested the calculation of Euclidean distance between points and geometry intersects (it's pity that buffering is not originally implemented). My conclusions:

  1. In your comparison in README, the calculation of Euclidean distance is more than 10 times faster than in sf. However, for a larger dataset, the difference is about 30% (or another reason, the results can be outdated). It seems to me that here the best solution is to transform the geometry to a data frame and use C++ functions (such as Rfast::Dist()).
  2. I see that rustgeo doesn't support prepared geometries unlike geos, so there are big differences here.
Benchmark
library("sf")
library("geos")
library("rsgeo")
library("Rfast") # for fast Euclidean distance
set.seed(1)

poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 300000
pts_sf = st_sample(poly_sf, n)

pts_rs = as_rsgeom(pts_sf)
pts_geos = as_geos_geometry(pts_sf)
poly_rs = as_rsgeom(poly_sf)
poly_geos = as_geos_geometry(poly_sf)

### distance between points ----------------------------------------------------
system.time(st_distance(pts_sf[1:5000], which = "Euclidean"))
#>  user  system elapsed
#> 1.817   0.519   2.338
system.time(euclidean_distance_matrix(pts_rs[1:5000], pts_rs[1:5000]))
#>  user  system elapsed
#> 1.526   0.320   1.848
system.time({
  crds = st_coordinates(pts_sf[1:5000])
  Rfast::Dist(crds, method = "euclidean")
})
#>  user  system elapsed
#> 0.589   0.120   0.707

### geometry intersects --------------------------------------------------------
system.time(intersects_sparse(pts_rs, poly_rs))
#>  user  system elapsed
#> 1.415   0.072   1.486

system.time(geos_intersects_matrix(pts_geos, poly_geos))
#>  user  system elapsed
#> 0.105   0.000   0.105

kadyb avatar Jul 10 '23 21:07 kadyb

You're right! Thank you! I'm going to embark on a complete rewrite of rsgeo in the coming weeks. One thing I can do is make sure that all point distance and binary predicates first build an R* tree first.

Although that may be a bit unnecessary for when there are not enough geometries to justify it.

What do you think? Always build the tree or optionally or have separate functions for it?

JosiahParry avatar Jul 10 '23 22:07 JosiahParry

Note that while there aren't "prepared" geometries we can do the work manually on the Rust side by creating the tree and inserting into it.

JosiahParry avatar Jul 10 '23 22:07 JosiahParry

suggestion to make the Rfast one even faster: replace

  crds = st_coordinates(pts_sf[1:5000])

with an sfheaders alternative...

Robinlovelace avatar Jul 11 '23 06:07 Robinlovelace

What do you think? Always build the tree or optionally or have separate functions for it?

Maybe optionally? In {sf} there is the prepared argument. In {terra} it works automatically, but you can use a C++ reference to do it manually (like x@ptr$relate_between()).

BTW1: Will tree building help anything in calculating the distance matrix? I think it is brute calculating the distance between all points?

BTW2: I am not expert, but I understand that you are proposing to implement spatial index and not prepared geometry? I understand that these are two separate things? I rely on this post from Edzer: https://r-spatial.org/r/2017/06/22/spatial-index.html

kadyb avatar Jul 11 '23 08:07 kadyb

Re slowness of euclidean distance matrix I think this is a result of two things:

  1. excessive use of cloning https://github.com/JosiahParry/rsgeo/blob/565e7ae7a4483c28860dc28a810d58dcf98c18fe/src/rust/src/distance.rs#L27
  2. excessive matching https://github.com/JosiahParry/rsgeo/blob/565e7ae7a4483c28860dc28a810d58dcf98c18fe/src/rust/src/distance.rs#L36

There are two matches per iteration which is quite a lot. I think this can be helped when https://github.com/georust/geo/pull/1029/files/2709d7f841af05551e1d065403a853dbe5fdde70 is merged.

Re prepared geometries (it's suprising that there is like, 0 documentation or academic papers easily found and its all source code and probably github reviews), my understanding prepared geometries comes largely from geo-rust community and discussion https://github.com/georust/geo/issues/803 is that you create a spatial index upfront (this is the preparation). Then the index can be used in subsequent calls etc to make narrowing down candidates far faster.

JosiahParry avatar Jul 16 '23 20:07 JosiahParry

@Robinlovelace, little offtopic, but I tried {sfheaders} to extract coordinates from geometry and it is slower than {sf}. However, it doesn't really matter because this operation has little overhead. The second thing, maybe {sfheaders} shouldn't return two columns (i.e. sfg_id, point_id) -- they are rather unnecessary for points.

Benchmark
library("sf")
library("sfheaders")

poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 500000
pts_sf = st_sample(poly_sf, n)

t = bench::mark(
  check = FALSE, iterations = 10,
  sf = st_coordinates(pts_sf),
  sfheaders = sfc_to_df(pts_sf)
)
t[, 1:5]
#>   expression      min   median `itr/sec` mem_alloc
#> 1 sf           8.69ms   11.2ms     45.1     15.3MB
#> 2 sfheaders  145.88ms  152.1ms      6.61      21MB

kadyb avatar Jul 18 '23 20:07 kadyb

Very interesting! I've found sfheaders v. fast for some things but clearly not all. Thanks for all these benchmarks, really helpful if not essential for working out what to use when speed's an issue!

Robinlovelace avatar Jul 18 '23 20:07 Robinlovelace

This should be vastly improved in the new refactor branch. I just pushed a commit to parallelize the distance matrix.

It's not complete yet but the calculation portion is there. Just need to move from a list to a matrix.

https://github.com/JosiahParry/rsgeo/commit/a2ab18a61ed0433bcc039bdb73f3d8d90835729a#diff-ba040067b83f26349efd96faedb319914955849e8949c3a157330a1c0d895dae

image

JosiahParry avatar Aug 25 '23 02:08 JosiahParry

The one to compare against is geos, faster than rsgeo for union in mini benchmark I put together: https://ogh23.robinlovelace.net/tidy#rsgeo

Robinlovelace avatar Aug 25 '23 04:08 Robinlovelace

It's definitely going to remain faster for unioning. I think there will be a minor speed up compared to your existing benchmark but geos is very good here 😛

JosiahParry avatar Aug 25 '23 11:08 JosiahParry

BTW @Robinlovelace, isn't the world object CRS WGS84? Then you are comparing the s2 spherical engine in {sf} with planar in {geos} and {rsgeo}. From my observations, in general GEOS is faster than s2. In this case, you can also use the argument is_coverage = TRUE and it will be even faster.

kadyb avatar Aug 25 '23 12:08 kadyb

BTW @Robinlovelace, isn't the world object CRS WGS84?

Correct, although as we're doing a union the results shouldn't differ too much, right (although I did notice several differences...)? Benchmark adaptations incoming!

Robinlovelace avatar Aug 25 '23 13:08 Robinlovelace

A note on the distance matrix calculations. Rfast calculates a distance matrix between a vector and itself. sf can do one or the other. When you calculate a distance matrix where y = x you only need to calculate one of the triangles and the diagonal is always 0. When the other vector is not identical you do not have symmetry and cannot shortcut the calculation. When we pass in a value of y for st_distance() the computation is so much slower. And to compare to Rfast we have to use dista().

As of right now rsgeo always computes the full dense matrix. I can and should add an enhancement to check if x is equal to y and short cut it.

Edit: adding bench mark for Rfast and new distance matrix. When comparing apples to apples they are on par with eachother.

library("sf")
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
set.seed(1)

poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)

n = 5000
pts_sf = st_sample(poly_sf, n)

system.time(st_distance(pts_sf))
#>    user  system elapsed 
#>   0.671   0.094   0.770
system.time(st_distance(pts_sf, rev(pts_sf)))
#>    user  system elapsed 
#>   6.197   0.064   6.278

library(rsgeo)

x <- rsgeo:::from_sfc(pts_sf)

system.time(distance_euclidean_matrix(x, rev(x)))
#>    user  system elapsed 
#>   0.348   0.168   0.321

crds <- coords(x)
c2 <- coords(rev(x))

system.time(Rfast::dista(crds, c2))
#>    user  system elapsed 
#>   0.228   0.034   0.266


bench::mark(
  Rfast = Rfast::dista(crds, c2),
  rsgeo = distance_euclidean_matrix(x, rev(x)),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 Rfast         263ms    266ms      3.76     382MB     7.51
#> 2 rsgeo         215ms    252ms      3.97     191MB     1.99

JosiahParry avatar Aug 25 '23 18:08 JosiahParry