sf
sf copied to clipboard
Low-hanging performance improvement opportunities
As hinted on Twitter and as per our discussion earlier, I think there are a few easy things that could speed up a number of things considerably! The main one is st_sfc(), which vapply()s along the list a few times. For a million points, the identity subset takes over two seconds:
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.2.3, PROJ 7.2.1; sf_use_s2() is TRUE
n <- 1e6
points <- data.frame(id = seq_len(n), x = runif(n), y = runif(n))
points_sf <- st_as_sf(points, coords = c("x", "y"))
system.time(points_sf$geometry[])
#> user system elapsed
#> 2.009 0.070 2.078
The vapply() calls are here:
https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L51
https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L56
https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L84
https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L124
https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L128
Given the amount of times [.sfc and st_sfc() get called, I think it's worth some C++ to calculate the sfc attributes in a single pass. I'm happy to contribute this if it would be welcome.
One other quality-of-life improvement would be removing the vapply() in print.sfc(), which takes about a second for the above points example:
https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L199
It seems like that information is already in the class() anyway? I'm also happy to contribute that fix.
This was a little more involved than I'd hoped! The searching for NULL values is still needed, but dropping the need for looking for NAs in the list (generated by merge) helps quite a bit; this is now obsolete by the added is.na<-.sfc method which merge.data.frame() calls.
Awesome!
One other vapply() is taking up a lot of time as well, which I found playing with your commit! I think this one is an easy fix (lengths() can probably do it).
https://github.com/r-spatial/sf/blob/d73e0c799c0f47093a1c7085a001fe950621c91c/R/bbox.R#L13
(edit: actually that looks harder...it's looking for the empty point?)
I think you probably still want to check for nulls and check for classes in C++...I used the subset to illustrate the problem, but reading from a file and a GEOS/S2 op also go through st_sfc() and the vapply()s take a significant amount of time here too (I checked using RStudio's profiler to make sure it wasn't something else). Some ideas:
# remotes::install_github("r-spatial/sf@d73e0c7")
library(sf)
#> Linking to GEOS 3.10.1, GDAL 3.3.3, PROJ 8.1.0; sf_use_s2() is TRUE
n <- 1e6
points <- data.frame(id = seq_len(n), x = runif(n), y = runif(n))
system.time(points_sf <- st_as_sf(points, coords = c("x", "y")))
#> user system elapsed
#> 0.408 0.026 0.434
# nice!
system.time(points_sf$geometry[])
#> user system elapsed
#> 0.699 0.062 0.762
# but doesn't solve the case of a GEOS or S2 operation, which comes back as WKB
system.time(centroids <- st_centroid(points_sf))
#> Warning in st_centroid.sf(points_sf): st_centroid assumes attributes are
#> constant over geometries of x
#> user system elapsed
#> 4.197 0.113 4.315
# ...or st_read(), where this can take a significant amount of time
temp <- tempfile(fileext = ".fgb")
write_sf(points_sf, temp)
system.time(roundtrip <- read_sf(temp))
#> user system elapsed
#> 5.215 0.130 5.354
(edit: actually that looks harder...it's looking for the empty point?)
yes, the lengths could be dropped and anyNA should be sufficient - the length() is there from an early stage where I deliberated representing a POINT EMPTY by a zero length numeric vector, but that never happened.
I still think that some C++ is going to help here...I drafted some bits below that may help and would be happy to PR them if that is easier. The gist of it is that the vapply() stops being usable after a million elements...when we get to 10 million elements, the overhead of the R function calls takes up 1 - 4 seconds per vapply(), and there are a few of them for each st_sfc() call.
I think it's possible to get the whole st_sfc() operation down to < 0.1 s per million points with a few short functions. I've drafted two of them here...they can be written in C as well if you're worried about compile time. It might be worth it anyway since Rcpp incurs some significant overhead when dealing with strings in a loop.
library(sf)
#> Linking to GEOS 3.10.1, GDAL 3.3.3, PROJ 8.1.0; sf_use_s2() is TRUE
n <- 1e6
points <- data.frame(id = seq_len(n), x = runif(n), y = runif(n))
points_sf <- st_as_sf(points, coords = c("x", "y"))
Rcpp::sourceCpp(code = '
#include <unordered_set>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
LogicalVector sfc_is_null(List sfc) {
LogicalVector out(sfc.size());
for (R_xlen_t i = 0; i < sfc.size(); i++) {
out[i] = sfc[i] == R_NilValue;
}
return out;
}
// [[Rcpp::export]]
List sfc_unique_sfg_classes(List sfc) {
if (sfc.size() == 0) {
return(List::create(CharacterVector::create(), CharacterVector::create()));
}
std::unordered_set<std::string> class_dim;
std::unordered_set<std::string> class_type;
for (R_xlen_t i = 0; i < sfc.size(); i++) {
SEXP item = sfc[i];
// This looks like a huge mess, but circumventing Rcpp for this
// check is about 10x faster
if (Rf_inherits(item, "XY")) {
class_dim.insert("XY");
} else if (Rf_inherits(item, "XYZ")) {
class_dim.insert("XYZ");
} else if (Rf_inherits(item, "XYM")) {
class_dim.insert("XYM");
} else if (Rf_inherits(item, "XYZM")) {
class_dim.insert("XYZM");
}
if (Rf_inherits(item, "POINT")) {
class_type.insert("POINT");
continue;
} else if (Rf_inherits(item, "LINESTRING")) {
class_type.insert("LINESTRING");
continue;
} else if (Rf_inherits(item, "POLYGON")) {
class_type.insert("POLYGON");
continue;
} else if (Rf_inherits(item, "MULTIPOINT")) {
class_type.insert("MULTIPOINT");
continue;
} else if (Rf_inherits(item, "MULTILINESTRING")) {
class_type.insert("MULTILINESTRING");
continue;
} else if (Rf_inherits(item, "MULTIPOLYGON")) {
class_type.insert("MULTIPOLYGON");
continue;
}
RObject itemSlow(sfc[i]);
RObject classes = itemSlow.attr("class");
if (classes == R_NilValue) {
continue;
}
CharacterVector classes_chr(classes);
if (classes_chr.size() == 3) {
class_type.insert(static_cast<std::string>(classes_chr[1]));
}
}
CharacterVector class_dim_chr(class_dim.begin(), class_dim.end());
CharacterVector class_type_chr(class_type.begin(), class_type.end());
return List::create(class_dim_chr, class_type_chr);
}
')
bench::mark(
vapply(points_sf$geometry, is.null, logical(1)),
sfc_is_null(points_sf$geometry),
sfc_unique_sfg_classes(points_sf$geometry),
check = FALSE,
iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 vapply(points_sf$geometry, is.null, logical(1)) 464.56ms 468.26ms 2.07
#> 2 sfc_is_null(points_sf$geometry) 3.06ms 3.07ms 302.
#> 3 sfc_unique_sfg_classes(points_sf$geometry) 40.58ms 40.89ms 24.3
#> # … with 2 more variables: mem_alloc <bch:byt>, gc/sec <dbl>
Created on 2022-04-05 by the reprex package (v2.0.1)
Yes, that looks certainly worth the effort, and readable code - PR very welcome!
I just installed the new version and did quick tests. With these changes, significant speedup is also visable in other functions:
st_sample(): 13.8 s -> 8.4 s (40%)read_sf(): 5.2 s -> 1.3 s (75%)st_transform(): 5.6 s -> 1.9 s (66%)
Great work!
Today I re-ran benchmark and after update sf 1.0.9 -> 1.0.12 and R 4.2.2 -> 4.3.0 I can see that the st_sample() function is ~3 times faster (7.44 s -> 2.28 s). I checked the changelog, but this is mystery to me what is the reason for this. Now sf::st_sample() is even faster than terra::spatSample() (5.64 s).