sf Low-hanging performance improvement opportunities

As hinted on Twitter and as per our discussion earlier, I think there are a few easy things that could speed up a number of things considerably! The main one is st_sfc(), which vapply()s along the list a few times. For a million points, the identity subset takes over two seconds:

library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.2.3, PROJ 7.2.1; sf_use_s2() is TRUE
n <- 1e6
points <- data.frame(id = seq_len(n), x = runif(n), y = runif(n))
points_sf <- st_as_sf(points, coords = c("x", "y"))
system.time(points_sf$geometry[])
#>    user  system elapsed 
#>   2.009   0.070   2.078

The vapply() calls are here:

https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L51

https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L56

https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L84

https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L124

https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L128

Given the amount of times [.sfc and st_sfc() get called, I think it's worth some C++ to calculate the sfc attributes in a single pass. I'm happy to contribute this if it would be welcome.

One other quality-of-life improvement would be removing the vapply() in print.sfc(), which takes about a second for the above points example:

https://github.com/r-spatial/sf/blob/4b455d5edae1af1b65764636e0fbe8ecdcd1c3ed/R/sfc.R#L199

It seems like that information is already in the class() anyway? I'm also happy to contribute that fix.

Apr 02 '22 01:04 paleolimbot

This was a little more involved than I'd hoped! The searching for NULL values is still needed, but dropping the need for looking for NAs in the list (generated by merge) helps quite a bit; this is now obsolete by the added is.na<-.sfc method which merge.data.frame() calls.

Apr 03 '22 21:04 edzer

Awesome!

One other vapply() is taking up a lot of time as well, which I found playing with your commit! I think this one is an easy fix (lengths() can probably do it).

https://github.com/r-spatial/sf/blob/d73e0c799c0f47093a1c7085a001fe950621c91c/R/bbox.R#L13

(edit: actually that looks harder...it's looking for the empty point?)

I think you probably still want to check for nulls and check for classes in C++...I used the subset to illustrate the problem, but reading from a file and a GEOS/S2 op also go through st_sfc() and the vapply()s take a significant amount of time here too (I checked using RStudio's profiler to make sure it wasn't something else). Some ideas:

# remotes::install_github("r-spatial/sf@d73e0c7")
library(sf)
#> Linking to GEOS 3.10.1, GDAL 3.3.3, PROJ 8.1.0; sf_use_s2() is TRUE

n <- 1e6
points <- data.frame(id = seq_len(n), x = runif(n), y = runif(n))
system.time(points_sf <- st_as_sf(points, coords = c("x", "y")))
#>    user  system elapsed 
#>   0.408   0.026   0.434

# nice!
system.time(points_sf$geometry[])
#>    user  system elapsed 
#>   0.699   0.062   0.762

# but doesn't solve the case of a GEOS or S2 operation, which comes back as WKB
system.time(centroids <- st_centroid(points_sf))
#> Warning in st_centroid.sf(points_sf): st_centroid assumes attributes are
#> constant over geometries of x
#>    user  system elapsed 
#>   4.197   0.113   4.315

# ...or st_read(), where this can take a significant amount of time
temp <- tempfile(fileext = ".fgb")
write_sf(points_sf, temp)
system.time(roundtrip <- read_sf(temp))
#>    user  system elapsed 
#>   5.215   0.130   5.354

Apr 04 '22 01:04 paleolimbot

(edit: actually that looks harder...it's looking for the empty point?)

yes, the lengths could be dropped and anyNA should be sufficient - the length() is there from an early stage where I deliberated representing a POINT EMPTY by a zero length numeric vector, but that never happened.

Apr 04 '22 07:04 edzer

I still think that some C++ is going to help here...I drafted some bits below that may help and would be happy to PR them if that is easier. The gist of it is that the vapply() stops being usable after a million elements...when we get to 10 million elements, the overhead of the R function calls takes up 1 - 4 seconds per vapply(), and there are a few of them for each st_sfc() call.

I think it's possible to get the whole st_sfc() operation down to < 0.1 s per million points with a few short functions. I've drafted two of them here...they can be written in C as well if you're worried about compile time. It might be worth it anyway since Rcpp incurs some significant overhead when dealing with strings in a loop.

library(sf)
#> Linking to GEOS 3.10.1, GDAL 3.3.3, PROJ 8.1.0; sf_use_s2() is TRUE

n <- 1e6
points <- data.frame(id = seq_len(n), x = runif(n), y = runif(n))
points_sf <- st_as_sf(points, coords = c("x", "y"))

Rcpp::sourceCpp(code = '
#include <unordered_set>

#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
LogicalVector sfc_is_null(List sfc) {
    LogicalVector out(sfc.size());

    for (R_xlen_t i = 0; i < sfc.size(); i++) {
        out[i] = sfc[i] == R_NilValue;
    }

    return out;
}

// [[Rcpp::export]]
List sfc_unique_sfg_classes(List sfc) {
    if (sfc.size() == 0) {
        return(List::create(CharacterVector::create(), CharacterVector::create()));
    }

    std::unordered_set<std::string> class_dim;
    std::unordered_set<std::string> class_type;

    for (R_xlen_t i = 0; i < sfc.size(); i++) {
        SEXP item = sfc[i];

        // This looks like a huge mess, but circumventing Rcpp for this
        // check is about 10x faster
        if (Rf_inherits(item, "XY")) {
            class_dim.insert("XY");
        } else if (Rf_inherits(item, "XYZ")) {
            class_dim.insert("XYZ");
        } else if (Rf_inherits(item, "XYM")) {
            class_dim.insert("XYM");
        } else if (Rf_inherits(item, "XYZM")) {
            class_dim.insert("XYZM");
        }

        if (Rf_inherits(item, "POINT")) {
            class_type.insert("POINT");
            continue;
        } else if (Rf_inherits(item, "LINESTRING")) {
            class_type.insert("LINESTRING");
            continue;
        } else if (Rf_inherits(item, "POLYGON")) {
            class_type.insert("POLYGON");
            continue;
        } else if (Rf_inherits(item, "MULTIPOINT")) {
            class_type.insert("MULTIPOINT");
            continue;
        } else if (Rf_inherits(item, "MULTILINESTRING")) {
            class_type.insert("MULTILINESTRING");
            continue;
        } else if (Rf_inherits(item, "MULTIPOLYGON")) {
            class_type.insert("MULTIPOLYGON");
            continue;
        }


        RObject itemSlow(sfc[i]);
        RObject classes = itemSlow.attr("class");
        if (classes == R_NilValue) {
            continue;
        }

        CharacterVector classes_chr(classes);
        if (classes_chr.size() == 3) {
            class_type.insert(static_cast<std::string>(classes_chr[1]));
        }
    }

    CharacterVector class_dim_chr(class_dim.begin(), class_dim.end());
    CharacterVector class_type_chr(class_type.begin(), class_type.end());

    return List::create(class_dim_chr, class_type_chr);
}
')

bench::mark(
    vapply(points_sf$geometry, is.null, logical(1)),
    sfc_is_null(points_sf$geometry),
    sfc_unique_sfg_classes(points_sf$geometry),
    check = FALSE,
    iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 vapply(points_sf$geometry, is.null, logical(1)) 464.56ms 468.26ms      2.07
#> 2 sfc_is_null(points_sf$geometry)                   3.06ms   3.07ms    302.  
#> 3 sfc_unique_sfg_classes(points_sf$geometry)       40.58ms  40.89ms     24.3 
#> # … with 2 more variables: mem_alloc <bch:byt>, gc/sec <dbl>

^{Created on 2022-04-05 by the reprex package (v2.0.1)}

Apr 05 '22 19:04 paleolimbot

Yes, that looks certainly worth the effort, and readable code - PR very welcome!

Apr 07 '22 20:04 edzer

I just installed the new version and did quick tests. With these changes, significant speedup is also visable in other functions:

st_sample(): 13.8 s -> 8.4 s (40%)
read_sf(): 5.2 s -> 1.3 s (75%)
st_transform(): 5.6 s -> 1.9 s (66%)

Great work!

Jul 14 '22 22:07 kadyb

Today I re-ran benchmark and after update sf 1.0.9 -> 1.0.12 and R 4.2.2 -> 4.3.0 I can see that the st_sample() function is ~3 times faster (7.44 s -> 2.28 s). I checked the changelog, but this is mystery to me what is the reason for this. Now sf::st_sample() is even faster than terra::spatSample() (5.64 s).

May 02 '23 09:05 kadyb