fsubset doesn't recalculate bbox for sf objects
library(sf)
library(collapse)
library(dplyr)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
st_bbox(nc)
#> xmin ymin xmax ymax
#> -84.32385 33.88199 -75.45698 36.58965
st_bbox(nc %>% filter(NAME == "Ashe"))
#> xmin ymin xmax ymax
#> -81.74107 36.23436 -81.23989 36.58965
st_bbox(nc %>% fsubset(NAME == "Ashe"))
#> xmin ymin xmax ymax
#> -84.32385 33.88199 -75.45698 36.58965
Created on 2022-10-03 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os Ubuntu 22.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en_NZ:en
#> collate en_NZ.UTF-8
#> ctype en_NZ.UTF-8
#> tz Pacific/Auckland
#> date 2022-10-03
#> pandoc 2.18 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [3] RSPM (R 4.2.0)
#> class 7.3-20 2022-01-13 [3] RSPM (R 4.2.0)
#> classInt 0.4-7 2022-06-10 [3] RSPM (R 4.2.0)
#> cli 3.4.1 2022-09-23 [3] RSPM (R 4.2.0)
#> collapse * 1.8.8 2022-08-15 [3] RSPM (R 4.2.0)
#> DBI 1.1.3 2022-06-18 [3] RSPM (R 4.2.0)
#> digest 0.6.29 2021-12-01 [3] RSPM (R 4.2.0)
#> dplyr * 1.0.10 2022-09-01 [3] RSPM (R 4.2.0)
#> e1071 1.7-11 2022-06-07 [3] RSPM (R 4.2.0)
#> evaluate 0.16 2022-08-09 [3] RSPM (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [3] RSPM (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [3] RSPM (R 4.2.0)
#> fs 1.5.2 2021-12-08 [3] RSPM (R 4.2.0)
#> generics 0.1.3 2022-07-05 [3] RSPM (R 4.2.0)
#> glue 1.6.2 2022-02-24 [3] RSPM (R 4.2.0)
#> highr 0.9 2021-04-16 [3] RSPM (R 4.2.0)
#> htmltools 0.5.3 2022-07-18 [3] RSPM (R 4.2.0)
#> KernSmooth 2.23-20 2021-05-03 [3] RSPM (R 4.2.0)
#> knitr 1.40 2022-08-24 [3] RSPM (R 4.2.0)
#> lifecycle 1.0.2 2022-09-09 [3] RSPM (R 4.2.1)
#> magrittr 2.0.3 2022-03-30 [3] RSPM (R 4.2.0)
#> pillar 1.8.1 2022-08-19 [3] RSPM (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [3] RSPM (R 4.2.0)
#> proxy 0.4-27 2022-06-09 [3] RSPM (R 4.2.0)
#> purrr 0.3.4 2020-04-17 [3] RSPM (R 4.2.0)
#> R.cache 0.16.0 2022-07-21 [3] RSPM (R 4.2.0)
#> R.methodsS3 1.8.2 2022-06-13 [3] RSPM (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [3] RSPM (R 4.2.0)
#> R.utils 2.12.0 2022-06-28 [3] RSPM (R 4.2.0)
#> R6 2.5.1 2021-08-19 [3] RSPM (R 4.2.0)
#> Rcpp 1.0.9 2022-07-08 [3] RSPM (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [3] RSPM (R 4.2.0)
#> rlang 1.0.6 2022-09-24 [3] RSPM (R 4.2.0)
#> rmarkdown 2.16 2022-08-24 [3] RSPM (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [3] RSPM (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [3] RSPM (R 4.2.0)
#> sf * 1.0-8 2022-07-14 [3] RSPM (R 4.2.0)
#> stringi 1.7.8 2022-07-11 [3] RSPM (R 4.2.0)
#> stringr 1.4.1 2022-08-20 [3] RSPM (R 4.2.0)
#> styler 1.7.0 2022-03-13 [3] RSPM (R 4.2.0)
#> tibble 3.1.8 2022-07-22 [3] RSPM (R 4.2.0)
#> tidyselect 1.1.2 2022-02-21 [3] RSPM (R 4.2.0)
#> units 0.8-0 2022-02-05 [3] RSPM (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [3] RSPM (R 4.2.0)
#> vctrs 0.4.1 2022-04-13 [3] RSPM (R 4.2.0)
#> withr 2.5.0 2022-03-03 [3] RSPM (R 4.2.0)
#> xfun 0.33 2022-09-12 [3] RSPM (R 4.2.0)
#> yaml 2.3.5 2022-02-21 [3] RSPM (R 4.2.0)
#>
#> [1] /home/kendonb/R/x86_64-pc-linux-gnu-library/4.2
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Yes, collapse does not perform any 'spatial' operations. If the bounding box is important to your application, you need to use st_make_valid() or similar afterwards.
This could also be addressed in the sf package, since I believe the code for filter.sf lives there. @edzer, would this be something you would be interested in?
Just to be clear conceptually: collapse is so fast because I do almost everything at C-level without calling other functions or methods present at the R level. fsubset() is no exception. For sf objects I simply check for the geometry column and preserve it in addition to any other selected columns. All this happens at C-level, and I do not intend to write a C program that goes through all the geometries to recalculate the bounding box (also given that this is rather expensive and often not needed).
dplyr::filter, on the other hand, works quite differently: I believe it eventually invokes [, so that for sf objects you get a call to [.sf, which includes code to recalculate the bounding box. So there is a tradeoff here between performance and accuracy, and you'll need to choose depending on what your application requires. I note that base::subset also calls [ (and thus gives the correct bbox as well), and has less overhead than filter:
library(sf)
library(collapse)
library(dplyr)
library(microbenchmark)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
#> Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS: NAD27
microbenchmark(nc %>% filter(NAME == "Ashe"),
nc %>% subset(NAME == "Ashe"),
nc %>% fsubset(NAME == "Ashe"))
#> Warning in microbenchmark(nc %>% filter(NAME == "Ashe"), nc %>% subset(NAME == : less accurate nanosecond times to avoid potential integer overflows
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> nc %>% filter(NAME == "Ashe") 1411.917 1493.3430 1876.27111 1609.148 1825.791 10725.067 100
#> nc %>% subset(NAME == "Ashe") 240.998 277.0165 356.71599 299.505 336.077 3212.883 100
#> nc %>% fsubset(NAME == "Ashe") 5.617 7.2570 14.23356 10.783 16.646 152.438 100
Created on 2022-10-04 by the reprex package (v2.0.1)
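If the speed of fsubset() is wanted but a correct bounding box is needed at the end, one option is to recompute the bbox once, after subsetting. The following is only a sketch, assuming the behaviour reported in this thread (collapse 1.8.8, sf 1.0-8) and that the installed sf version provides the recompute_bbox argument of st_sfc(); the object name ashe is just for illustration.

```r
library(sf)
library(collapse)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Fast subset at C-level; the stored bbox attribute may still describe all of nc.
ashe <- fsubset(nc, NAME == "Ashe")

# Rebuild the geometry column and ask sf to recompute the bounding box
# from the geometries that are actually left (assumes st_sfc() supports
# recompute_bbox in the installed sf version).
st_geometry(ashe) <- st_sfc(st_geometry(ashe),
                            crs = st_crs(ashe),
                            recompute_bbox = TRUE)

st_bbox(ashe)
```

This pays the cost of the bbox recalculation only where it matters, so intermediate fsubset() calls in a pipeline stay cheap.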