fsubset doesn't recalculate bbox for sf objects

Open kendonB opened this issue 3 years ago • 3 comments

library(sf)
library(collapse)
library(dplyr)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
st_bbox(nc)
#>      xmin      ymin      xmax      ymax 
#> -84.32385  33.88199 -75.45698  36.58965
st_bbox(nc %>% filter(NAME == "Ashe"))
#>      xmin      ymin      xmax      ymax 
#> -81.74107  36.23436 -81.23989  36.58965
st_bbox(nc %>% fsubset(NAME == "Ashe"))
#>      xmin      ymin      xmax      ymax 
#> -84.32385  33.88199 -75.45698  36.58965

Created on 2022-10-03 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       Ubuntu 22.04.1 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_NZ:en
#>  collate  en_NZ.UTF-8
#>  ctype    en_NZ.UTF-8
#>  tz       Pacific/Auckland
#>  date     2022-10-03
#>  pandoc   2.18 @ /usr/lib/rstudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [3] RSPM (R 4.2.0)
#>  class         7.3-20  2022-01-13 [3] RSPM (R 4.2.0)
#>  classInt      0.4-7   2022-06-10 [3] RSPM (R 4.2.0)
#>  cli           3.4.1   2022-09-23 [3] RSPM (R 4.2.0)
#>  collapse    * 1.8.8   2022-08-15 [3] RSPM (R 4.2.0)
#>  DBI           1.1.3   2022-06-18 [3] RSPM (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [3] RSPM (R 4.2.0)
#>  dplyr       * 1.0.10  2022-09-01 [3] RSPM (R 4.2.0)
#>  e1071         1.7-11  2022-06-07 [3] RSPM (R 4.2.0)
#>  evaluate      0.16    2022-08-09 [3] RSPM (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [3] RSPM (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [3] RSPM (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [3] RSPM (R 4.2.0)
#>  generics      0.1.3   2022-07-05 [3] RSPM (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [3] RSPM (R 4.2.0)
#>  highr         0.9     2021-04-16 [3] RSPM (R 4.2.0)
#>  htmltools     0.5.3   2022-07-18 [3] RSPM (R 4.2.0)
#>  KernSmooth    2.23-20 2021-05-03 [3] RSPM (R 4.2.0)
#>  knitr         1.40    2022-08-24 [3] RSPM (R 4.2.0)
#>  lifecycle     1.0.2   2022-09-09 [3] RSPM (R 4.2.1)
#>  magrittr      2.0.3   2022-03-30 [3] RSPM (R 4.2.0)
#>  pillar        1.8.1   2022-08-19 [3] RSPM (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [3] RSPM (R 4.2.0)
#>  proxy         0.4-27  2022-06-09 [3] RSPM (R 4.2.0)
#>  purrr         0.3.4   2020-04-17 [3] RSPM (R 4.2.0)
#>  R.cache       0.16.0  2022-07-21 [3] RSPM (R 4.2.0)
#>  R.methodsS3   1.8.2   2022-06-13 [3] RSPM (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [3] RSPM (R 4.2.0)
#>  R.utils       2.12.0  2022-06-28 [3] RSPM (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [3] RSPM (R 4.2.0)
#>  Rcpp          1.0.9   2022-07-08 [3] RSPM (R 4.2.0)
#>  reprex        2.0.2   2022-08-17 [3] RSPM (R 4.2.0)
#>  rlang         1.0.6   2022-09-24 [3] RSPM (R 4.2.0)
#>  rmarkdown     2.16    2022-08-24 [3] RSPM (R 4.2.0)
#>  rstudioapi    0.14    2022-08-22 [3] RSPM (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [3] RSPM (R 4.2.0)
#>  sf          * 1.0-8   2022-07-14 [3] RSPM (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [3] RSPM (R 4.2.0)
#>  stringr       1.4.1   2022-08-20 [3] RSPM (R 4.2.0)
#>  styler        1.7.0   2022-03-13 [3] RSPM (R 4.2.0)
#>  tibble        3.1.8   2022-07-22 [3] RSPM (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [3] RSPM (R 4.2.0)
#>  units         0.8-0   2022-02-05 [3] RSPM (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [3] RSPM (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [3] RSPM (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [3] RSPM (R 4.2.0)
#>  xfun          0.33    2022-09-12 [3] RSPM (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [3] RSPM (R 4.2.0)
#> 
#>  [1] /home/kendonb/R/x86_64-pc-linux-gnu-library/4.2
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

kendonB avatar Oct 03 '22 01:10 kendonB

Yes, collapse does not perform any 'spatial' operations. You need to use st_make_valid() or similar if this is important to your application.

SebKrantz avatar Oct 03 '22 10:10 SebKrantz

This could also be addressed in the sf package, where I believe the code for filter.sf lives. Would this be something you would be interested in, @edzer?

kendonB avatar Oct 03 '22 21:10 kendonB

Just to be clear conceptually: collapse is so fast because I do almost everything at the C level, without calling other functions or methods present at the R level. fsubset() is no exception. For sf objects, I simply check for the geometry column and preserve it in addition to any other selected columns. All of this happens at the C level, and I do not intend to write a C program that goes through all the geometries to recalculate the bounding box (especially since this is rather expensive and often not needed).

dplyr::filter, on the other hand, works quite differently. I believe it eventually invokes [, so that for sf objects you get a call to [.sf, which includes code to recalculate the bounding box. So there is a tradeoff here between performance and accuracy, and you'll need to choose depending on what your application requires. Note that base::subset also calls [ (and thus gives the correct bbox as well) and has less overhead than filter:

library(sf)
library(collapse)
library(dplyr)
library(microbenchmark)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
#> Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27

microbenchmark(nc %>% filter(NAME == "Ashe"),
               nc %>% subset(NAME == "Ashe"),
               nc %>% fsubset(NAME == "Ashe"))
#> Warning in microbenchmark(nc %>% filter(NAME == "Ashe"), nc %>% subset(NAME == : less accurate nanosecond times to avoid potential integer overflows
#> Unit: microseconds
#>                            expr      min        lq       mean   median       uq       max neval
#>   nc %>% filter(NAME == "Ashe") 1411.917 1493.3430 1876.27111 1609.148 1825.791 10725.067   100
#>   nc %>% subset(NAME == "Ashe")  240.998  277.0165  356.71599  299.505  336.077  3212.883   100
#>  nc %>% fsubset(NAME == "Ashe")    5.617    7.2570   14.23356   10.783   16.646   152.438   100

Created on 2022-10-04 by the reprex package (v2.0.1)

SebKrantz avatar Oct 03 '22 22:10 SebKrantz
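
For readers who want the speed of fsubset() but need a correct bounding box afterwards, one possible workaround is to recompute the bbox only when it is actually required. The following is a minimal sketch and was not proposed in the thread. It assumes the installed sf version supports the recompute_bbox argument of st_sfc(). The object name ashe is only illustrative.

```r
library(sf)
library(collapse)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Fast C-level row subset; per the thread, the geometry column
# still carries the bounding box of the full data set at this point.
ashe <- fsubset(nc, NAME == "Ashe")

# Rebuild the geometry column and force the bbox to be recomputed.
# Assumption: st_sfc() in the installed sf version has recompute_bbox.
st_geometry(ashe) <- st_sfc(st_geometry(ashe),
                            crs = st_crs(ashe),
                            recompute_bbox = TRUE)

# These two should now agree.
st_bbox(ashe)
st_bbox(nc[nc$NAME == "Ashe", ])
```

This keeps the subset itself cheap and pays the cost of walking the geometries only at the point where the bounding box matters, for example just before plotting or exporting.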