sf
sf copied to clipboard
distinct() error with sticky geom column
Copied over from https://github.com/tidyverse/dplyr/issues/5378
When using distinct() with sf objects, the sticky geom column can cause unexpected errors because it is not ignored:
library(maps)
library(sf)
library(dplyr)
county_sf = st_as_sf(maps::map(database = "county", plot=FALSE, fill=TRUE))
county_sf %>% distinct(vars(ID))
Error in CPL_geos_binop(st_geometry(x), st_geometry(y), op, par, pattern, : Evaluation error: TopologyException: side location conflict at -104.96586608886719 40.0096435546875.
I would expect this to return all rows that have distinct IDs, ignoring 'geom' and not failing with an error.
Currently, distinct.sf works on the variables specified and the geometry column; since that one, in your example, contains invalid geometries the error occurs when finding distinct geometries. A way out is to select the valid geometries first, e.g. by
county_sf %>% filter(st_is_valid(.)) %>% distinct(vars(ID))
but this still filters on distinct geometries, in addition to those you specify. I can see the point that you wouldn't expect that.
aha! Now that I know to look for it, I found the following:
"distinct gives distinct records for which all attributes and geometries are distinct; st_equals is used to find out which geometries are distinct." in the sf::tidyverse help page that results from ??distinct.sf
But this behavior is still surprising as I would expect to have to specify the geom column(s) in the call if I actually wanted to include it/them in distinct()?
I was about to post something along the same lines (problem using distinct() with sf objects)
library(sf)
library(dplyr)
Areas <- st_as_sf(tibble(
Area =c("Zone1", "Zone1","Zone2","Zone1"),
Zone =c("Area27","Area27","Area42","Area27"),
lng = c(20.1, 20.2, 20.1, 20.1),
lat = c(-1.1, -1.2, -1.1, -1.1)),
coords = c("lng", "lat"))
Areas %>% distinct(across(-geometry),.keep_all=TRUE)
In my example I've tried to explicitly exclude the geom column, but it doesn't work when I would expect it to.
I've put together a workaround, but I'm not sure how scalable this solution would be with larger datasets
Areas %>% group_by(Area,Zone) %>%
mutate(id = row_number()) %>%
filter(id == 1) %>%
select(-id)
Hello, I was thinking about this issue. I think how distinct.sf should work ties a bit into -how it distinguishes itself from dplyr::distinct that works for data frames. The output of distinct.sf should be an sf object in my opinion (I am open to discussion on this. )
However If column arguments passed do not have the geometry column specified and have .keep_all=FALSE, then perhaps there would be no difference between distinct.sf and dplyr::distinct. One could just do (to get the same output, without compromising on scalability):
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(sf)
#> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
Areas <- tibble::tibble(
Area =c("Zone1", "Zone1","Zone2","Zone1"),
Zone =c("Area27","Area27","Area42","Area27"),
lng = c(20.1, 20.2, 20.1, 20.1),
lat = c(-1.1, -1.2, -1.1, -1.1))
dplyr::distinct(Areas, Area, Zone, .keep_all=TRUE)
#> # A tibble: 2 x 4
#> Area Zone lng lat
#> <chr> <chr> <dbl> <dbl>
#> 1 Zone1 Area27 20.1 -1.1
#> 2 Zone2 Area42 20.1 -1.1
Created on 2021-01-15 by the reprex package (v0.3.0)
That being said I do feel, distinct.sf should be able to accept other column arguments and find distinctiveness on the basis of those as well.
So I am proposing the following function for distinct.sf and have provided its use case/benchmark here
t_n <- function(.data, ..., .keep_all = FALSE) {
if (!requireNamespace("dplyr", quietly = TRUE))
stop("dplyr required: install that first") # nocov
if (!requireNamespace("rlang", quietly = TRUE))
stop("rlang required: install first?")
sf_column = attr(.data, "sf_column")
geom = st_geometry(.data)
if(length(which(st_is_valid(.data) == FALSE)) >= 1 ){
stop("Invalid Geometry")
}
.data = as.data.frame(.data) %>%
dplyr::distinct(. , ..., .keep_all = .keep_all)
if (!(sf_column %in% colnames(.data))) {
rlang::abort(
message = "No geometry column passed. Pass .keep_all = TRUE")
}
st_as_sf(.data)
}
Created on 2021-01-15 by the reprex package (v0.3.0) UseCase
library(sf)
#> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(tidyverse)
Areas <- st_as_sf(tibble(
Area =c("Zone1", "Zone1","Zone2","Zone1"),
Zone =c("Area27","Area27","Area42","Area27"),
lng = c(20.1, 20.2, 20.1, 20.1),
lat = c(-1.1, -1.2, -1.1, -1.1)),
coords = c("lng", "lat"))
t_n(Areas, Area, Zone, geometry)
#> Simple feature collection with 3 features and 2 fields
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: 20.1 ymin: -1.2 xmax: 20.2 ymax: -1.1
#> CRS: NA
#> Area Zone geometry
#> 1 Zone1 Area27 POINT (20.1 -1.1)
#> 2 Zone1 Area27 POINT (20.2 -1.2)
#> 3 Zone2 Area42 POINT (20.1 -1.1)
t_n(Areas, Area, Zone, .keep_all = TRUE)
#> Simple feature collection with 2 features and 2 fields
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: 20.1 ymin: -1.1 xmax: 20.1 ymax: -1.1
#> CRS: NA
#> Area Zone geometry
#> 1 Zone1 Area27 POINT (20.1 -1.1)
#> 2 Zone2 Area42 POINT (20.1 -1.1)
county_sf = st_as_sf(maps::map(database = "county", plot=FALSE, fill=TRUE))
county_sf %>% filter(st_is_valid(.)) %>% t_n(ID, .keep_all = TRUE) %>% dim()
#> [1] 3044 2
nc = st_read(system.file("shape/nc.shp", package="sf"))
#> Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/3.6/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> CRS: 4267
nc= nc[c(1:100, 1:10), ]
dim(nc)
#> [1] 110 15
t_n(nc) %>% dim()
#> [1] 100 15
Created on 2021-01-15 by the reprex package (v0.3.0) Benchmarking
nc = st_read(system.file("shape/nc.shp", package="sf"))
#> Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/3.6/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> CRS: 4267
nc= nc[c(1:100, 1:10), ] #%>% distinct() %>% nrow()
o_func <- function() distinct(nc)
n_func <- function() t_n(nc)
bench::mark(o_func(), n_func(), check = FALSE)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 o_func() 56.3ms 63.3ms 15.7 772KB 2.24
#> 2 n_func() 36.2ms 40.1ms 21.5 782KB 32.2
Created on 2021-01-15 by the reprex package (v0.3.0)
In terms of documentation, the arguments have the same meaning as dplyr::distinct. I have not used st_equals which I feel could be disadvantageous for certain use cases. If such examples exist, I look forward to hearing about them or generally any suggestions/feedback on the matter.
I am happy to put in a PR for testing.
Thanks,