sf icon indicating copy to clipboard operation
sf copied to clipboard

distinct() error with sticky geom column

Open baderstine opened this issue 5 years ago • 4 comments

Copied over from https://github.com/tidyverse/dplyr/issues/5378

When using distinct() with sf objects, the sticky geom column can cause unexpected errors because it is not ignored:

library(maps)
library(sf)
library(dplyr)

county_sf = st_as_sf(maps::map(database = "county", plot=FALSE, fill=TRUE))

county_sf %>% distinct(vars(ID))

Error in CPL_geos_binop(st_geometry(x), st_geometry(y), op, par, pattern, : Evaluation error: TopologyException: side location conflict at -104.96586608886719 40.0096435546875.

I would expect this to return all rows that have distinct IDs, ignoring 'geom' and not failing with an error.

baderstine avatar Aug 28 '20 15:08 baderstine

Currently, distinct.sf works on the variables specified and the geometry column; since that one, in your example, contains invalid geometries the error occurs when finding distinct geometries. A way out is to select the valid geometries first, e.g. by

county_sf %>% filter(st_is_valid(.)) %>% distinct(vars(ID))

but this still filters on distinct geometries, in addition to those you specify. I can see the point that you wouldn't expect that.

edzer avatar Aug 28 '20 16:08 edzer

aha! Now that I know to look for it, I found the following: "distinct gives distinct records for which all attributes and geometries are distinct; st_equals is used to find out which geometries are distinct." in the sf::tidyverse help page that results from ??distinct.sf

But this behavior is still surprising as I would expect to have to specify the geom column(s) in the call if I actually wanted to include it/them in distinct()?

baderstine avatar Aug 28 '20 16:08 baderstine

I was about to post something along the same lines (problem using distinct() with sf objects)

library(sf)
library(dplyr)

Areas <- st_as_sf(tibble(
      Area =c("Zone1", "Zone1","Zone2","Zone1"),
      Zone =c("Area27","Area27","Area42","Area27"),
      lng = c(20.1, 20.2, 20.1, 20.1),
      lat = c(-1.1, -1.2, -1.1, -1.1)),
      coords = c("lng", "lat")) 

Areas %>% distinct(across(-geometry),.keep_all=TRUE)

In my example I've tried to explicitly exclude the geom column, but it doesn't work when I would expect it to.

I've put together a workaround, but I'm not sure how scalable this solution would be with larger datasets

Areas %>% group_by(Area,Zone) %>% 
          mutate(id = row_number()) %>% 
          filter(id == 1) %>% 
          select(-id)

geogale avatar Sep 03 '20 08:09 geogale

Hello, I was thinking about this issue. I think how distinct.sf should work ties a bit into -how it distinguishes itself from dplyr::distinct that works for data frames. The output of distinct.sf should be an sf object in my opinion (I am open to discussion on this. ) However If column arguments passed do not have the geometry column specified and have .keep_all=FALSE, then perhaps there would be no difference between distinct.sf and dplyr::distinct. One could just do (to get the same output, without compromising on scalability):

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(sf)
#> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
Areas <- tibble::tibble(
  Area =c("Zone1", "Zone1","Zone2","Zone1"),
  Zone =c("Area27","Area27","Area42","Area27"),
  lng = c(20.1, 20.2, 20.1, 20.1),
  lat = c(-1.1, -1.2, -1.1, -1.1))

dplyr::distinct(Areas, Area, Zone, .keep_all=TRUE) 
#> # A tibble: 2 x 4
#>   Area  Zone     lng   lat
#>   <chr> <chr>  <dbl> <dbl>
#> 1 Zone1 Area27  20.1  -1.1
#> 2 Zone2 Area42  20.1  -1.1

Created on 2021-01-15 by the reprex package (v0.3.0) That being said I do feel, distinct.sf should be able to accept other column arguments and find distinctiveness on the basis of those as well.

So I am proposing the following function for distinct.sf and have provided its use case/benchmark here

t_n <- function(.data, ..., .keep_all = FALSE) {
  if (!requireNamespace("dplyr", quietly = TRUE))
    stop("dplyr required: install that first") # nocov
  if (!requireNamespace("rlang", quietly = TRUE))
    stop("rlang required: install first?")
  
  sf_column = attr(.data, "sf_column")
  geom = st_geometry(.data)
  if(length(which(st_is_valid(.data) == FALSE)) >= 1 ){
    stop("Invalid Geometry")
  }
  .data = as.data.frame(.data) %>% 
    dplyr::distinct(. , ..., .keep_all = .keep_all)
  if (!(sf_column %in% colnames(.data))) {
    rlang::abort(
      message = "No geometry column passed. Pass .keep_all = TRUE")
  }
  
  st_as_sf(.data)
  
}

Created on 2021-01-15 by the reprex package (v0.3.0) UseCase

library(sf)
#> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(tidyverse)

Areas <- st_as_sf(tibble(
  Area =c("Zone1", "Zone1","Zone2","Zone1"),
  Zone =c("Area27","Area27","Area42","Area27"),
  lng = c(20.1, 20.2, 20.1, 20.1),
  lat = c(-1.1, -1.2, -1.1, -1.1)),
  coords = c("lng", "lat"))
t_n(Areas, Area, Zone, geometry)
#> Simple feature collection with 3 features and 2 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 20.1 ymin: -1.2 xmax: 20.2 ymax: -1.1
#> CRS:            NA
#>    Area   Zone          geometry
#> 1 Zone1 Area27 POINT (20.1 -1.1)
#> 2 Zone1 Area27 POINT (20.2 -1.2)
#> 3 Zone2 Area42 POINT (20.1 -1.1)
t_n(Areas, Area, Zone, .keep_all = TRUE)
#> Simple feature collection with 2 features and 2 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 20.1 ymin: -1.1 xmax: 20.1 ymax: -1.1
#> CRS:            NA
#>    Area   Zone          geometry
#> 1 Zone1 Area27 POINT (20.1 -1.1)
#> 2 Zone2 Area42 POINT (20.1 -1.1)

county_sf = st_as_sf(maps::map(database = "county", plot=FALSE, fill=TRUE))
county_sf %>% filter(st_is_valid(.)) %>% t_n(ID, .keep_all = TRUE) %>% dim()
#> [1] 3044    2

nc = st_read(system.file("shape/nc.shp", package="sf"))
#> Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/3.6/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> CRS:            4267
nc=  nc[c(1:100, 1:10), ]
dim(nc)
#> [1] 110  15
t_n(nc) %>% dim()
#> [1] 100  15

Created on 2021-01-15 by the reprex package (v0.3.0) Benchmarking

nc = st_read(system.file("shape/nc.shp", package="sf"))
#> Reading layer `nc' from data source `/Library/Frameworks/R.framework/Versions/3.6/Resources/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> CRS:            4267
nc=  nc[c(1:100, 1:10), ] #%>% distinct() %>% nrow()
o_func <- function() distinct(nc)
n_func <- function() t_n(nc)
bench::mark(o_func(), n_func(), check = FALSE)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 o_func()     56.3ms   63.3ms      15.7     772KB     2.24
#> 2 n_func()     36.2ms   40.1ms      21.5     782KB    32.2

Created on 2021-01-15 by the reprex package (v0.3.0) In terms of documentation, the arguments have the same meaning as dplyr::distinct. I have not used st_equals which I feel could be disadvantageous for certain use cases. If such examples exist, I look forward to hearing about them or generally any suggestions/feedback on the matter.

I am happy to put in a PR for testing.

Thanks,

KiranmayiV avatar Jan 13 '21 09:01 KiranmayiV