CoordinateCleaner

Bug: wrong record flagged in clean_coordinates using data.frame

Open ltalluto opened this issue 3 years ago • 0 comments

I'm running into a weird issue with clean_coordinates; a reproducible example is below. I have a data set with an obvious outlier. The basic workflow is: (1) get data from GBIF, (1a) convert the tibble to a data.frame, (2) remove NA coordinates using subset, (3) run clean_coordinates.

If I skip step 1a (i.e., send a tibble to clean_coordinates instead of a data.frame), the outlier is correctly flagged. If I do step 1a, a record is flagged, but it is the wrong row. If I convert to data.frame after subset, everything is fine. If I use square brackets instead of subset.data.frame, everything is fine. I have verified that the data in all columns are identical after subsetting, regardless of whether I subset the tibble or the data.frame.

R 3.6.3, macOS 10.15.7, rgbif 3.5.2, CoordinateCleaner 2.0.18, tibble 3.0.4.

library(rgbif)
library(CoordinateCleaner)

dat = occ_search(scientificName = "Sorex alpinus", limit=250)$data
dat_df = as.data.frame(dat)
dat_df_no_subset = as.data.frame(dat)

dat = subset(dat, !is.na(decimalLatitude))
dat_df = subset(dat_df, !is.na(decimalLatitude))
dat_df_no_subset = dat_df_no_subset[!is.na(dat_df_no_subset$decimalLatitude),]

## all of the data in all 3 tables is identical, as expected
all(mapply(identical, dat, dat_df))
all(mapply(identical, dat, dat_df_no_subset))

cl = clean_coordinates(dat, lon="decimalLongitude", lat="decimalLatitude", tests="outliers")
cl_df = clean_coordinates(dat_df, lon="decimalLongitude", lat="decimalLatitude", tests="outliers")
cl_conv_late = clean_coordinates(as.data.frame(dat), lon="decimalLongitude", 
	lat="decimalLatitude", tests="outliers")
cl_no_subset = clean_coordinates(dat_df_no_subset, lon="decimalLongitude", 
	lat="decimalLatitude", tests="outliers")

## different records are flagged, but only if converting to data frame before using subset
which(! cl$.summary)
which(! cl_df$.summary)
which(! cl_conv_late$.summary)
which(! cl_no_subset$.summary)
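
One thing that may be worth checking, as an untested guess rather than anything confirmed in the CoordinateCleaner source: the column-wise identical() check above does not compare row names, and subset() on a data.frame keeps the original row names, so they stop matching row positions once a row is dropped. The sketch below uses made-up toy data (the `toy` objects are hypothetical stand-ins for the GBIF download) and base R only to show that mismatch, plus a reset that may serve as a workaround if row names turn out to be involved.

```r
# Toy data standing in for the GBIF download; the values are made up.
toy <- data.frame(
  decimalLongitude = c(10.1, NA, 10.3, 10.2, 95.0),
  decimalLatitude  = c(46.5, NA, 46.7, 46.6, 10.0)
)

# Same kind of subset as in the example above.
toy_sub <- subset(toy, !is.na(decimalLatitude))

# subset() on a data.frame keeps the original row names, so after dropping
# row 2 the names no longer agree with the row positions.
rownames(toy_sub)        # "1" "3" "4" "5"
seq_len(nrow(toy_sub))   # 1 2 3 4

# Resetting the row names makes names and positions agree again.
toy_reset <- toy_sub
rownames(toy_reset) <- NULL
rownames(toy_reset)      # "1" "2" "3" "4"
```

If the wrong-row flag disappears after running `rownames(dat_df) <- NULL` before clean_coordinates, that would point toward row names being used as positions somewhere; if it does not, this guess can be ruled out.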

ltalluto, Apr 14 '21 13:04