spocc icon indicating copy to clipboard operation
spocc copied to clipboard

not a code bug, but a data issue

Open tphilippi opened this issue 4 years ago • 9 comments

Whether using spocc or rgbif & ridigbio separately, the same occurrence records show up ~71m shifted in northern California.

Reproducible example:

library(spocc)
library(rgdal)      
# library(plotKML)
library(mapr)
gbifopts <- list(country = 'US') # search in United States only
occbounds <- c(-121.9347, 40.14134, -120.8895, 40.85288) # Lassen Volcanic NP + 30km
out <- occ(query = 'Rana cascadae', from = c('gbif','bison','inat','idigbio'), 
                  gbifopts = gbifopts, 
                  geometry = occbounds, limit = 5000)
dat <- occ2df(out)  
dat2 <- dat  # so all steps can be examined later
dat2$latitude <- as.numeric(dat2$latitude)
dat2$longitude <- as.numeric(dat2$longitude)
dat2 <- dat2[!is.na(dat2$latitude) & !is.na(dat2$longitude), ]
map_leaflet(dat2)
sessionInfo()

If you zoom in you will find pairs of occurrences with the same date and the idigbio location ~71m ENE of the gbif or other provider version. Note that both gbif and iDigBio assert that data from their APIs come as epsg:4326.

I'm putting this issue here to make others aware. I will continue to explore the discrepancy: is it the same in Florida and Alaska? can I find very narrow point locations so I can verify that the idigbio coordinates are the problem (as I suspect)? I've already reproduced this with rgbif and ridigbio, and can show that there is not a difference based on the museum behind the records in iDigBio. I can try to get the verbatim location info from the data sources to further compare.

Session Info
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mapr_0.4.0  rgdal_1.4-6 sp_1.3-1    spocc_1.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2         rvertnet_0.7.0     lubridate_1.7.4    lattice_0.20-38   
 [5] assertthat_0.2.1   digest_0.6.20      mime_0.7           R6_2.4.0          
 [9] plyr_1.8.4         spam_2.4-0         httr_1.4.1         rbison_0.8.0      
[13] ggplot2_3.2.1      pillar_1.4.2       rlang_0.4.0        lazyeval_0.2.2    
[17] curl_4.0           data.table_1.12.2  whisker_0.4        oai_0.3.0         
[21] urltools_1.7.3     stringr_1.4.0      foreign_0.8-71     htmlwidgets_1.3   
[25] triebeard_0.3.0    munsell_0.5.0      shiny_1.3.2        compiler_3.6.1    
[29] httpuv_1.5.2       pkgconfig_2.0.2    rgeos_0.5-2        htmltools_0.3.6   
[33] tidyselect_0.2.5   geoaxe_0.1.0       tibble_2.1.3       httpcode_0.2.0    
[37] rgbif_1.3.0        crayon_1.3.4       dplyr_0.8.3        later_0.8.0       
[41] crul_0.8.4         grid_3.6.1         rworldmap_1.3-6    jsonlite_1.6      
[45] xtable_1.8-5       gtable_0.3.0       magrittr_1.5       scales_1.0.0      
[49] stringi_1.4.3      mapproj_1.2.6      ridigbio_0.3.5     promises_1.0.1    
[53] rebird_1.0.0       leaflet_2.0.2      xml2_1.2.2         fortunes_1.5-5    
[57] RColorBrewer_1.1-2 tools_3.6.1        glue_1.3.1         purrr_0.3.2       
[61] maps_3.3.0         crosstalk_1.0.0    fields_10.0        yaml_2.2.0        
[65] colorspace_1.4-2   maptools_0.9-7     dotCall64_1.0-0   

tphilippi avatar Nov 13 '19 00:11 tphilippi

thanks for the report @tphilippi

I also see the problem.

Seems like for pairs where gbif and idibio disagree, gbif and bison data agree for the same point, which is nice:

filter(dat2, key == "f9784d28-6462-4175-bb2b-9a584a0aa1bf" | key == "1145396883") %>% data.frame
#>                         name longitude latitude    prov       date                                  key
#> 1 Rana cascadae Slater, 1939 -121.5609 40.35793    gbif 1924-06-20                           1145396883
#> 2              Rana cascadae -121.5609 40.35793   bison       <NA>                           1145396883
#> 3              rana cascadae -121.5601 40.35813 idigbio 1924-06-20 f9784d28-6462-4175-bb2b-9a584a0aa1bf

filter(dat2, key == "523cbddb-1f32-41e0-a5aa-94a25d004029" | key == "1145665190") %>% data.frame
#>                         name longitude latitude    prov       date                                  key
#> 1 Rana cascadae Slater, 1939 -121.5770 40.35226    gbif 1924-05-17                           1145665190
#> 2              Rana cascadae -121.5770 40.35226   bison       <NA>                           1145665190
#> 3              rana cascadae -121.5762 40.35247 idigbio 1924-05-17 523cbddb-1f32-41e0-a5aa-94a25d004029

If you look at the issues for GBIF occurrence 1145396883 - http://api.gbif.org/v1/occurrence/1145396883 - it has COORDINATE_ROUNDED and COORDINATE_REPROJECTED. Same for 1145665190: http://api.gbif.org/v1/occurrence/1145665190/

I imagine its that COORDINATE_REPROJECTED that makes it different from idigbio? For both of those verbatim data has NAD1927 while non-verbatim data has WGS84

sckott avatar Nov 13 '19 18:11 sckott

Right: the issue is that the idigbio results come with verbatim latitude longitude, which may or may not be WGS84 epsg:4326. And, as far as I can tell, the iDigBio API doesn't vend a variable for datum or CRS, even though their web page for an occurrence has that information. Rather, there is a cryptic flag (in the list of flags for each record) that appears to indicate "not WGS84 epsg:4326" but not what the correct datum/geoid/CRS is. I need to dig into their API documentation again.

I tracked the issue via the separate web interfaces: https://www.gbif.org/occurrence/1145397475 scroll down and you can see the original georeferencing in 2002 was to NAD27; gbif reprojected to WGS84. https://www.idigbio.org/portal/records/58999bd0-d35a-4bbf-9695-a51732807867 notes that the geodeticDatum is NAD27 and those coordinates are what both ridigbio::idig_search_records() and spocc::occ() return.

Looking at out$idigbio in the example code above, out$idigbio$data$flags does note "geopoint_datum_error" for most of these records.

I'm going to be parsing the flags from idigbio records delivered as Darwin Core and probably via ridigbio::idig_search_records(). I'll look at that part of the spocc code to see if it will be easy to pull out. I already parse those list-valued variables in the iDigBio DwC-A multimedia.csv files.

Perhaps the man page or vignette should not that users need to check for CRS issues if they need accuracy within a few hundred meters?

As for USGS Bison: Bison is almost certainly harvesting that record from gbif, as that's what it does and the Bison folks are the US node of gbif. I don't trust any location info in Bison, as it isn't easy to see where it is coming from. If you look at the map of bison records, you'll also see one occurrence out there at 40.486264 -121.400602. That's the centroid of Lassen Volcanic National Park. Bison can't harvest our protected individual occurrence records behind NPSpecies, so it harvests the park-level species lists and georeferences them to the park centroids. Once our NPS IT folks get authentication working for our protected data, and we get the full tables of what is considered protected by each park populated, I'm attempting to push NPS occurrence records out to gbif, either directly or via USGS/Bison & Stinger.

tphilippi avatar Nov 13 '19 19:11 tphilippi

Perhaps the man page or vignette should note that users need to check for CRS issues if they need accuracy within a few hundred meters?

  • [ ] inform users about CRS issues

sckott avatar Nov 14 '19 18:11 sckott

interesting about BISON, didn't know that. Anything we should add to the docs about BISON?

sckott avatar Nov 14 '19 18:11 sckott

Issue #38 on ridigbio has a couple of workarounds. Passing a vector of field names via the fields = parameter rather than using the default fields = "all" can get the geodeticDatum value. The field name is "data.dwc:geodeticDatum" which isn't obvious from the documentation. Despite the dwc definition & recommendation of this field as controlled vocabulary of epsg:, it takes non-controlled values for some older records. But, it is easy to first check if all values are either "WGS84" or "EPSG:4326", then if not look at what the odd values are.

I only used spocc to test my gbif vs idigbio offset, so don't have time to code & test a solution, but it may be simple enough to change how spocc calls ridigbio::idig_get_records() to something like:

fieldlist <- c('uuid',"canonicalname", "basisofrecord", "catalognumber",
               "collectioncode", "fieldnumber", "occurrenceid", "recordset",
               "kingdom", "order", "class", "family", "genus", 
               "specificepithet", "infraspecificepithet",
               "scientificname", "taxonrank", "taxonid", "taxonomicstatus", "typestatus",
               "collectionid", "collectionname", "collector",
               "institutioncode", "institutionid", "institutionname",
               "datemodified",
               "datecollected" , "geopoint", "coordinateuncertainty",
               "locality", "verbatimlocality", "data.dwc:geodeticDatum")

test <- idig_search_records(rq = list(scientificname = "Rana cascadae",
                                      geopoint = list(type="geo_bounding_box", 
                                                      top_left=list(lat=40.85288, lon=-121.9347), 
                                                      bottom_right=list(lat=-40.14134, lon= -120.8895)
                                                      )
                                      ),
                              fields = fieldlist)

That fieldlist is the fields I get back from fields = "all" plus the datum. Note that fields with all missing values are dropped, so fields = "all" might have other fields I haven't seen yet in my use, and I think that is on their API server side, not in the R code.

As for any explicit warnings about BISON, I don't think it's needed. eBird is based on picklists of named birding sites, so ~80% of the observations in Cabrillo National Monument are the same spot in bay 2 of the parking lot: the park centroid. The rOpenSci CoordinateCleaner package might flag such BISON records, scrubr tried to do it for political polygons. My de-duplicaton and data QC flagging code tries to do "reverse georeferencing", but that is hard because different polygons datasets give different values for centroids, and for National Parks, the boundaries slightly change over time.

tphilippi avatar Nov 18 '19 18:11 tphilippi

Hello from the iDigBio (and new maintainer of ridigbio) side of things. I'd like to look things over more before I comment on specifics, but I wanted to chime in and say that I'm interested in working together to see how we can make things better.

roncanepa avatar Nov 20 '19 14:11 roncanepa

we hardcode the fields="all" right now https://github.com/ropensci/spocc/blob/master/R/plugins.r#L426 We can let users set fields on their own, but we'd need to make sure certain set of fields are included to make sure the pipeline downstream works (at least taxon name, date, lat/lon) - but that's easy.

in spocc:

fieldlist <- c('uuid',"canonicalname", "basisofrecord", "catalognumber",
   "collectioncode", "fieldnumber", "occurrenceid", "recordset",
   "kingdom", "order", "class", "family", "genus", 
   "specificepithet", "infraspecificepithet",
   "scientificname", "taxonrank", "taxonid", "taxonomicstatus", "typestatus",
   "collectionid", "collectionname", "collector",
   "institutioncode", "institutionid", "institutionname",
   "datemodified", "datecollected" , "geopoint", "coordinateuncertainty",
   "locality", "verbatimlocality", "data.dwc:geodeticDatum")

x <- occ(query = "Rana cascadae", from = "idigbio", limit = 300,
    geometry = c(-121.9347, -40.14134, -120.8895, 40.85288),
    idigbioopts = list(fields = fieldlist))

Which gives 4 unique values, so pretty easy to handle these

unique(x$idigbio$data$Rana_cascadae$`data.dwc:geodeticDatum`)
#> [1] "NAD27" "North American Datum 1927"  NA "World Geodetic System 1984" "WGS84"

What would we do with these values once we have them in spocc? Or do you leave that up to the user?

sckott avatar Nov 20 '19 22:11 sckott

@sckott As I explore further my goal has gone from using data.dwc:geodeticDatum as a check on lon lat geoPoint coordinates to a flag to simply informing the user; now I'm not sure it will be informative.

Currently, lon lat values are given for the geopoint without any CRS or Datum in the data or the metadata. That's very bad practice. Most but not geoPoints from idigbio are WGS84 (which gbif and other sources use). As far as I can tell, this is an issue in iDigBio (due to their georeferencing starting 20 years ago so they were bleeding edge), not an issue caused by spocc, but spocc returning a bunch of records in a silent mixture of CRS will lead to problems and bad science at some point. I'll try to follow up with @roncanepa of iDigBio. There's some slight probability that I happened to trip over a couple of dozen errors in coordinate processing out of the 43M records, and all are meant to be in WGS84, and the web page for each individual record shows the original, raw values, not the same as provided by the API.

For my Rana cascadae example, based on wetlands & streams in imagery as well as where gbif puts the same occurrence, some of the geopoints with dwc:geodeticDatum NAD27 or North American Datum 1927 are really in NAD27 epsg:4267, while some are actually in WGS84. dwc:geodeticDatum is for the raw, original lon lat values (e.g., from the 2002 georeferencing of a 1924 specimen), and at least some coordinate values in the API data have been transformed to WGS84. Therefore, data.dwc:geodeticDatum isn't a silver bullet, or even flag for which values are suspect, or even the Datum for the lon lat values returned.

Another part of the information is the flags that come back as a list x$idigbio$data$Rana_cascadae$flags. They're also in the DwC-A occurrence.csv from the iDigBio portal as field idigbio:flags. I'm looking at several values: "geopoint_datum_missing", "geopoint_datum_error", "rev_geocode_failure". I have code that parses those out to a set of 70 logical values. Would you like that as a stand-alone function (taking the data object x$idigbio$data$Rana_cascadae and returning a data.frame with uuid and the 70 logical flags, or just those with at least 1 TRUE in that dataset)?

It turns out that my R. cascadae use case is a bad example. As an endangered species, the coordinate uncertainties are very inflated (conservative) for the old records georeferenced from named places like lakes or ponds, and new iNaturalist records have locations fuzzed and a coordinate uncertainty of 27km in gbif:

summary(x$gbif$data$Rana_cascadae$coordinateUncertaintyInMeters)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   17.7   531.9  1295.5  1857.2  1295.5 27969.0      56 
>   summary(x$idigbio$data$Rana_cascadae$coordinateuncertainty)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   17.7   531.9  1295.5  1528.5  1295.5 13805.0      54 

Most of those lon lat coordinates I've checked land in the correct pond, so within 100-250m, hence my description of uncertainties 1km and greater as "inflated".

This particular use case is fun: R. cascadae appears to be extirpated from Lassen NP; the park needs to identify locations to reintroduce captive bred frogs. Historic records can identify what was good wetland habitat, both explicit locations and SDM to identify similar un-collected wetlands. Then an eDNA survey will check for relic populations, and maybe for predatory leeches & fungal diseases, identifying good places to reintroduce. My lesson learned is that lon lat coordinates in these extant data sources aren't suitable for that use in identifying wetlands, but I can provide the verbatim location descriptions to the park. [The more recent rangewide USGS surveys (not in gbif) only provide location descriptions in the public data, not coordinates, but the park itself can obtain the lat lon from USGS.]

As I explore these data for other species and other management issues, I'll get a better sense of whether this issue is specific to sensitive species, or if most of the data are only usable at scales of 1-2km or broader.

tphilippi avatar Nov 21 '19 18:11 tphilippi

I have code that parses those out to a set of 70 logical values. Would you like that as a stand-alone function ...?

I'd definitely be in support of exposing data quality issues in outputs so users can get to them more easily. I guess though that it'd be good to have those data in a common as possible format across data sources (for those data sources that have data quality flags). those that have them that I know of: gbif, idigbio (i'm sure theres others, just don't know off the top). In terms of how that's done, i'm not sure. I guess it'd be ideal if the data flags were in a column or more than one that will fit into a data.frame workflow to make it easy to filter data downstream similar to what we do in occ_issues

Thanks for the details on R. cascadae, very interesting stuff.

sckott avatar Nov 21 '19 19:11 sckott