taxonworks
DWCA Importer - Geographic Areas not imported
Can we use a combination of Country, State, and County to match against one or two default Geographic Areas in the TW gazetteer?
Alternatively, we could have the user select one gazetteer to match against.
This is mainly an issue when specimens have extremely little location data (e.g. "California") and their collecting event is imported almost empty, because Geographic Area is not imported. In these cases, at the very least, if no geographic area is imported, then whatever geographic areas are given in "Country/State/County" should be imported into Verbatim Locality to create a slightly better Collecting Event.
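A minimal sketch of that fallback, assuming each DwC-A row is available as a Hash keyed by Darwin Core term. The method names here are hypothetical and are not part of the TaxonWorks importer; the sketch only shows the string-building step.

```ruby
# Hypothetical helper, not TaxonWorks code.
# Builds a fallback locality string from higher-geography terms,
# ordered broadest to narrowest. Returns nil when none are present.
def fallback_verbatim_locality(row)
  parts = %w[country stateProvince county]
    .map { |term| row[term].to_s.strip }
    .reject(&:empty?)
  parts.empty? ? nil : parts.join(', ')
end

# Keeps the row's own locality when it has one; otherwise uses the fallback.
def verbatim_locality_for(row)
  locality = row['locality'].to_s.strip
  locality.empty? ? fallback_verbatim_locality(row) : locality
end

row = { 'stateProvince' => 'California', 'locality' => '' }
puts verbatim_locality_for(row) # prints "California"
```

With this, a record whose only location data is a state would still end up with a non-empty verbatim locality.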
The extent of Collecting Event information reported in DWCA table:
Example "bad" collecting event created:
Is this done? Current imports don't seem like they match to Geoareas. @mjy @LocoDelAssembly
No, it is still resolving through latitude/longitude.
Would this require something similar to the namespace mapper to select which GeographicArea is each chunk of text?
Maybe? I'd be fine with that personally, but it should definitely be an option, not required. Could have an option to "auto-create" if georeferences are included (solution for Brian's large dataset)? I've only got 12 areas in my latest dataset, and many of the ones I'm importing only have one or two geographic areas total so a selector would be really easy.
What if we do something like this?
Where we just specify the Geographic Area ID from Taxonworks?
Shouldn't we use Global id for the geographic area?
@adrik29 really wants this for a large dataset import
@tmcelrath -- I sure do. It would be desirable to maximize the fields you can import: once we have gone to all the trouble of preparing the foreign databases, we can at least make it count.
Easier to implement (but at the same time reasonable for the user) would be using globally unique IDs, possibly with the option to scope to a specific gazetteer (TDWG, NE, GADM).
Works for me!
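For illustration, a rough sketch of accepting either a plain numeric ID or a global ID string in an import column. It assumes the Rails GlobalID URI shape (gid://app/Model/id); the app name "example-app" and the method name are placeholders, not TaxonWorks specifics.

```ruby
# Hypothetical helper, not TaxonWorks code.
# Accepts "26421" or "gid://<app>/GeographicArea/26421" and returns the
# integer ID, or nil when the value is neither.
def geographic_area_id_from(value)
  text = value.to_s.strip
  return Integer(text, 10) if text.match?(/\A\d+\z/)

  m = %r{\Agid://([^/]+)/([^/]+)/([^/?]+)\z}.match(text)
  return nil unless m && m[2] == 'GeographicArea' && m[3].match?(/\A\d+\z/)

  Integer(m[3], 10)
end

puts geographic_area_id_from('gid://example-app/GeographicArea/26421') # prints 26421
```

Rejecting global IDs that point at another model keeps a mistyped column from silently resolving to the wrong record.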
@LocoDelAssembly Let's revisit this please. At minimum I think we need one of these options. For reference, here are (some of) the fields in question:
# locationID: [Not mapped]
# higherGeographyID: [Not mapped]
# higherGeography: [Not mapped]
# continent: [Not mapped]
# waterBody: [Not mapped]
# islandGroup: [Not mapped]
# island: [Not mapped]
# country: [Not mapped]
# countryCode: [Not mapped]
# stateProvince: [Not mapped]
# county: [Not mapped]
# municipality: [Not mapped]
# locality: [Not mapped]
Potential options:
1 - @tmcelrath's suggestion for a custom Geographic Area ID should be straightforward to implement, I hope. Maybe we could add a UI checkbox to turn that option (or any of those below) on/off, or something like a "Geographic area mode".
2 - I wonder if we can do a lookup option: if the user provides any of country/state/county and we have an exact match within geographic areas (GeographicArea.with_name_and_parent_names(['Champaign', 'Illinois', 'United States'])). In this case the fields used would be ['county', 'stateProvince', 'country'].compact. Feels like this could be the easiest to implement.
3 - One mode could be "higherGeographyID matches TaxonWorks GeographicArea ID"? This would be a one-to-one mapping.
4 - We could provide an option to do a post-import delayed job setup where GeographicArea is set by Georeference. Specific gazetteers could be supplied. I have the code for this type of lookup coming on the cached map branch.
5 - We could/should likely add an import mode that adds the fields above as data-attributes if selected.
6 - We could write a new task that seeks to assign a GeographicArea from some combination of attributes (again, possible "Mode" here). The scope can be limited to a CollectingEvent filter query. For example:
- Match on Georeference
- Match on DataAttributes
- Match on ... ?
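To make option 2 concrete, here is a rough, self-contained sketch of exact matching on a narrowest-to-broadest name chain. It does not call GeographicArea.with_name_and_parent_names and does not claim to reproduce its behavior; areas are plain Hashes standing in for records, and all method names are hypothetical.

```ruby
# Hypothetical sketch, not TaxonWorks code.
# Names present in the row, narrowest first; blank values are dropped.
def lookup_names(row)
  %w[county stateProvince country]
    .map { |term| row[term].to_s.strip }
    .reject(&:empty?)
end

# True when the area's name and its ancestors' names equal `names` in order.
def matches_name_chain?(area, names)
  names.each do |name|
    return false if area.nil? || area[:name] != name
    area = area[:parent]
  end
  true
end

# Returns the single exact match; nil when nothing or more than one matches.
def resolve_area(row, areas)
  names = lookup_names(row)
  return nil if names.empty?

  hits = areas.select { |area| matches_name_chain?(area, names) }
  hits.size == 1 ? hits.first : nil
end

usa   = { id: 1, name: 'United States', parent: nil }
il    = { id: 2, name: 'Illinois', parent: usa }
champ = { id: 3, name: 'Champaign', parent: il }
row = { 'county' => 'Champaign', 'stateProvince' => 'Illinois', 'country' => 'United States' }
puts resolve_area(row, [usa, il, champ])[:id] # prints 3
```

Returning nil on ambiguous matches (for example, a county name shared by two states when no state is given) leaves those rows for manual mapping instead of guessing.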
Considerations
- Validation times on/off during import (doing the spatial checks)
- @jlpereira we could write a new task that batch matches Georeferenced records to a GeographicArea to assign them.
@bpescador, @ChrisGrinter, @rich-keller, @debpaul
In almost all cases, number 2 above should work. It would be interesting to see how many of our entries are not an exact match from other gazetteers.
Any of the above would work. I will say we need to make sure nothing is matched to geographic areas without shapes, e.g. https://sfg.taxonworks.org/geographic_areas/26421
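One way the shape requirement could look, as a sketch only: candidates are plain Hashes with a hypothetical :has_shape flag, since how the real model exposes "has a shape" is not stated in this thread.

```ruby
# Hypothetical sketch, not TaxonWorks code.
# Drops shapeless candidates, then accepts the result only if exactly one
# candidate remains.
def pick_area_with_shape(candidates)
  shaped = candidates.select { |area| area[:has_shape] }
  shaped.size == 1 ? shaped.first : nil
end

candidates = [
  { id: 10, name: 'Illinois', has_shape: false },
  { id: 11, name: 'Illinois', has_shape: true }
]
puts pick_area_with_shape(candidates)[:id] # prints 11
```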
I would support number 2 for sure but then either number 1 or 3 for datasets where the geoarea is all the same or there are a limited number of geoareas to set. E.g. most of the datasets I've done so far have 1 or 2 unique Geographic Areas in the whole dataset, and manually doing them pre-import would save massive amounts of time.
Good call with the consideration for preference/requirement of shapes.
@LocoDelAssembly Maybe we need to have Redis caching to speed some aspects of the import up (i.e. once found don't hit DB but rather a memory store)?
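As a sketch of the "once found, don't hit the DB" idea, here is an in-process memoizing wrapper. It uses a plain Hash rather than Redis, and the resolver lambda is a stand-in for whatever lookup the importer would actually perform; both are assumptions for illustration.

```ruby
# Hypothetical sketch, not TaxonWorks code.
# Wraps a resolver so each distinct name chain is resolved only once.
# Misses (nil results) are cached too, so repeated unknown names do not
# trigger repeated lookups.
def memoize_lookup(resolver)
  store = {}
  lambda do |names|
    key = names.map(&:to_s)
    store[key] = resolver.call(names) unless store.key?(key)
    store[key]
  end
end

db_hits = 0
resolver = lambda do |names|
  db_hits += 1 # stands in for a database query
  names.first == 'Champaign' ? 3 : nil
end

lookup = memoize_lookup(resolver)
3.times { lookup.call(['Champaign', 'Illinois', 'United States']) }
puts db_hits # prints 1
```

For a dataset with only a handful of distinct areas, this reduces the number of lookups from one per row to one per distinct area. A shared store such as Redis would only matter if several workers process the same import.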
With something like #2 (the lookup option), it seems like there should be an optional setting to specify which gazetteer(s) to search. (@LocoDelAssembly also mentioned scoping by gazetteer above.) A given project may want to standardize on a specific gazetteer for all their CEs. (In fact, specifying a standard gazetteer for a project would also simplify filtering by geographic area in the UI, where it is confusing to have multiple counties/states/countries returned from multiple gazetteers. But this is a different issue...) I also like the flexibility to map explicitly to a specific geographic area ID as @tmcelrath suggested.
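A small sketch of the optional gazetteer setting, assuming each candidate area carries a label for the gazetteer it came from. The :gazetteer key and the method name are hypothetical; the labels are the ones mentioned earlier in the thread (TDWG, NE, GADM).

```ruby
# Hypothetical sketch, not TaxonWorks code.
# Restricts candidates to the selected gazetteers. An empty or nil
# selection means "no restriction", which keeps the setting optional.
def scope_to_gazetteers(areas, gazetteers)
  return areas if gazetteers.nil? || gazetteers.empty?

  wanted = gazetteers.map { |g| g.to_s.upcase }
  areas.select { |area| wanted.include?(area[:gazetteer].to_s.upcase) }
end

areas = [
  { id: 1, name: 'Illinois', gazetteer: 'TDWG' },
  { id: 2, name: 'Illinois', gazetteer: 'GADM' },
  { id: 3, name: 'Illinois', gazetteer: 'NE' }
]
puts scope_to_gazetteers(areas, ['gadm']).map { |a| a[:id] }.inspect # prints [2]
```

Applying the scope before the name lookup would also remove most of the duplicate county/state/country hits that come from overlapping gazetteers.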