
DWCA Importer - Geographic Areas not imported

Open tmcelrath opened this issue 4 years ago • 9 comments

Can we use a combination of Country, State, and County to match to one or two default Geographic Areas in the TW Gazetteer?

Alternatively, we could have users select one gazetteer to match against.

This is mainly an issue when specimens have very little location data (e.g. "California") and their collecting event is imported as almost empty, since Geographic Area is not imported. In these cases, at the very least, if no geographic area is imported, whatever geographic areas are indicated in Country/State/County should be imported to verbatim Locality to create a slightly better Collecting Event.
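A minimal sketch of that fallback, in plain Ruby rather than the actual importer code. The helper name `fallback_verbatim_locality` and the idea of passing the row as a Hash keyed by Darwin Core term are assumptions for illustration only:

```ruby
# Hypothetical helper, not part of the TaxonWorks importer.
# Builds a fallback verbatim locality from Darwin Core geography terms
# when the row has no locality of its own.
GEOGRAPHY_TERMS = %w[country stateProvince county].freeze

def fallback_verbatim_locality(row)
  locality = row['locality'].to_s.strip
  return locality unless locality.empty?

  # Keep only the terms that are actually filled in, broadest first.
  parts = GEOGRAPHY_TERMS.map { |term| row[term].to_s.strip }.reject(&:empty?)
  parts.empty? ? nil : parts.join(', ')
end

fallback_verbatim_locality({ 'stateProvince' => 'California' })
```

With only "California" in the row, the collecting event would at least carry that string instead of being almost empty.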

The extent of Collecting Event information reported in DWCA table: image

Example "bad" collecting event created: image

tmcelrath avatar Oct 28 '20 04:10 tmcelrath

Is this done? Current imports don't seem like they match to Geoareas. @mjy @LocoDelAssembly

tmcelrath avatar Aug 10 '21 19:08 tmcelrath

No, it is still resolving through latitude/longitude.

Would this require something similar to the namespace mapper, to select which GeographicArea each chunk of text corresponds to?

LocoDelAssembly avatar Aug 10 '21 19:08 LocoDelAssembly

Maybe? I'd be fine with that personally, but it should definitely be an option, not required. Could have an option to "auto-create" if georeferences are included (solution for Brian's large dataset)? I've only got 12 areas in my latest dataset, and many of the ones I'm importing only have one or two geographic areas total so a selector would be really easy.

tmcelrath avatar Aug 10 '21 19:08 tmcelrath

What if we do something like this (image), where we just specify the Geographic Area ID from TaxonWorks?

tmcelrath avatar Aug 16 '22 15:08 tmcelrath

Shouldn't we use the global ID for the geographic area?

adrik29 avatar Aug 16 '22 15:08 adrik29

@adrik29 really wants this for a large dataset import

tmcelrath avatar Aug 16 '22 17:08 tmcelrath

@tmcelrath I sure do. It would be desirable to maximize the fields you can import. After going to all the trouble of preparing the foreign databases, we should at least be able to make it count.

adrik29 avatar Aug 16 '22 17:08 adrik29

Easier to implement (but still reasonable for the user) would be using globally unique IDs, possibly with the option to scope to a specific gazetteer (TDWG, NE, GADM).

LocoDelAssembly avatar Aug 16 '22 17:08 LocoDelAssembly

Works for me!

tmcelrath avatar Aug 16 '22 17:08 tmcelrath

@LocoDelAssembly Let's revisit this please. At minimum I think we need one of these options. For reference, here are some of the fields in question:

    # locationID: [Not mapped]
    # higherGeographyID: [Not mapped]
    # higherGeography: [Not mapped]
    # continent: [Not mapped]
    # waterBody: [Not mapped]
    # islandGroup: [Not mapped]
    # island: [Not mapped]
    # country: [Not mapped]
    # countryCode: [Not mapped]
    # stateProvince: [Not mapped]
    # county: [Not mapped]
    # municipality: [Not mapped]
    # locality: [Not mapped]

Potential options:

1 - @tmcelrath's suggestion for a custom Geographic area ID should be straightforward to implement, I hope. Maybe we could add a UI checkbox to turn that option (or any of the ones below) on/off, or something like a "Geographic area mode".

2 - I wonder if we can do a lookup option: if the user provides any of country/state/county and we have an exact match within geographic areas (GeographicArea.with_name_and_parent_names(['Champaign', 'Illinois', 'United States'])), we use that match. The fields in this case would be ['county', 'stateProvince', 'country'].compact. Feels like this could be the easiest to implement.

3 - One mode could be "higherGeographyID matches TaxonWorks GeographicArea ID"? This would be a one-to-one mapping.

4 - We could provide an option to do a post-import delayed job setup where GeographicArea is set by Georeference. Specific gazetteers could be supplied. I have the code for this type of lookup coming on the cached map branch.

5 - We could/should likely add an import mode that adds the fields above as data-attributes if selected.

6 - We could write a new task that seeks to assign a GeographicArea from some combination of attributes (again, possible "Mode" here). The scope can be limited to a CollectingEvent filter query. For example:

  • Match on Georeference
  • Match on DataAttributes
  • Match on ... ?
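To make option 2 concrete, here is a rough sketch in plain Ruby. The `AREAS` Hash is a made-up in-memory stand-in for the geographic areas table (the IDs are invented), and `names_for_lookup` / `match_area_id` are hypothetical names. In the importer the lookup itself would presumably go through something like the `GeographicArea.with_name_and_parent_names` call mentioned above, which this sketch does not try to reproduce:

```ruby
# Hypothetical stand-in for the geographic areas table: each key is an area's
# name followed by its parents' names, most specific first. IDs are made up.
AREAS = {
  ['Champaign', 'Illinois', 'United States'] => 101,
  ['Illinois', 'United States'] => 102,
  ['United States'] => 103
}.freeze

# Most specific term first, mirroring ['county', 'stateProvince', 'country'].compact
LOOKUP_TERMS = %w[county stateProvince country].freeze

# Collect whichever of the terms the row actually provides.
def names_for_lookup(row)
  LOOKUP_TERMS.map { |term| row[term] }.compact.map(&:strip).reject(&:empty?)
end

# Exact match only: returns an area id, or nil when nothing matches.
def match_area_id(row, areas = AREAS)
  names = names_for_lookup(row)
  return nil if names.empty?

  areas[names]
end

match_area_id({ 'stateProvince' => 'Illinois', 'country' => 'United States' })
```

Returning nil on anything short of an exact match keeps the mode conservative: rows that do not resolve are simply left without a GeographicArea rather than guessed at.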

Considerations

  • Validation times on/off during import (doing the spatial checks)
  • @jlpereira we could write a new task that batch matches Georeferenced records to a GeographicArea to assign them.

@bpescador, @ChrisGrinter, @rich-keller, @debpaul

mjy avatar Apr 07 '23 14:04 mjy

In almost all cases, number 2 above should work. It would be interesting to see how many of our entries are not an exact match from other gazetteers.

bpescador avatar Apr 07 '23 15:04 bpescador

Any of the above would work. I will say we need to make sure nothing is matched to geographic areas without shapes, e.g. https://sfg.taxonworks.org/geographic_areas/26421

I would support number 2 for sure but then either number 1 or 3 for datasets where the geoarea is all the same or there are a limited number of geoareas to set. E.g. most of the datasets I've done so far have 1 or 2 unique Geographic Areas in the whole dataset, and manually doing them pre-import would save massive amounts of time.
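As a sketch of the "no shapes, no match" rule, assuming each lookup candidate carries a flag saying whether it has a shape. The `Candidate` struct, its `has_shape` field, and both helper names are hypothetical and are not TaxonWorks model attributes:

```ruby
# Hypothetical candidate record; has_shape is an assumed flag, not a real column.
Candidate = Struct.new(:id, :name, :has_shape, keyword_init: true)

# Drop any candidate that has no shape attached.
def matchable(candidates)
  candidates.select(&:has_shape)
end

# Only accept a match when exactly one shaped candidate remains.
def pick_unambiguous(candidates)
  with_shape = matchable(candidates)
  with_shape.size == 1 ? with_shape.first : nil
end

candidates = [
  Candidate.new(id: 1, name: 'Illinois', has_shape: false),
  Candidate.new(id: 2, name: 'Illinois', has_shape: true)
]
pick_unambiguous(candidates)&.id
```

Whether an ambiguous result should be skipped, as here, or resolved by some preference order is a design choice left open.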

tmcelrath avatar Apr 07 '23 15:04 tmcelrath

Good call with the consideration for preference/requirement of shapes.

@LocoDelAssembly Maybe we need to have Redis caching to speed some aspects of the import up (i.e. once found don't hit DB but rather a memory store)?
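The "once found, don't hit the DB" idea can be sketched with a plain in-process Hash instead of Redis. This is a deliberately simpler swap-in to show the shape of it, and the `LookupCache` class is hypothetical. The block passed to `new` stands for whatever does the real database lookup:

```ruby
# Hypothetical in-memory cache; a Redis-backed store would replace @store.
class LookupCache
  attr_reader :misses

  def initialize(&lookup)
    @lookup = lookup   # block that performs the real (slow) lookup
    @store = {}
    @misses = 0
  end

  def fetch(names)
    key = names.map(&:to_s).freeze
    # key? rather than a truthiness check, so "not found" (nil) is cached too.
    return @store[key] if @store.key?(key)

    @misses += 1
    @store[key] = @lookup.call(names)
  end
end

cache = LookupCache.new { |names| names.join(' / ') } # stand-in for a DB query
cache.fetch(%w[Champaign Illinois])
cache.fetch(%w[Champaign Illinois]) # served from memory, no second lookup
```

Since many datasets only contain a handful of distinct areas, even a per-import cache like this would avoid most repeated queries.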

mjy avatar Apr 07 '23 15:04 mjy

With something like #2 (the lookup option), it seems like there should be an optional setting to specify which gazetteer(s) to search. (@LocoDelAssembly also mentioned scoping by gazetteer above.) A given project may want to standardize on a specific gazetteer for all their CEs. (In fact, specifying a standard gazetteer for a project would also simplify filtering by geographic area in the UI, where it is confusing to have multiple counties/states/countries returned from multiple gazetteers. But this is a different issue...) I also like the flexibility to map explicitly to a specific geographic area ID as @tmcelrath suggested.
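A small sketch of what the optional gazetteer setting could look like, assuming each candidate area records which gazetteer it came from. The `Area` struct, its `gazetteer` field, and `scope_to_gazetteers` are hypothetical. The gazetteer labels are the ones named earlier in the thread (TDWG, NE, GADM):

```ruby
# Hypothetical candidate record; gazetteer is an assumed label such as 'GADM'.
Area = Struct.new(:id, :name, :gazetteer, keyword_init: true)

# With no gazetteers selected the setting is off and every candidate passes.
def scope_to_gazetteers(candidates, gazetteers = nil)
  return candidates if gazetteers.nil? || gazetteers.empty?

  candidates.select { |area| gazetteers.include?(area.gazetteer) }
end

candidates = [
  Area.new(id: 1, name: 'Illinois', gazetteer: 'TDWG'),
  Area.new(id: 2, name: 'Illinois', gazetteer: 'NE'),
  Area.new(id: 3, name: 'Illinois', gazetteer: 'GADM')
]
scope_to_gazetteers(candidates, ['GADM']).map(&:id)
```

Scoping to one gazetteer also tends to turn an ambiguous name match into a single candidate, which fits the exact-match lookup option.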

rich-keller avatar Apr 11 '23 23:04 rich-keller