
Task - Import DWCA Checklist

Open mjy opened this issue 7 years ago • 19 comments

mjy avatar Jan 08 '18 16:01 mjy

@LocoDelAssembly This is likely your second priority. At its heart we're looking for a parent/child format with an extension column for NOMEN relationship (and possibly an extension column for NOMEN status, but this is lower priority).

Use the TDWG headers, but they should semantically be:

id parent_id name author_year nomen_relationship

The batch loaders are all derived from lib/batch_loader code. This will be a taxon_names batch loader.

I'm sure there will be more questions. We're looking to handle maybe up to 2500 names at a time. We may need delayed job or similar, but the primary goal is to enable the batch load; secondary is to make it performant or to give reasonable feedback for many names (e.g. contact the user when done, etc.).
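As an illustration only, here is a minimal Python sketch of the parent/child format described above. It is not the TaxonWorks batch loader (that code lives in lib/batch_loader); the function names `read_rows` and `parents_first`, and the choice of a tab-delimited input, are assumptions made for the example. It shows one way to order rows so that every parent is handled before its children, which a loader would need if the input is unsorted.

```python
import csv
import io

# Column names taken from the semantic header list in this issue.
COLUMNS = ["id", "parent_id", "name", "author_year", "nomen_relationship"]


def read_rows(text):
    """Parse tab-delimited text (an assumed delimiter) into dicts keyed by COLUMNS."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{col: (row.get(col) or "").strip() for col in COLUMNS} for row in reader]


def parents_first(rows):
    """Return rows ordered so that each parent precedes its children.

    Rows whose parent_id is empty or not present in the file are treated
    as roots. A cycle in the parent links raises ValueError.
    """
    by_id = {row["id"]: row for row in rows}
    ordered = []
    done = set()

    def visit(row, trail):
        if row["id"] in done:
            return
        if row["id"] in trail:
            raise ValueError(f"cycle in parent links at id {row['id']!r}")
        parent = by_id.get(row["parent_id"])
        if parent is not None:
            visit(parent, trail | {row["id"]})
        done.add(row["id"])
        ordered.append(row)

    for row in rows:
        visit(row, frozenset())
    return ordered
```

The recursion depth is bounded by the depth of the hierarchy rather than the number of rows, so a file of a few thousand names should not be a problem for this approach.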

mjy avatar Apr 06 '19 18:04 mjy

The VASCAN example dataset

Fields are here, in two blocks of priority. See also http://rs.tdwg.org/dwc/terms/#taxon.

Base

id
taxonID
acceptedNameUsageID
parentNameUsageID
nameAccordingToID
scientificName
class
order
family
genus
subgenus
specificEpithet
infraspecificEpithet
taxonRank
scientificNameAuthorship

Secondary priority

acceptedNameUsage
parentNameUsage
nameAccordingTo
higherClassification
taxonomicStatus
modified
license
bibliographicCitation
references
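A small Python sketch of how the two priority blocks above might be used to check an incoming file's headers before import. The function name `check_headers` and the idea of reporting missing Base fields separately are assumptions for illustration; the issue does not say whether every Base field must be present.

```python
# Field lists copied from the two blocks in this comment.
BASE_FIELDS = [
    "id", "taxonID", "acceptedNameUsageID", "parentNameUsageID",
    "nameAccordingToID", "scientificName", "class", "order", "family",
    "genus", "subgenus", "specificEpithet", "infraspecificEpithet",
    "taxonRank", "scientificNameAuthorship",
]

SECONDARY_FIELDS = [
    "acceptedNameUsage", "parentNameUsage", "nameAccordingTo",
    "higherClassification", "taxonomicStatus", "modified", "license",
    "bibliographicCitation", "references",
]


def check_headers(headers):
    """Compare a file's column headers against the two priority blocks."""
    present = set(headers)
    known = set(BASE_FIELDS) | set(SECONDARY_FIELDS)
    return {
        "missing_base": [f for f in BASE_FIELDS if f not in present],
        "missing_secondary": [f for f in SECONDARY_FIELDS if f not in present],
        "unrecognized": [h for h in headers if h not in known],
    }
```

A pre-import screen could show the `missing_base` list to the user before any names are written.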

In addition we need several modifications that allow us to integrate NOMEN data, these are my suggestions:

  • When I see a NOMEN URI in http://rs.tdwg.org/dwc/terms/#dwc:nomenclaturalStatus, then I add a corresponding TaxonNameClassification in TaxonWorks.
  • A secondary consideration: if I have an ambiguous (i.e. non-URI-based) status in nomenclaturalStatus, then I could try to map it to an unambiguous (NOMEN) class. We may need to set up a YAML or other mapping file as a constant in the NOMEN repo to map text strings to proposed NOMEN classes.
  • This is less clear, but: when I see a NOMEN URI in taxonomicStatus, then I create a TaxonNameRelationship in TaxonWorks. I must always have an acceptedNameUsageID to use as the object in the relationship.
  • Similar secondary considerations exist for these as for nomenclaturalStatus
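The suggestions above can be sketched roughly as follows, in Python for illustration. Everything here is hypothetical: the URI prefix in `NOMEN_PREFIX` is an assumption about how NOMEN URIs look, the entries in `TEXT_TO_NOMEN` use placeholder identifiers rather than real NOMEN classes, and `resolve_status` and `plan_row` are made-up names. The sketch only plans which TaxonWorks records a row would produce; it does not create anything.

```python
# Assumed prefix for NOMEN URIs; adjust to whatever the NOMEN repo publishes.
NOMEN_PREFIX = "http://purl.obolibrary.org/obo/NOMEN_"

# Hypothetical text-to-class mapping with placeholder identifiers.
# In the proposal above this would live in a YAML or similar file.
TEXT_TO_NOMEN = {
    "nomen nudum": NOMEN_PREFIX + "PLACEHOLDER_A",
    "synonym": NOMEN_PREFIX + "PLACEHOLDER_B",
}


def resolve_status(value):
    """Return a NOMEN URI for a status value, or None if it cannot be resolved."""
    value = (value or "").strip()
    if value.startswith(NOMEN_PREFIX):
        return value  # already an unambiguous URI
    return TEXT_TO_NOMEN.get(value.lower())  # fall back to the text mapping


def plan_row(row):
    """List the records one checklist row would produce."""
    actions = []
    classification = resolve_status(row.get("nomenclaturalStatus"))
    if classification:
        actions.append(("TaxonNameClassification", classification))
    relationship = resolve_status(row.get("taxonomicStatus"))
    if relationship:
        accepted = (row.get("acceptedNameUsageID") or "").strip()
        if not accepted:
            # The relationship needs an object, so refuse the row without one.
            raise ValueError("taxonomicStatus given without acceptedNameUsageID")
        actions.append(("TaxonNameRelationship", relationship, accepted))
    return actions
```

Unresolvable text statuses are silently skipped in this sketch; a real importer would more likely report them back to the user.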

mjy avatar Jul 10 '19 17:07 mjy

@LocoDelAssembly I'm going to throw other considerations here as they come to mind:

  • [ ] User provided IDs could (should) be added as local identifiers if user selects a namespace on import
  • We might need to add delayed-job-like features (see the gem) to the batch load model for longer-running jobs. Alternatively, we can limit the load to, say, 500-1000 lines at a time.
  • We might consider various pre-import options.
    • For example turning on/off comparisons to existing data rather than just assuming everything is new. In the latter case we can just write everything, even if duplicate.
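A rough Python sketch of two of the considerations above: limiting the load to a fixed number of lines at a time, and a pre-import switch for comparing against existing data. The names `batches`, `plan_import`, and `compare_existing` are hypothetical, and matching on the bare name string is a simplifying assumption; real duplicate detection would need more than that.

```python
from itertools import islice


def batches(rows, size=500):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def plan_import(rows, existing_names, compare_existing=True, batch_size=500):
    """Split rows into batches, optionally dropping names that already exist.

    With compare_existing=False every row is kept, duplicates included,
    which matches the "assume everything is new" option.
    """
    plan = []
    for chunk in batches(rows, batch_size):
        if compare_existing:
            chunk = [row for row in chunk if row["name"] not in existing_names]
        plan.append(chunk)
    return plan
```

Each batch in the returned plan could then be written in its own transaction or handed to a background job.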

mjy avatar Jul 11 '19 16:07 mjy

What is the status of the issue?

cgendreau avatar Dec 05 '19 21:12 cgendreau

@LocoDelAssembly Can you chime in here with where you are at for the Specimen side of things?

mjy avatar Dec 05 '19 21:12 mjy

I can import collection objects with their OTU+taxon names and geo reference. I hope to put it all together soon, after merging rails6 and preferably biodiversity4 too.

LocoDelAssembly avatar Dec 06 '19 13:12 LocoDelAssembly

@LocoDelAssembly @cgendreau is in part with the folks from Vascan. I believe one of your test datasets for the code you are writing is the Vascan dataset linked above?

@cgendreau if indeed that dataset/format is of primary interest to you let us know, otherwise we'll also focus on the generic checklist format as well.

mjy avatar Dec 06 '19 14:12 mjy

For checklist yes, I worked with the dataset above.

LocoDelAssembly avatar Dec 06 '19 16:12 LocoDelAssembly

Yes, "The VASCAN example dataset" is what I was looking for. In other words, could we load it in TaxonWorks and start curating it?

cgendreau avatar Dec 06 '19 16:12 cgendreau

I'll leave it to @LocoDelAssembly to comment. I will say we're literally today hoping to move to Rails6, and I think the Biodiversity4 PR is OK too.

@LocoDelAssembly It could be that if your code works as a Rake task, rather than integrated, we could get their data into a sandbox for them to play with? I'd really like to support their group.

mjy avatar Dec 06 '19 16:12 mjy

@cgendreau Spoke with @LocoDelAssembly et al. last week; we're targeting ~1 month for the importer. Sorry for the delays!

mjy avatar Jan 16 '20 19:01 mjy

@cgendreau FYI, if it's still of interest, we have people actively testing the importer. Any feedback would be welcome at this point. We have weekly informal digitization meetings whose current focus includes this issue, among others. Ping me if you want more information on the logistics of joining those sessions.

mjy avatar Sep 25 '20 15:09 mjy

@LocoDelAssembly in that light, perhaps you can throw the VASCAN dataset up on sandworm?

mjy avatar Sep 25 '20 15:09 mjy

It is always of interest 👍 @CaroleSinou

cgendreau avatar Sep 25 '20 19:09 cgendreau

This will accumulate questions/answers:

https://github.com/SpeciesFileGroup/taxonworks_doc/issues/29

mjy avatar Sep 25 '20 19:09 mjy

@cgendreau Further note that we've loaded the VASCAN dataset into the sandbox where we are testing the importer. Shoot me an email if you want access details etc. etc.

mjy avatar Oct 13 '20 14:10 mjy

is it done?

proceps avatar Aug 06 '21 19:08 proceps

@LocoDelAssembly @LordFlashmeow can you re-summarize what we have available for the checklist format importer that is hidden away in the occurrence UI? Should we make the existing functionality more readily available, close this, and target specific issues like #2737?

mjy avatar Jun 18 '22 15:06 mjy