taxonworks
Task - Import DWCA Checklist
@LocoDelAssembly This is likely your second priority. At its heart we're looking for a parent/child format with an extension column for NOMEN relationship (and possibly an extension column for NOMEN status... but this is lower priority).
Use the TDWG headers, but they should semantically be:
| id | parent_id | name | author_year | nomen_relationship |
|---|---|---|---|---|
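A parent/child format is easiest to load when every parent row is processed before its children. Below is a minimal sketch of that ordering step, assuming rows are already keyed by the semantic column names above. The function name `order_parents_first` is hypothetical, and treating a missing parent as a root is an assumption (such parents may already exist in the project).

```python
from collections import defaultdict, deque


def order_parents_first(rows):
    """Return rows ordered so that each parent precedes its children.

    Assumptions (not from the issue): ids are unique within a batch, and a
    row whose parent_id is empty or absent from the batch is a root.
    """
    by_id = {row["id"]: row for row in rows}
    if len(by_id) != len(rows):
        raise ValueError("duplicate id in batch")

    children = defaultdict(list)
    roots = []
    for row in rows:
        parent_id = row.get("parent_id")
        if parent_id and parent_id in by_id and parent_id != row["id"]:
            children[parent_id].append(row)
        else:
            roots.append(row)

    # Breadth-first walk from the roots: a row is emitted only after its parent.
    ordered = []
    queue = deque(roots)
    while queue:
        row = queue.popleft()
        ordered.append(row)
        queue.extend(children[row["id"]])

    # Rows never reached from a root must be part of a parent_id cycle.
    if len(ordered) != len(rows):
        raise ValueError("cycle detected in parent_id references")
    return ordered
```

Rejecting cycles up front means a malformed file fails before anything is written, rather than halfway through a load.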
The batch loaders are all derived from lib/batch_loader code. This will be a taxon_names batch loader.
I'm sure there will be more questions. We're looking to handle maybe up to 2500 names at a time. We may need delayed job or similar, but the primary goal is to enable the batch load; the secondary goal is to make it performant or to give reasonable feedback for many names (e.g. contact user when done, etc.).
Fields are here, in two blocks of priority. See also http://rs.tdwg.org/dwc/terms/#taxon.
Base
id
taxonID
acceptedNameUsageID
parentNameUsageID
nameAccordingToID
scientificName
class
order
family
genus
subgenus
specificEpithet
infraspecificEpithet
taxonRank
scientificNameAuthorship
Secondary priority
acceptedNameUsage
parentNameUsage
nameAccordingTo
higherClassification
taxonomicStatus
modified
license
bibliographicCitation
references
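To connect the TDWG headers listed above to the semantic columns (id, parent_id, name, author_year, nomen_relationship), here is a rough sketch of reading a tab-delimited checklist file. The mapping itself is an assumption about which Darwin Core term fills which semantic column, not something settled in this issue. `SEMANTIC_COLUMNS` and `read_checklist` are hypothetical names.

```python
import csv
import io

# Assumed mapping from Darwin Core terms to the semantic columns.
# In particular, taxonomicStatus -> nomen_relationship is a guess.
SEMANTIC_COLUMNS = {
    "taxonID": "id",
    "parentNameUsageID": "parent_id",
    "scientificName": "name",
    "scientificNameAuthorship": "author_year",
    "taxonomicStatus": "nomen_relationship",
}

# Assumed minimum for a usable row; the issue does not define required terms.
REQUIRED_TERMS = ("taxonID", "scientificName")


def read_checklist(text, delimiter="\t"):
    """Parse delimited text with DwC headers into rows keyed by semantic column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    headers = reader.fieldnames or []
    missing = [term for term in REQUIRED_TERMS if term not in headers]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))

    rows = []
    for record in reader:
        # Absent or empty columns become empty strings.
        rows.append({
            semantic: (record.get(term) or "").strip()
            for term, semantic in SEMANTIC_COLUMNS.items()
        })
    return rows
```

Columns from the secondary-priority block are simply ignored here; they could be added to the mapping as they become supported.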
In addition we need several modifications that allow us to integrate NOMEN data, these are my suggestions:
- When I see a NOMEN URI in http://rs.tdwg.org/dwc/terms/#dwc:nomenclaturalStatus, then I add a corresponding TaxonNameClassification in TaxonWorks.
  - A secondary consideration: If I have an ambiguous (i.e. non-URI-based) status in nomenclaturalStatus, then I could try to map it to an unambiguous (NOMEN) class. We may need to set up a YAML or other mapping file as a constant in the NOMEN repo to map text strings to proposed NOMEN classes.
- This is less clear, but: When I see a NOMEN URI in taxonomicStatus, then I create a TaxonNameRelationship in TaxonWorks. I must always have an acceptedNameUsageID to use as the object in the relationship.
  - Similar secondary considerations exist for these as for nomenclaturalStatus.
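The rules above can be sketched as a small planning step that decides, per row, which TaxonWorks records would be created. This is only an illustration: the URI prefix is an assumption about how NOMEN URIs look, the text mappings stand in for the proposed YAML file, and their values are placeholders rather than real NOMEN identifiers. All function and constant names are hypothetical.

```python
# Assumed prefix of NOMEN class URIs.
NOMEN_PREFIX = "http://purl.obolibrary.org/obo/NOMEN_"

# Stand-ins for the proposed text-to-NOMEN mapping file.
# The values are placeholders, NOT real NOMEN identifiers.
STATUS_TEXT_MAP = {"nomen nudum": NOMEN_PREFIX + "PLACEHOLDER"}
RELATIONSHIP_TEXT_MAP = {}


def resolve_nomen_uri(value, text_map):
    """Return a NOMEN URI for value, or None if it cannot be resolved."""
    value = (value or "").strip()
    if not value:
        return None
    if value.startswith(NOMEN_PREFIX):
        return value  # already an unambiguous URI
    return text_map.get(value.lower())  # secondary: try the text mapping


def plan_actions(row):
    """List the records one DwC row would produce, as plain tuples."""
    actions = []

    status_uri = resolve_nomen_uri(row.get("nomenclaturalStatus"), STATUS_TEXT_MAP)
    if status_uri:
        actions.append(("TaxonNameClassification", row["taxonID"], status_uri))

    relationship_uri = resolve_nomen_uri(row.get("taxonomicStatus"), RELATIONSHIP_TEXT_MAP)
    if relationship_uri:
        accepted_id = (row.get("acceptedNameUsageID") or "").strip()
        if not accepted_id:
            # The relationship needs an object, so refuse rather than guess.
            raise ValueError("taxonomicStatus given without acceptedNameUsageID")
        actions.append(
            ("TaxonNameRelationship", row["taxonID"], relationship_uri, accepted_id)
        )
    return actions
```

Unmapped free-text values (for example a plain `accepted`) resolve to `None` and produce no record, which keeps ambiguous statuses from creating wrong classifications.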
@LocoDelAssembly I'm going to throw other considerations here as they come to mind:
- [ ] User provided IDs could (should) be added as local identifiers if user selects a namespace on import
- We might need to add delayed-job-like features (see the gem) to the batch load model for longer-running jobs. Alternatively we can limit the load to, say, 500-1000 lines at a time.
- We might consider various pre-import options.
- For example turning on/off comparisons to existing data rather than just assuming everything is new. In the latter case we can just write everything, even if duplicate.
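Two of the considerations above, limiting a load to a fixed number of lines and toggling comparison against existing data, can be sketched as follows. The names `in_batches` and `rows_to_write` are hypothetical, and matching on name plus author/year is an assumed (and deliberately naive) duplicate criterion.

```python
from itertools import islice


def in_batches(rows, size=500):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def rows_to_write(rows, existing_keys, compare_to_existing=True):
    """Pick the rows to write for one batch.

    With compare_to_existing off, everything is assumed new and written,
    even if duplicate. With it on, rows whose (name, author_year) pair is
    already in existing_keys are skipped.
    """
    if not compare_to_existing:
        return list(rows)
    return [
        row for row in rows
        if (row["name"], row["author_year"]) not in existing_keys
    ]
```

For example, `for batch in in_batches(rows, size=1000): ...` would process a 2500-name file in three passes, each of which could be handed to a background job.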
What is the status of the issue?
@LocoDelAssembly Can you chime in here with where you are at for the Specimen side of things?
I can import collection objects with their OTU+taxon names and geo reference. I hope to put it all together soon, after merging rails6 and preferably biodiversity4 too.
@LocoDelAssembly @cgendreau is in part with the folks from Vascan. I believe one of your test datasets for the code you are writing is the Vascan dataset linked above?
@cgendreau if indeed that dataset/format is of primary interest to you let us know, otherwise we'll focus on the generic checklist format as well.
For checklist yes, I worked with the dataset above.
Yes, "The VASCAN example dataset" is what I was looking for. In other words, could we load it in TaxonWorks and start curating it?
I leave @LocoDelAssembly to comment. I will say we're literally today hoping to move to Rails6, and I think Biodiversity4 PR is OK too.
@LocoDelAssembly If your code works as a Rake task, rather than integrated, could we get their data into a sandbox for them to play with? I'd really like to support their group.
@cgendreau Spoke with @LocoDelAssembly et al. last week; we're targeting ~1 month for the importer. Sorry for the delays!
@cgendreau FYI, if it's still of interest, we have people actively testing the importer. Any feedback would be welcome at this point. We have weekly informal digitization meetings whose current focus includes this issue, among others. Ping me if you want to get more information on the logistics of joining those sessions.
@LocoDelAssembly in that light, perhaps you can throw the VASCAN dataset up on sandworm?
It is always of interest 👍 @CaroleSinou
This will accumulate questions/answers:
https://github.com/SpeciesFileGroup/taxonworks_doc/issues/29
@cgendreau Further note that we've loaded the VASCAN dataset into the sandbox where we are testing the importer. Shoot me an email if you want access details etc. etc.
is it done?
@LocoDelAssembly @LordFlashmeow can you re-summarize what we have available for the checklist format importer that is hidden away in the occurrence UI? Should we make the existing functionality more readily available, close this, and target specific issues like #2737?