odis-arch

Select DwC fields to embed in JSON-LD records

Open pbuttigieg opened this issue 1 year ago • 5 comments

@pieterprovoost

Some non-redundant DwC properties can be of high value for ODIS-level discovery, but @pieterprovoost notes that fully embedding all DwC fields is prohibitive (GB-sized JSON-LD documents would be expected).

In this issue, we'll triage which DwC terms should be embedded in additionalProperty or 'variableMeasured' schema properties
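
To make the target shape concrete, here is a minimal sketch of how a selected DwC term could be wrapped as a schema.org PropertyValue and attached to a Dataset. The helper name `dwc_to_property_value`, the dataset name, and the sample values are hypothetical, and whether a given term belongs in additionalProperty or variableMeasured is exactly what this issue is meant to decide.

```python
import json

# Darwin Core namespace IRIs for the two prefixes used in this sketch.
NAMESPACES = {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "dwciri": "http://rs.tdwg.org/dwc/iri/",
}


def dwc_to_property_value(term, value):
    """Wrap one prefixed DwC term and its value as a schema.org PropertyValue."""
    prefix, _, local_name = term.partition(":")
    return {
        "@type": "PropertyValue",
        "name": local_name,
        "propertyID": NAMESPACES[prefix] + local_name,
        "value": value,
    }


# Hypothetical dataset record; the placement of each term is an assumption.
dataset = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "Example OBIS dataset",
    "additionalProperty": [dwc_to_property_value("dwc:habitat", "seagrass meadow")],
    "variableMeasured": [dwc_to_property_value("dwc:measurementType", "temperature")],
}

print(json.dumps(dataset, indent=2))
```

Keeping the full term IRI in propertyID means the DwC origin of each value stays recoverable after embedding.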

pbuttigieg avatar Jun 24 '24 13:06 pbuttigieg

For reference

Classes

dwc:Dataset | dwc:Event | dwc:EventAttribute | dwc:EventMeasurement | dwc:FossilSpecimen | dwc:GeologicalContext | dwc:HumanObservation | dwc:Identification | dwc:LivingSpecimen | dcterms:Location | dwc:MachineObservation | dwc:MaterialCitation | dwc:MaterialEntity | dwc:MaterialSample | dwc:MeasurementOrFact | dwc:Occurrence | dwc:OccurrenceMeasurement | dwc:Organism | dwc:PreservedSpecimen | dwc:ResourceRelationship | dwc:Sample | dwc:SampleAttribute | dwc:SamplingEvent | dwc:SamplingLocation | dwc:Taxon

Record level

dwc:accordingTo | dwc:accuracy | dwc:basisOfRecord | dwc:collectionCode | dwc:collectionID | dwc:dataGeneralizations | dwc:datasetID | dwc:datasetName | dwc:DwCType | dwc:dynamicProperties | dwc:Generalizations | dwc:informationWithheld | dwc:institutionCode | dwc:institutionID | dwc:ownerInstitutionCode

Dublin Core legacy namespace

dc:language | dc:type

Dublin Core terms namespace

dcterms:accessRights | dcterms:bibliographicCitation | dcterms:language | dcterms:license | dcterms:modified | dcterms:references | dcterms:rights | dcterms:rightsHolder | dcterms:type

Occurrence

dwc:associatedMedia | dwc:associatedOccurrences | dwc:associatedReferences | dwc:associatedTaxa | dwc:behavior | dwc:caste | dwc:catalogNumber | dwc:CatalogNumberNumeric | dwc:degreeOfEstablishment | dwc:establishmentMeans | dwc:georeferenceVerificationStatus | dwc:individualCount | dwc:individualID | dwc:lifeStage | dwc:occurrenceAttributes | dwc:occurrenceDetails | dwc:occurrenceID | dwc:occurrenceRemarks | dwc:occurrenceStatus | dwc:organismQuantity | dwc:organismQuantityType | dwc:otherCatalogNumbers | dwc:pathway | dwc:recordedBy | dwc:recordedByID | dwc:recordNumber | dwc:reproductiveCondition | dwc:sex | dwc:vitality

Organism

dwc:associatedOrganisms | dwc:organismID | dwc:organismName | dwc:organismRemarks | dwc:organismScope | dwc:previousIdentifications

Material Entity

dwc:associatedSequences | dwc:disposition | dwc:materialEntityID | dwc:materialEntityRemarks | dwc:preparations | dwc:verbatimLabel

Material Sample

dwc:materialSampleID

Event

dwc:day | dwc:EarliestDateCollected | dwc:endDayOfYear | dwc:EndTimeOfDay | dwc:eventAttributes | dwc:eventDate | dwc:eventID | dwc:eventRemarks | dwc:eventTime | dwc:eventType | dwc:fieldNotes | dwc:fieldNumber | dwc:habitat | dwc:LatestDateCollected | dwc:month | dwc:parentEventID | dwc:sampleSizeUnit | dwc:sampleSizeValue | dwc:samplingEffort | dwc:samplingProtocol | dwc:startDayOfYear | dwc:StartTimeOfDay | dwc:verbatimEventDate | dwc:year

Location

dwc:continent | dwc:coordinatePrecision | dwc:coordinateUncertaintyInMeters | dwc:country | dwc:countryCode | dwc:county | dwc:decimalLatitude | dwc:decimalLongitude | dwc:footprintSpatialFit | dwc:footprintSRS | dwc:footprintWKT | dwc:geodeticDatum | dwc:georeferencedBy | dwc:georeferencedDate | dwc:georeferenceProtocol | dwc:georeferenceRemarks | dwc:georeferenceSources | dwc:higherGeography | dwc:higherGeographyID | dwc:island | dwc:islandGroup | dwc:locality | dwc:locationAccordingTo | dwc:locationAttributes | dwc:locationID | dwc:locationRemarks | dwc:maximumDepthInMeters | dwc:maximumDistanceAboveSurfaceInMeters | dwc:maximumElevationInMeters | dwc:minimumDepthInMeters | dwc:minimumDistanceAboveSurfaceInMeters | dwc:minimumElevationInMeters | dwc:municipality | dwc:pointRadiusSpatialFit | dwc:SamplingLocationID | dwc:SamplingLocationRemarks | dwc:stateProvince | dwc:verbatimCoordinates | dwc:verbatimCoordinateSystem | dwc:verbatimDepth | dwc:verbatimElevation | dwc:verbatimLatitude | dwc:verbatimLocality | dwc:verbatimLongitude | dwc:verbatimSRS | dwc:verticalDatum | dwc:waterBody

Geological Context

dwc:bed | dwc:earliestAgeOrLowestStage | dwc:earliestEonOrLowestEonothem | dwc:earliestEpochOrLowestSeries | dwc:earliestEraOrLowestErathem | dwc:earliestPeriodOrLowestSystem | dwc:formation | dwc:geologicalContextID | dwc:group | dwc:highestBiostratigraphicZone | dwc:latestAgeOrHighestStage | dwc:latestEonOrHighestEonothem | dwc:latestEpochOrHighestSeries | dwc:latestEraOrHighestErathem | dwc:latestPeriodOrHighestSystem | dwc:lithostratigraphicTerms | dwc:lowestBiostratigraphicZone | dwc:member

Identification

dwc:dateIdentified | dwc:identificationAttributes | dwc:identificationID | dwc:identificationQualifier | dwc:identificationReferences | dwc:identificationRemarks | dwc:identificationVerificationStatus | dwc:identifiedBy | dwc:identifiedByID | dwc:PreviousIdentifications | dwc:typeStatus | dwc:verbatimIdentification

Taxon

dwc:acceptedNameUsage | dwc:acceptedNameUsageID | dwc:acceptedScientificName | dwc:acceptedScientificNameID | dwc:AcceptedTaxon | dwc:AcceptedTaxonID | dwc:acceptedTaxonID | dwc:acceptedTaxonName | dwc:acceptedTaxonNameID | dwc:basionym | dwc:basionymID | dwc:binomial | dwc:class | dwc:cultivarEpithet | dwc:family | dwc:genericName | dwc:genus | dwc:higherClassification | dwc:HigherTaxon | dwc:higherTaxonconceptID | dwc:HigherTaxonID | dwc:higherTaxonName | dwc:higherTaxonNameID | dwc:infragenericEpithet | dwc:infraspecificEpithet | dwc:kingdom | dwc:nameAccordingTo | dwc:nameAccordingToID | dwc:namePublicationID | dwc:namePublishedIn | dwc:namePublishedInID | dwc:namePublishedInYear | dwc:nomenclaturalCode | dwc:nomenclaturalStatus | dwc:order | dwc:originalNameUsage | dwc:originalNameUsageID | dwc:parentNameUsage | dwc:parentNameUsageID | dwc:phylum | dwc:scientificName | dwc:scientificNameAuthorship | dwc:scientificNameID | dwc:scientificNameRank | dwc:specificEpithet | dwc:subfamily | dwc:subgenus | dwc:subtribe | dwc:superfamily | dwc:taxonAccordingTo | dwc:taxonAttributes | dwc:taxonConceptID | dwc:TaxonID | dwc:taxonID | dwc:taxonNameID | dwc:taxonomicStatus | dwc:taxonRank | dwc:taxonRemarks | dwc:tribe | dwc:verbatimScientificNameRank | dwc:verbatimTaxonRank | dwc:vernacularName

Measurement or Fact

dwc:measurementAccuracy | dwc:measurementDeterminedBy | dwc:measurementDeterminedDate | dwc:measurementID | dwc:measurementMethod | dwc:measurementRemarks | dwc:measurementType | dwc:measurementUnit | dwc:measurementValue | dwc:parentMeasurementID

Resource Relationship

dwc:RelatedBasisOfRecord | dwc:relatedResourceID | dwc:relatedResourceType | dwc:relationshipAccordingTo | dwc:relationshipEstablishedDate | dwc:relationshipOfResource | dwc:relationshipOfResourceID | dwc:relationshipRemarks | dwc:resourceID | dwc:resourceRelationshipID

IRI-value terms

dwciri:behavior | dwciri:caste | dwciri:dataGeneralizations | dwciri:degreeOfEstablishment | dwciri:disposition | dwciri:earliestGeochronologicalEra | dwciri:establishmentMeans | dwciri:eventType | dwciri:fieldNotes | dwciri:fieldNumber | dwciri:footprintSRS | dwciri:footprintWKT | dwciri:fromLithostratigraphicUnit | dwciri:geodeticDatum | dwciri:georeferencedBy | dwciri:georeferenceProtocol | dwciri:georeferenceSources | dwciri:georeferenceVerificationStatus | dwciri:habitat | dwciri:identificationQualifier | dwciri:identificationVerificationStatus | dwciri:identifiedBy | dwciri:inCollection | dwciri:inDataset | dwciri:inDescribedPlace | dwciri:informationWithheld | dwciri:latestGeochronologicalEra | dwciri:lifeStage | dwciri:locationAccordingTo | dwciri:measurementDeterminedBy | dwciri:measurementMethod | dwciri:measurementType | dwciri:measurementUnit | dwciri:measurementValue | dwciri:occurrenceStatus | dwciri:organismQuantityType | dwciri:pathway | dwciri:preparations | dwciri:recordedBy | dwciri:recordNumber | dwciri:reproductiveCondition | dwciri:sampleSizeUnit | dwciri:samplingProtocol | dwciri:sex | dwciri:toTaxon | dwciri:typeStatus | dwciri:verbatimCoordinateSystem | dwciri:verbatimSRS | dwciri:verticalDatum | dwciri:vitality

pbuttigieg avatar Jun 24 '24 13:06 pbuttigieg

@pieterprovoost here's a first triage from me.

Notes

  • There are some types that can be pushed to ODIS later (e.g. Event, Taxon), so I'm keeping some of the properties associated with them in this list. If OBIS is only pushing Datasets for now, then only those properties that make sense for that type are relevant.
    • Naturally, some of these can be embedded (i.e. "this dataset is about the following Taxa"), and thus values from the relevant attributes can be added there.
    • This would also apply to some Event metadata, as they would be relevant for the Dataset too. They can be chained via potentialAction, but there may be more direct ways such as adding the date/time ranges of the event the dataset is about to the dataset metadata itself.
  • For Occurrence metadata, I suppose some of this will be embedded in Datasets, so some of those properties are included below. I don't think ODIS harvesting OBIS Occurrence metadata is on the cards yet.
  • There are multiple namespaces (dwc, dwciri, dc, dcterms, etc.). I'm not adding all of these to the list below, but I assume that equivalent terms would be collapsed, or all variants included, in an export so that nothing is missed because of a trivial namespace mismatch.
  • The MeasurementOrFact elements can be tricky, as some may be variableMeasured and others some sort of additionalProperty or descriptive element. Maybe just adding them as additionalProperty entries is best for now. I don't include the properties relevant to these below. They are interesting, but could get massive if included. Exceptions could be made for MIxS terms, or other standards embedded in DwC, perhaps.
  • The Resource / relationship terms are very interesting for linked data, but I think out of scope for now.
  • The Sample terms are also interesting if the Event Type is exported, but less so for Dataset metadata that we're concerned with now. Omitted here.
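
The namespace note above could be handled with something like the following sketch, which groups prefixed terms by local name so that dwc:/dwciri: or dc:/dcterms: variants of the same term land together. The function name `collapse_namespaces` is hypothetical, and matching on the local name alone is an assumption that may be too coarse for some terms.

```python
def collapse_namespaces(terms):
    """Group prefixed terms by local name, recording every prefix seen for it."""
    grouped = {}
    for term in terms:
        prefix, _, local_name = term.partition(":")
        prefixes = grouped.setdefault(local_name, [])
        if prefix not in prefixes:
            prefixes.append(prefix)
    return grouped


# Terms taken from the reference lists in this issue.
selected = collapse_namespaces(
    ["dwc:behavior", "dwciri:behavior", "dc:language", "dcterms:language", "dwc:habitat"]
)

for local_name, prefixes in selected.items():
    print(local_name, prefixes)
```

An export could then emit one entry per local name, or one per prefix, without silently dropping a variant.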

Add to embedding in ODIS records (for taxonomic levels, @pieterprovoost noted these may only go down to order or family so as not to flood the metadata; the rest would be available in the OBIS records):

  • [ ] dwc:acceptedNameUsage
  • [ ] dwc:acceptedNameUsageID
  • [ ] dwc:associatedSequences
  • [ ] dwc:associatedTaxa
  • [ ] dwc:bed
  • [ ] dwciri:behavior / dwc:behavior
  • [ ] dwc:class
  • [ ] dwc:degreeOfEstablishment
  • [ ] dwc:family
  • [ ] dwc:fieldNotes
  • [ ] dwc:fieldNumber
  • [ ] dwc:genericName
  • [ ] dwc:genus
  • [ ] dwc:GeologicalContext
  • [ ] dwc:habitat
  • [ ] dwc:higherClassification
  • [ ] dwc:identifiedBy
  • [ ] dwc:informationWithheld
  • [ ] dwc:kingdom
  • [ ] dc:language
  • [ ] dwc:MaterialEntity
  • [ ] dwc:MaterialSample (cross-links to #376)
  • [ ] dwc:materialSampleID
  • [ ] dwc:nomenclaturalCode
  • [ ] dwc:Occurrence
  • [ ] dwc:occurrenceDetails
  • [ ] dwc:occurrenceRemarks
  • [ ] dwc:order
  • [ ] dwc:originalNameUsage
  • [ ] dwc:phylum
  • [ ] dwc:scientificName
  • [ ] dwc:superfamily
  • [ ] dwc:taxonAttributes
  • [ ] dwc:taxonID
  • [ ] dwc:verbatimIdentification
  • [ ] dwc:vernacularName

Map to schema.org properties

  • [ ] dcterms:accessRights
  • [ ] dwc:associatedMedia
  • [ ] dwc:associatedReferences
  • [ ] dcterms:bibliographicCitation
  • [ ] dwc:continent
  • [ ] dwc:country / dwc:countryCode
  • [ ] dwc:county
  • [ ] dwc:dataGeneralizations (to additional description or similar)
  • [ ] dwc:datasetID
  • [ ] dwc:datasetName
  • [ ] dwc:day
  • [ ] dwc:endDayOfYear
  • [ ] dwc:establishmentMeans
  • [ ] dwc:eventDate
  • [ ] dwc:eventID
  • [ ] dwc:eventRemarks (comment or description on schema:Event)
  • [ ] dwc:eventTime
  • [ ] dwc:GeologicalContext
  • [ ] dwc:higherGeography
  • [ ] dwciri:inDataset (subjectOf)
  • [ ] dwc:institutionCode
  • [ ] dwc:institutionID
  • [ ] dcterms:license
  • [ ] dwc:measurementMethod
  • [ ] dcterms:modified (sd properties)
  • [ ] dwc:month
  • [ ] dcterms:references
  • [ ] dwc:relatedResourceID
  • [ ] dcterms:rights
  • [ ] dcterms:rightsHolder
  • [ ] dwc:startDayOfYear
  • [ ] dc:type
  • [ ] dwc:year
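
As a rough illustration of what such a mapping could look like in code, the sketch below uses a lookup table for a handful of the terms above. The four target properties are my own assumptions about plausible schema.org equivalents and are not agreed in this issue; the function name and sample record are hypothetical.

```python
# Assumed DwC / Dublin Core to schema.org mappings, for illustration only.
SCHEMA_ORG_MAPPING = {
    "dwc:datasetName": "name",
    "dwc:datasetID": "identifier",
    "dcterms:license": "license",
    "dcterms:rightsHolder": "copyrightHolder",
}


def map_to_schema_org(record):
    """Split a record into mapped schema.org properties and still-unmapped terms."""
    mapped, unmapped = {}, {}
    for term, value in record.items():
        target = SCHEMA_ORG_MAPPING.get(term)
        if target is not None:
            mapped[target] = value
        else:
            unmapped[term] = value
    return mapped, unmapped


# Hypothetical record values.
mapped, unmapped = map_to_schema_org(
    {
        "dwc:datasetName": "Example OBIS dataset",
        "dcterms:license": "https://creativecommons.org/licenses/by/4.0/",
        "dwc:habitat": "seagrass meadow",
    }
)
print(mapped)
print(unmapped)
```

Returning the unmapped terms separately makes it easy to route them to additionalProperty instead of dropping them.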

Spatial: map to GeoJSON and/or schema.org spatial properties in their stanzas. This may be a bit involved, but worth it (I assume many of these are already mapped):

  • [ ] dwc:coordinatePrecision
  • [ ] dwc:coordinateUncertaintyInMeters
  • [ ] dwc:decimalLatitude
  • [ ] dwc:decimalLongitude
  • [ ] dwc:footprintSRS
  • [ ] dwc:footprintWKT
  • [ ] dwc:geodeticDatum
  • [ ] dwc:locality
  • [ ] dwc:locationRemarks
  • [ ] dwc:maximumDepthInMeters (xref #377)
  • [ ] dwc:maximumDistanceAboveSurfaceInMeters
  • [ ] dwc:maximumElevationInMeters
  • [ ] dwc:minimumDepthInMeters
  • [ ] dwc:minimumDistanceAboveSurfaceInMeters
  • [ ] dwc:minimumElevationInMeters
  • [ ] dwc:municipality
  • [ ] dwc:stateProvince
  • [ ] dwc:verbatimCoordinates
  • [ ] dwc:verbatimCoordinateSystem
  • [ ] dwc:verbatimDepth
  • [ ] dwc:verbatimElevation
  • [ ] dwc:verbatimEventDate
  • [ ] dwc:verbatimLatitude
  • [ ] dwc:verbatimLocality
  • [ ] dwc:verbatimLongitude
  • [ ] dwc:verbatimSRS
  • [ ] dwc:verticalDatum
  • [ ] dwc:waterBody

pbuttigieg avatar Jun 24 '24 14:06 pbuttigieg

Most of these fields will have high cardinality, how do you envision we handle this in metadata documents?

pieterprovoost avatar Jun 24 '24 15:06 pieterprovoost

> Most of these fields will have high cardinality, how do you envision we handle this in metadata documents?

In the sense that a Dataset can be about potentially thousands of Taxa? Aggregation at higher ranks I think.
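
A minimal sketch of that aggregation, assuming occurrence records are available as dicts keyed by prefixed DwC terms: it collects the distinct values at each rank down to a chosen cut-off (order or family, per the note above) and ignores anything below it. The function name, rank list, and sample records are hypothetical.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family"]


def aggregate_taxa(occurrences, lowest_rank="family"):
    """Collect distinct taxon names per rank, down to and including lowest_rank."""
    ranks = RANKS[: RANKS.index(lowest_rank) + 1]
    summary = {rank: set() for rank in ranks}
    for occurrence in occurrences:
        for rank in ranks:
            value = occurrence.get(f"dwc:{rank}")
            if value:
                summary[rank].add(value)
    # Sorted lists keep the output stable between exports.
    return {rank: sorted(values) for rank, values in summary.items()}


# Hypothetical occurrence records.
occurrences = [
    {"dwc:order": "Decapoda", "dwc:family": "Portunidae", "dwc:genus": "Carcinus"},
    {"dwc:order": "Decapoda", "dwc:family": "Cancridae"},
    {"dwc:order": "Perciformes"},
]

print(aggregate_taxa(occurrences, lowest_rank="order"))
```

Thousands of species-level records then reduce to a short list per rank, which is what would go into the Dataset metadata.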

For Events, maybe taking the extreme values of space and time and creating an inclusive pocket to push to the Dataset metadata.
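
Roughly, that could be computed as in the sketch below, which takes the minimum and maximum coordinates and dates over a dataset's events. It assumes every dwc:eventDate is a single ISO 8601 date (so plain string sorting works), it does not handle envelopes crossing the antimeridian, and the "south west north east" order in the schema.org box string is an assumption to check against the ODIS conventions. The function name and sample events are hypothetical.

```python
def dataset_envelope(events):
    """Build spatialCoverage and temporalCoverage from the extremes of the events."""
    lats = [float(e["dwc:decimalLatitude"]) for e in events]
    lons = [float(e["dwc:decimalLongitude"]) for e in events]
    # ISO 8601 dates in YYYY-MM-DD form sort correctly as strings.
    dates = sorted(e["dwc:eventDate"] for e in events)

    box = f"{min(lats)} {min(lons)} {max(lats)} {max(lons)}"
    return {
        "spatialCoverage": {
            "@type": "Place",
            "geo": {"@type": "GeoShape", "box": box},
        },
        "temporalCoverage": f"{dates[0]}/{dates[-1]}",
    }


# Hypothetical events.
events = [
    {"dwc:decimalLatitude": "51.2", "dwc:decimalLongitude": "3.1", "dwc:eventDate": "2019-05-02"},
    {"dwc:decimalLatitude": "51.6", "dwc:decimalLongitude": "2.5", "dwc:eventDate": "2018-07-14"},
]

print(dataset_envelope(events))
```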

For things that hang off of Occurrence, like dwc:habitat, that's trickier - arrays in Dataset properties like 'about' come to mind, but this may be one too many jumps.

If OBIS eventually releases truncated metadata about the other types (Events, Taxa, maybe Occurrences for specific species [e.g. of concern, keystones, invasives]) this would of course be easier from the Dataset metadata (via @id referencing). Maybe that can wait for that stage.

These are fields that I think would be useful for ODIS-level discovery of OBIS resources - if adding them is prohibitively complex or would put prohibitive demands on the systems involved, we can mark them for later consideration.

Could you check mark the terms above that you think are the most feasible to add now? We can discuss how to add some high-value ones that are harder in a meeting perhaps.

pbuttigieg avatar Jun 24 '24 15:06 pbuttigieg

And I'm quite sure that some of the DwC value syntax will conflict with schema.org / OGC constraints - that's important to note, even if those properties don't make it into the JSON-LD/schema.org products. Those are a basis to trigger later alignment of the standards themselves, hopefully.

pbuttigieg avatar Jun 24 '24 15:06 pbuttigieg