Select DwC fields to embed in JSON-LD records
@pieterprovoost
Some non-redundant DwC properties can be of high value for ODIS-level discovery, but @pieterprovoost notes that fully embedding all DwC fields is prohibitive (GB-sized JSON-LD documents would be expected).
In this issue, we'll triage which DwC terms should be embedded in the additionalProperty or variableMeasured schema.org properties.
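To make the target concrete, here is a minimal sketch of one DwC term carried as a schema.org PropertyValue inside additionalProperty. The helper name `dwc_additional_property`, the dataset name, and the habitat value are hypothetical and only for illustration; whether additionalProperty or variableMeasured is the right carrier for a given term is what this issue is meant to settle.

```python
import json

# Darwin Core terms namespace, used to build a resolvable propertyID.
DWC_NS = "http://rs.tdwg.org/dwc/terms/"


def dwc_additional_property(term, value):
    """Wrap one DwC term/value pair as a schema.org PropertyValue (hypothetical helper)."""
    return {
        "@type": "PropertyValue",
        "propertyID": DWC_NS + term,
        "name": "dwc:" + term,
        "value": value,
    }


dataset = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "Example dataset",  # placeholder, not a real OBIS dataset
    "additionalProperty": [
        dwc_additional_property("habitat", "seagrass meadow"),
    ],
}

print(json.dumps(dataset, indent=2))
```

Keeping the full term IRI in propertyID means the DwC origin of each value stays traceable even if the short name changes in an export.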
For reference
Classes
dwc:Dataset | dwc:Event | dwc:EventAttribute | dwc:EventMeasurement | dwc:FossilSpecimen | dwc:GeologicalContext | dwc:HumanObservation | dwc:Identification | dwc:LivingSpecimen | dcterms:Location | dwc:MachineObservation | dwc:MaterialCitation | dwc:MaterialEntity | dwc:MaterialSample | dwc:MeasurementOrFact | dwc:Occurrence | dwc:OccurrenceMeasurement | dwc:Organism | dwc:PreservedSpecimen | dwc:ResourceRelationship | dwc:Sample | dwc:SampleAttribute | dwc:SamplingEvent | dwc:SamplingLocation | dwc:Taxon
Record level
dwc:accordingTo | dwc:accuracy | dwc:basisOfRecord | dwc:collectionCode | dwc:collectionID | dwc:dataGeneralizations | dwc:datasetID | dwc:datasetName | dwc:DwCType | dwc:dynamicProperties | dwc:Generalizations | dwc:informationWithheld | dwc:institutionCode | dwc:institutionID | dwc:ownerInstitutionCode
Dublin Core legacy namespace
dc:language | dc:type
Dublin Core terms namespace
dcterms:accessRights | dcterms:bibliographicCitation | dcterms:language | dcterms:license | dcterms:modified | dcterms:references | dcterms:rights | dcterms:rightsHolder | dcterms:type
Occurrence
dwc:associatedMedia | dwc:associatedOccurrences | dwc:associatedReferences | dwc:associatedTaxa | dwc:behavior | dwc:caste | dwc:catalogNumber | dwc:CatalogNumberNumeric | dwc:degreeOfEstablishment | dwc:establishmentMeans | dwc:georeferenceVerificationStatus | dwc:individualCount | dwc:individualID | dwc:lifeStage | dwc:occurrenceAttributes | dwc:occurrenceDetails | dwc:occurrenceID | dwc:occurrenceRemarks | dwc:occurrenceStatus | dwc:organismQuantity | dwc:organismQuantityType | dwc:otherCatalogNumbers | dwc:pathway | dwc:recordedBy | dwc:recordedByID | dwc:recordNumber | dwc:reproductiveCondition | dwc:sex | dwc:vitality
Organism
dwc:associatedOrganisms | dwc:organismID | dwc:organismName | dwc:organismRemarks | dwc:organismScope | dwc:previousIdentifications
Material Entity
dwc:associatedSequences | dwc:disposition | dwc:materialEntityID | dwc:materialEntityRemarks | dwc:preparations | dwc:verbatimLabel
Material Sample
dwc:materialSampleID
Event
dwc:day | dwc:EarliestDateCollected | dwc:endDayOfYear | dwc:EndTimeOfDay | dwc:eventAttributes | dwc:eventDate | dwc:eventID | dwc:eventRemarks | dwc:eventTime | dwc:eventType | dwc:fieldNotes | dwc:fieldNumber | dwc:habitat | dwc:LatestDateCollected | dwc:month | dwc:parentEventID | dwc:sampleSizeUnit | dwc:sampleSizeValue | dwc:samplingEffort | dwc:samplingProtocol | dwc:startDayOfYear | dwc:StartTimeOfDay | dwc:verbatimEventDate | dwc:year
Location
dwc:continent | dwc:coordinatePrecision | dwc:coordinateUncertaintyInMeters | dwc:country | dwc:countryCode | dwc:county | dwc:decimalLatitude | dwc:decimalLongitude | dwc:footprintSpatialFit | dwc:footprintSRS | dwc:footprintWKT | dwc:geodeticDatum | dwc:georeferencedBy | dwc:georeferencedDate | dwc:georeferenceProtocol | dwc:georeferenceRemarks | dwc:georeferenceSources | dwc:higherGeography | dwc:higherGeographyID | dwc:island | dwc:islandGroup | dwc:locality | dwc:locationAccordingTo | dwc:locationAttributes | dwc:locationID | dwc:locationRemarks | dwc:maximumDepthInMeters | dwc:maximumDistanceAboveSurfaceInMeters | dwc:maximumElevationInMeters | dwc:minimumDepthInMeters | dwc:minimumDistanceAboveSurfaceInMeters | dwc:minimumElevationInMeters | dwc:municipality | dwc:pointRadiusSpatialFit | dwc:SamplingLocationID | dwc:SamplingLocationRemarks | dwc:stateProvince | dwc:verbatimCoordinates | dwc:verbatimCoordinateSystem | dwc:verbatimDepth | dwc:verbatimElevation | dwc:verbatimLatitude | dwc:verbatimLocality | dwc:verbatimLongitude | dwc:verbatimSRS | dwc:verticalDatum | dwc:waterBody
Geological Context
dwc:bed | dwc:earliestAgeOrLowestStage | dwc:earliestEonOrLowestEonothem | dwc:earliestEpochOrLowestSeries | dwc:earliestEraOrLowestErathem | dwc:earliestPeriodOrLowestSystem | dwc:formation | dwc:geologicalContextID | dwc:group | dwc:highestBiostratigraphicZone | dwc:latestAgeOrHighestStage | dwc:latestEonOrHighestEonothem | dwc:latestEpochOrHighestSeries | dwc:latestEraOrHighestErathem | dwc:latestPeriodOrHighestSystem | dwc:lithostratigraphicTerms | dwc:lowestBiostratigraphicZone | dwc:member
Identification
dwc:dateIdentified | dwc:identificationAttributes | dwc:identificationID | dwc:identificationQualifier | dwc:identificationReferences | dwc:identificationRemarks | dwc:identificationVerificationStatus | dwc:identifiedBy | dwc:identifiedByID | dwc:PreviousIdentifications | dwc:typeStatus | dwc:verbatimIdentification
Taxon
dwc:acceptedNameUsage | dwc:acceptedNameUsageID | dwc:acceptedScientificName | dwc:acceptedScientificNameID | dwc:AcceptedTaxon | dwc:AcceptedTaxonID | dwc:acceptedTaxonID | dwc:acceptedTaxonName | dwc:acceptedTaxonNameID | dwc:basionym | dwc:basionymID | dwc:binomial | dwc:class | dwc:cultivarEpithet | dwc:family | dwc:genericName | dwc:genus | dwc:higherClassification | dwc:HigherTaxon | dwc:higherTaxonconceptID | dwc:HigherTaxonID | dwc:higherTaxonName | dwc:higherTaxonNameID | dwc:infragenericEpithet | dwc:infraspecificEpithet | dwc:kingdom | dwc:nameAccordingTo | dwc:nameAccordingToID | dwc:namePublicationID | dwc:namePublishedIn | dwc:namePublishedInID | dwc:namePublishedInYear | dwc:nomenclaturalCode | dwc:nomenclaturalStatus | dwc:order | dwc:originalNameUsage | dwc:originalNameUsageID | dwc:parentNameUsage | dwc:parentNameUsageID | dwc:phylum | dwc:scientificName | dwc:scientificNameAuthorship | dwc:scientificNameID | dwc:scientificNameRank | dwc:specificEpithet | dwc:subfamily | dwc:subgenus | dwc:subtribe | dwc:superfamily | dwc:taxonAccordingTo | dwc:taxonAttributes | dwc:taxonConceptID | dwc:TaxonID | dwc:taxonID | dwc:taxonNameID | dwc:taxonomicStatus | dwc:taxonRank | dwc:taxonRemarks | dwc:tribe | dwc:verbatimScientificNameRank | dwc:verbatimTaxonRank | dwc:vernacularName
Measurement or Fact
dwc:measurementAccuracy | dwc:measurementDeterminedBy | dwc:measurementDeterminedDate | dwc:measurementID | dwc:measurementMethod | dwc:measurementRemarks | dwc:measurementType | dwc:measurementUnit | dwc:measurementValue | dwc:parentMeasurementID
Resource Relationship
dwc:RelatedBasisOfRecord | dwc:relatedResourceID | dwc:relatedResourceType | dwc:relationshipAccordingTo | dwc:relationshipEstablishedDate | dwc:relationshipOfResource | dwc:relationshipOfResourceID | dwc:relationshipRemarks | dwc:resourceID | dwc:resourceRelationshipID
IRI-value terms
dwciri:behavior | dwciri:caste | dwciri:dataGeneralizations | dwciri:degreeOfEstablishment | dwciri:disposition | dwciri:earliestGeochronologicalEra | dwciri:establishmentMeans | dwciri:eventType | dwciri:fieldNotes | dwciri:fieldNumber | dwciri:footprintSRS | dwciri:footprintWKT | dwciri:fromLithostratigraphicUnit | dwciri:geodeticDatum | dwciri:georeferencedBy | dwciri:georeferenceProtocol | dwciri:georeferenceSources | dwciri:georeferenceVerificationStatus | dwciri:habitat | dwciri:identificationQualifier | dwciri:identificationVerificationStatus | dwciri:identifiedBy | dwciri:inCollection | dwciri:inDataset | dwciri:inDescribedPlace | dwciri:informationWithheld | dwciri:latestGeochronologicalEra | dwciri:lifeStage | dwciri:locationAccordingTo | dwciri:measurementDeterminedBy | dwciri:measurementMethod | dwciri:measurementType | dwciri:measurementUnit | dwciri:measurementValue | dwciri:occurrenceStatus | dwciri:organismQuantityType | dwciri:pathway | dwciri:preparations | dwciri:recordedBy | dwciri:recordNumber | dwciri:reproductiveCondition | dwciri:sampleSizeUnit | dwciri:samplingProtocol | dwciri:sex | dwciri:toTaxon | dwciri:typeStatus | dwciri:verbatimCoordinateSystem | dwciri:verbatimSRS | dwciri:verticalDatum | dwciri:vitality
@pieterprovoost here's a first triage from me.
Notes
- There are some types that can be pushed to ODIS later (e.g. Event, Taxon), so I'm keeping some of the properties associated with them in this list. If OBIS is only pushing Datasets for now, then only those properties that make sense for that type are relevant.
- Naturally, some of these can be embedded (i.e. "this dataset is about the following Taxa"), and thus values from the relevant attributes can be added there.
- This would also apply to some Event metadata, as they would be relevant for the Dataset too. They can be chained via potentialAction, but there may be more direct ways, such as adding the date/time ranges of the event the dataset is about to the dataset metadata itself.
- For Occurrence metadata, I suppose some of this will be embedded in Datasets, so some of those properties are included below. I don't think ODIS harvesting OBIS Occurrence metadata is on the cards yet.
- There are multiple namespaces (dwc, dwciri, dc, dcterms, etc.). I'm not adding all of these to the lists below, but I assume they would be collapsed, or all included, in an export so that nothing is missed because of a trivial namespace mismatch.
- The MeasurementOrFact elements can be tricky, as some may map to variableMeasured and others to some sort of additionalProperty or descriptive element. Maybe just adding them as additionalProperty entries is best for now. I don't include the properties relevant to these below. They are interesting, but could get massive if included. Exceptions could perhaps be made for MIxS terms, or other standards embedded in DwC.
- The Resource / relationship terms are very interesting for linked data, but I think out of scope for now.
- The Sample terms are also interesting if the Event Type is exported, but less so for Dataset metadata that we're concerned with now. Omitted here.
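As a rough illustration of the namespace note above, the sketch below groups values by local term name so that dwc/dwciri and dc/dcterms variants of the same term end up together. The function name `collapse_namespaces` and the sample values (including the example.org IRI) are made up, and it assumes that terms sharing a local name are close enough in meaning to merge, which would need checking term by term.

```python
from collections import defaultdict


def collapse_namespaces(record):
    """Group values by local term name, ignoring the namespace prefix.

    Assumption: terms with the same local name (e.g. dwc:behavior and
    dwciri:behavior) describe the same thing and may share one slot.
    """
    collapsed = defaultdict(list)
    for key, value in record.items():
        # "dwciri:behavior" -> local name "behavior"; keys without a
        # prefix are kept unchanged.
        _prefix, _sep, local = key.rpartition(":")
        if value not in collapsed[local]:
            collapsed[local].append(value)
    return dict(collapsed)


record = {
    "dwc:behavior": "foraging",
    "dwciri:behavior": "http://example.org/behavior/foraging",  # placeholder IRI
    "dc:language": "en",
    "dcterms:language": "en",
}

print(collapse_namespaces(record))
```

Both the literal and the IRI form of behavior are kept, while the duplicated language value is stored once.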
Add to embedding in ODIS records (for taxonomic levels, @pieterprovoost noted these may only go down to order or family so as not to flood the metadata; the rest would be available in the OBIS records):
- [ ] dwc:acceptedNameUsage
- [ ] dwc:acceptedNameUsageID
- [ ] dwc:associatedSequences
- [ ] dwc:associatedTaxa
- [ ] dwc:bed
- [ ] dwciri:behavior / dwc:behavior
- [ ] dwc:class
- [ ] dwc:degreeOfEstablishment
- [ ] dwc:family
- [ ] dwc:fieldNotes
- [ ] dwc:fieldNumber
- [ ] dwc:genericName
- [ ] dwc:genus
- [ ] dwc:GeologicalContext
- [ ] dwc:habitat
- [ ] dwc:higherClassification
- [ ] dwc:identifiedBy
- [ ] dwc:informationWithheld
- [ ] dwc:kingdom
- [ ] dc:language
- [ ] dwc:MaterialEntity
- [ ] dwc:MaterialSample (cross-links to #376)
- [ ] dwc:materialSampleID
- [ ] dwc:nomenclaturalCode
- [ ] dwc:Occurrence
- [ ] dwc:occurrenceDetails
- [ ] dwc:occurrenceRemarks
- [ ] dwc:order
- [ ] dwc:originalNameUsage
- [ ] dwc:phylum
- [ ] dwc:scientificName
- [ ] dwc:superfamily
- [ ] dwc:taxonAttributes
- [ ] dwc:taxonID
- [ ] dwc:verbatimIdentification
- [ ] dwc:vernacularName
Map to schema.org properties
- [ ] dcterms:accessRights
- [ ] dwc:associatedMedia
- [ ] dwc:associatedReferences
- [ ] dcterms:bibliographicCitation
- [ ] dwc:continent
- [ ] dwc:country / dwc:countryCode
- [ ] dwc:county
- [ ] dwc:dataGeneralizations (to additional description or similar)
- [ ] dwc:datasetID
- [ ] dwc:datasetName
- [ ] dwc:day
- [ ] dwc:endDayOfYear
- [ ] dwc:establishmentMeans
- [ ] dwc:eventDate
- [ ] dwc:eventID
- [ ] dwc:eventRemarks (comment or description on schema:Event)
- [ ] dwc:eventTime
- [ ] dwc:GeologicalContext
- [ ] dwc:higherGeography
- [ ] dwciri:inDataset (subjectOf)
- [ ] dwc:institutionCode
- [ ] dwc:institutionID
- [ ] dcterms:license
- [ ] dwc:measurementMethod
- [ ] dcterms:modified (sd properties)
- [ ] dwc:month
- [ ] dcterms:references
- [ ] dwc:relatedResourceID
- [ ] dcterms:rights
- [ ] dcterms:rightsHolder
- [ ] dwc:startDayOfYear
- [ ] dc:type
- [ ] dwc:year
Spatial mapping to GeoJSON and/or schema.org spatial properties in their stanzas. This may be a bit involved, but worth it (I assume many of these are already mapped):
- [ ] dwc:coordinatePrecision
- [ ] dwc:coordinateUncertaintyInMeters
- [ ] dwc:decimalLatitude
- [ ] dwc:decimalLongitude
- [ ] dwc:footprintSRS
- [ ] dwc:footprintWKT
- [ ] dwc:geodeticDatum
- [ ] dwc:locality
- [ ] dwc:locationRemarks
- [ ] dwc:maximumDepthInMeters (xref #377)
- [ ] dwc:maximumDistanceAboveSurfaceInMeters
- [ ] dwc:maximumElevationInMeters
- [ ] dwc:minimumDepthInMeters
- [ ] dwc:minimumDistanceAboveSurfaceInMeters
- [ ] dwc:minimumElevationInMeters
- [ ] dwc:municipality
- [ ] dwc:stateProvince
- [ ] dwc:verbatimCoordinates
- [ ] dwc:verbatimCoordinateSystem
- [ ] dwc:verbatimDepth
- [ ] dwc:verbatimElevation
- [ ] dwc:verbatimEventDate
- [ ] dwc:verbatimLatitude
- [ ] dwc:verbatimLocality
- [ ] dwc:verbatimLongitude
- [ ] dwc:verbatimSRS
- [ ] dwc:verticalDatum
- [ ] dwc:waterBody
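For the simplest case in the spatial list, a point record might be mapped along the lines below. This is only a sketch: the function name `to_geo` and the sample record are hypothetical, a missing dwc:geodeticDatum is assumed to mean WGS 84, and depth, elevation, footprints, and uncertainty are left out. Note that GeoJSON orders coordinates as longitude, latitude and assumes WGS 84, so records in another datum would need reprojecting first.

```python
# Datum labels treated as WGS 84 in this sketch (an assumption, not an
# exhaustive list of the spellings found in real DwC data).
WGS84_NAMES = {"WGS84", "EPSG:4326"}


def to_geo(record):
    """Map DwC point coordinates to schema.org GeoCoordinates and a GeoJSON Point."""
    datum = record.get("geodeticDatum", "WGS84")  # assumed default
    if datum not in WGS84_NAMES:
        raise ValueError(f"reproject from {datum} before export")

    lat = float(record["decimalLatitude"])
    lon = float(record["decimalLongitude"])

    schema_geo = {"@type": "GeoCoordinates", "latitude": lat, "longitude": lon}
    # GeoJSON positions are [longitude, latitude].
    geojson = {"type": "Point", "coordinates": [lon, lat]}
    return schema_geo, geojson


record = {
    "decimalLatitude": "51.2",
    "decimalLongitude": "2.9",
    "geodeticDatum": "EPSG:4326",
}
schema_geo, geojson = to_geo(record)
print(schema_geo)
print(geojson)
```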
Most of these fields will have high cardinality. How do you envision we handle this in metadata documents?
In the sense that a Dataset can be about potentially thousands of Taxa? Aggregation at higher ranks I think.
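A minimal sketch of what aggregation at a higher rank could look like, assuming occurrence records arrive as plain dictionaries keyed by DwC local names. The function name `aggregate_taxa` and the three sample records are illustrative only, and using schema.org Taxon entries under about is one possible target rather than an agreed mapping.

```python
def aggregate_taxa(records, rank="family"):
    """Collapse occurrence-level taxonomy to the unique names at one rank."""
    names = sorted({r[rank] for r in records if r.get(rank)})
    return [{"@type": "Taxon", "name": n, "taxonRank": rank} for n in names]


# Illustrative occurrence records, not taken from a real dataset.
records = [
    {"scientificName": "Gadus morhua", "family": "Gadidae", "order": "Gadiformes"},
    {"scientificName": "Melanogrammus aeglefinus", "family": "Gadidae", "order": "Gadiformes"},
    {"scientificName": "Clupea harengus", "family": "Clupeidae", "order": "Clupeiformes"},
]

about = aggregate_taxa(records, rank="family")
print(about)
```

Three species collapse to two family entries here; choosing order instead of family would shrink the list further at the cost of discovery precision.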
For Events, maybe taking the extreme values of space and time and creating an inclusive pocket to push to the Dataset metadata.
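The "inclusive pocket" idea could be sketched as follows, assuming every event has point coordinates and a single-day ISO 8601 dwc:eventDate (real data also contains ranges and partial dates, which this does not handle). The function name `dataset_extent` and the sample events are hypothetical, the latitude-first order in the box string is an assumption to confirm against the ODIS patterns, and extents crossing the antimeridian are not handled.

```python
def dataset_extent(events):
    """Derive one bounding box and one date range from a list of events."""
    lats = [float(e["decimalLatitude"]) for e in events]
    lons = [float(e["decimalLongitude"]) for e in events]
    # Same-format ISO 8601 dates (YYYY-MM-DD) sort correctly as strings.
    dates = sorted(e["eventDate"] for e in events)

    # Assumed order: south west north east (latitude first).
    box = f"{min(lats)} {min(lons)} {max(lats)} {max(lons)}"
    return {
        "spatialCoverage": {
            "@type": "Place",
            "geo": {"@type": "GeoShape", "box": box},
        },
        "temporalCoverage": f"{dates[0]}/{dates[-1]}",
    }


# Illustrative events, not taken from a real dataset.
events = [
    {"decimalLatitude": "51.2", "decimalLongitude": "2.9", "eventDate": "2019-05-01"},
    {"decimalLatitude": "53.5", "decimalLongitude": "4.1", "eventDate": "2021-09-15"},
    {"decimalLatitude": "52.0", "decimalLongitude": "3.3", "eventDate": "2020-01-10"},
]

extent = dataset_extent(events)
print(extent)
```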
For things that hang off of Occurrence, like dwc:habitat, that's trickier - arrays in Dataset properties like about come to mind, but this may be one too many jumps.
If OBIS eventually releases truncated metadata about the other types (Events, Taxa, maybe Occurrences for specific species [e.g. of concern, keystones, invasives]) this would of course be easier from the Dataset metadata (via @id referencing). Maybe that can wait for that stage.
These are fields that I think would be useful for ODIS-level discovery of OBIS resources - if adding them is prohibitively complex or would put prohibitive demands on the systems involved, we can mark them for later consideration.
Could you check off the terms above that you think are the most feasible to add now? We can perhaps discuss in a meeting how to add some of the high-value ones that are harder.
And I'm quite sure that some of the DwC value syntax will conflict with schema.org / OGC constraints - that's important to note, even if those properties don't make it into the JSON-LD/schema.org products. Those are a basis to trigger later alignment of the standards themselves, hopefully.