How to reference an observatory or mission in a Dataset record?
We are interested in referencing the observatory or mission that generated the dataset in the schema.org Dataset record. We found a few options and would like input from this group, ideally resulting in an update to the SOSO guidance. Thanks to Zach Boquet for the first JSON example.
OPTION 1: use the "producer" field
{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#"
  },
  "@id": "https://doi.org/10.concept/doi",
  "@type": "Dataset",
  ….
  "producer": [
    {
      "@type": "ResearchProject",
      "@id": "spase://SMWG/Observatory/MMS",
      "name": "MMS",
      "url": "https://mms.gsfc.nasa.gov/"
    },
    {
      "@type": "ResearchProject",
      "@id": "spase://SMWG/Observatory/MMS/4",
      "name": "MMS-4",
      "url": "https://mms.gsfc.nasa.gov/"
    }
  ],
  ...
}
pros: no special items needed; can indicate individual portions of a mission (e.g. MMS has 4 spacecraft, so the record can indicate the MMS mission AND which spacecraft).
cons: doesn't seem to fit the definition of "producer".
OPTION 2: use prov:wasGeneratedBy, similar to the current guidance for software (https://www.w3.org/TR/prov-o/#wasGeneratedBy). The example is edited from the current SOSO guidelines for software; please check it.
{
  "@context": [
    "https://schema.org/",
    {
      "prov": "http://www.w3.org/ns/prov#",
      "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
    }
  ],
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "schema:isBasedOn": { "@id": "https://doi.org/10.xxxx/Dataset-1" },
  "prov:wasGeneratedBy": [
    {
      "@type": "ResearchProject",
      "@id": "spase://SMWG/Observatory/MMS",
      "name": "MMS",
      "url": "https://mms.gsfc.nasa.gov/"
    },
    {
      "@type": "ResearchProject",
      "@id": "spase://SMWG/Observatory/MMS/4",
      "name": "MMS-4",
      "url": "https://mms.gsfc.nasa.gov/"
    }
  ]
}
pros: likely much simpler; the type can be aligned with the choice made for the observing network work.
cons: could be confused with the current guidelines explaining how to indicate that software was used to generate the dataset.
Interesting issue, @rebeccaringuette. I think I prefer the first option, but I'm unclear why it doesn't fit the definition of producer.
Regarding the second option, in the PROV model I think it would be better if you use the prov:wasAttributedTo property to indicate that the prov:Entity (in this case a Dataset) should be attributed to the project(s). The prov:wasGeneratedBy property doesn't seem right to me because it has as its range a prov:Activity, which represents a process that does the generation (such as the activity of executing a software program), and not the people, projects or organizations that performed the activity. Another way of looking at it is that I think that your Project should be modeled as a subclass of prov:Agent and not as a subclass of prov:Activity.
Well, my hesitation is that the institution that initially serves the data might be the better entry into the producer field rather than the mission or observatory. Another option to consider here might be to use measurementTechnique for this purpose since it puts links to missions and observatories right next to the instrument links. So it seems there are three options:
- schema:producer
- prov:wasAttributedTo
- schema:measurementTechnique
We are interested in using whichever method is already used by the science community so that our links to missions/observatories are more easily discovered. Link to current version of an example schema.org record: https://github.com/Kurokio/soso-spase/blob/draft/SOSO_Draft.json
@BaptisteCecconi Are you currently using schema.org to map observatories in some way? What are your thoughts? We are currently leaning towards using schema:producer, but as Matt suggested above, we could instead use prov:wasAttributedTo if that is preferred.
To me, schema:producer is really about media production entities (as in: the entity that oversees, manages, and organises the production of a film or a music album). In our case, the example "producers" are spacecraft objects, which need to be abstracted in order to include the team managing the instruments and "producing" the dataset.
I would prefer the PROV model here. However (and we already stumbled upon this), our information model is not really rich enough for fully adopting the PROV ontology.
The basics of the PROV ontology are shown in this figure:
I recall the definitions of the Prov ontology classes (copy pasted from PROV-O):
- A prov:Entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
- A prov:Activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities.
- A prov:Agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
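To make the class constraints concrete, here is a minimal Python sketch (not part of any SOSO tooling) that checks whether the target of each PROV property discussed in this thread explicitly declares the class that PROV-O gives as the property's range. The function names and the sample record are hypothetical, and the check only looks at inline `@type` values; it does no RDFS or OWL inference.

```python
# Range class of each PROV-O property mentioned in this thread.
PROV_RANGE = {
    "prov:wasDerivedFrom": "prov:Entity",
    "prov:wasGeneratedBy": "prov:Activity",
    "prov:wasAttributedTo": "prov:Agent",
    "prov:used": "prov:Entity",
}


def as_list(value):
    """JSON-LD values may be a single item or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def range_mismatches(node):
    """Return (property, target @id) pairs where the target node does not
    explicitly declare the PROV class the property expects."""
    problems = []
    for prop, expected in PROV_RANGE.items():
        for target in as_list(node.get(prop, [])):
            if not isinstance(target, dict):
                continue  # bare IRI string: no inline @type to inspect
            if expected not in as_list(target.get("@type", [])):
                problems.append((prop, target.get("@id")))
    return problems


# Hypothetical record: the project is typed only as ResearchProject.
dataset = {
    "@type": "Dataset",
    "prov:wasGeneratedBy": {
        "@type": "ResearchProject",
        "@id": "spase://SMWG/Observatory/MMS",
    },
}
print(range_mismatches(dataset))
```

With this sample record the check reports the prov:wasGeneratedBy target, because it is not typed as a prov:Activity; adding "prov:Activity" to its `@type` list clears the report.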
In the SPASE model, all objects (NumericalData, Observatory, Instrument...) should be typed as prov:Entity.
So strictly speaking, provenance relations should be prov:wasDerivedFrom, as in:
{
"@context": [
"https://schema.org/",
{
"prov": "http://www.w3.org/ns/prov#"
}
],
"@id": "https://doi.org/10.concept/doi",
"@type": "Dataset",
...
"prov:wasDerivedFrom": [
{
"@type": ["ResearchProject", "prov:Entity"],
"@id": "spase://SMWG/Observatory/MMS",
"name": "MMS",
"url": "https://mms.gsfc.nasa.gov/"
},
...
],
...
}
If we want to use prov:wasGeneratedBy, we need to extend our "research project" with a prov:Activity typing. This makes some sense, since a prov:Activity must have a start time and a stop time, and uses a prov:Entity to conduct its activity. Here is what we should then do:
{
"@context": [
"https://schema.org/",
{
"prov": "http://www.w3.org/ns/prov#"
}
],
"@id": "https://doi.org/10.concept/doi",
"@type": "Dataset",
...
"prov:wasGeneratedBy": [
{
"@type": ["ResearchProject", "prov:Activity"],
"prov:used": {
"@type": ["Instrument", "prov:Entity"],
"@id": "spase://SMWG/Observatory/MMS",
"name": "MMS"
},
"url": "https://mms.gsfc.nasa.gov/",
"name": "MMS project"
},
...
],
...
}
I would not use prov:wasAttributedTo since it requires a prov:Agent (the actual team or person), but Prov ontology doesn't provide links from Agents to Activities.
Thanks @BaptisteCecconi. Your comment was very helpful. Of the two options you described, I lean towards the 'wasGeneratedBy' one, because using a different PROV relation from the one used between datasets helps programmatically distinguish datasets -> datasets (wasDerivedFrom) from missions/observatories -> datasets (wasGeneratedBy). The existing example in the dataset guidance shows this only for software, but the way you explained it here also makes sense. @mbjones Does this make sense to you? What do you think?
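As a rough illustration of that programmatic distinction, the Python sketch below reads source datasets from prov:wasDerivedFrom and observatories from the prov:used entities nested under prov:wasGeneratedBy. It assumes the nesting from the example above; the function names and the sample record are hypothetical, and the identifiers are the placeholders already used in this thread.

```python
def as_list(value):
    """Normalise a missing, single, or list JSON-LD value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def node_id(node):
    """Return the @id of an inline node, or the value itself for a bare IRI."""
    return node.get("@id") if isinstance(node, dict) else node


def split_provenance(record):
    """Separate dataset-to-dataset links from observatory-to-dataset links."""
    sources = [node_id(n) for n in as_list(record.get("prov:wasDerivedFrom"))]
    observatories = []
    for activity in as_list(record.get("prov:wasGeneratedBy")):
        if isinstance(activity, dict):
            observatories.extend(
                node_id(n) for n in as_list(activity.get("prov:used"))
            )
    return {"source_datasets": sources, "observatories": observatories}


# Hypothetical record combining both relations.
record = {
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "@type": "Dataset",
    "prov:wasDerivedFrom": {"@id": "https://doi.org/10.xxxx/Dataset-1"},
    "prov:wasGeneratedBy": [
        {
            "@type": ["ResearchProject", "prov:Activity"],
            "prov:used": {
                "@type": ["Instrument", "prov:Entity"],
                "@id": "spase://SMWG/Observatory/MMS",
            },
        }
    ],
}
print(split_provenance(record))
```

A harvester that also has to handle software provenance would need an extra test on the `@type` of each prov:used entity, since the existing software guidance uses the same property.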
@BaptisteCecconi I noticed you put the @id field for the mission inside the prov:used object, but the name and url for the mission are outside of that object. Was this on purpose? Should we add the identifier field inside the prov:used object where the @id field is, or outside where the name is? Why are there two different names for MMS in your example?
Here, @dr-shorthair mentioned "Note that SSN/SOSA has a fairly solid framework for this". If I (perhaps badly) apply those ideas to an observatory, would an observatory be a sosa:System?
{
"@context": [
"https://schema.org/",
{
"prov": "http://www.w3.org/ns/prov#",
"sosa": "http://www.w3.org/ns/sosa/"
}
],
"@id": "https://doi.org/10.concept/doi",
"@type": "Dataset",
...
"prov:wasGeneratedBy": [
{
"@type": ["ResearchProject", "prov:Activity"],
"prov:used": {
"@type": ["Instrument", "prov:Entity", "sosa:System"],
"@id": "https://hpde.io/SMWG/Observatory/MMS.html",
"name": "MMS"
},
"url": "https://mms.gsfc.nasa.gov/",
"name": "MMS project"
},
...
],
...
}
This validates in schema.org. Thoughts? @Kurokio helped with this.
"prov:wasGeneratedBy": [
{
"@type": ["ResearchProject", "prov:Activity"],
"prov:used":
{
"@id": "https://hpde.io/SMWG/Observatory/MMS.html",
"@type": ["ResearchProject", "prov:Entity", "sosa:System"],
"name": "MMS",
"identifier": "https://hpde.io/SMWG/Observatory/MMS.html",
"url": "https://hpde.io/SMWG/Observatory/MMS.html"
}
},
{
"@type": ["ResearchProject", "prov:Activity"],
"prov:used":
{
"@id": "https://hpde.io/SMWG/Observatory/MMS-4.html",
"@type": ["ResearchProject", "prov:Entity", "sosa:System"],
"name": "MMS-4",
"identifier": "https://hpde.io/SMWG/Observatory/MMS-4.html",
"url": "https://hpde.io/SMWG/Observatory/MMS-4.html"
}
}
]
If an 'Observatory' hosts multiple sensors, then it could be modeled as a sosa:System with subsystems, each of which is a sosa:Sensor.
Maybe better would be to see an observatory as a sosa:Platform, which hosts sensors as part of a sosa:Deployment.
See https://w3c.github.io/sdw-sosa-ssn/ssn/#Systems-and-their-Deployment-overview and https://w3c.github.io/sdw-sosa-ssn/ssn/#LocatedDeployment
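A minimal sketch of the Platform idea, written in Python so the resulting JSON-LD can be printed and inspected. It uses only sosa:Platform, sosa:Sensor, and sosa:hosts, and leaves the Deployment out. The helper name is hypothetical and the instrument identifier is a made-up placeholder, not a real SPASE resource ID.

```python
import json


def platform_node(platform_id, name, sensors):
    """Build a JSON-LD node for a platform that hosts a list of sensors.

    `sensors` is a list of (identifier, name) pairs.
    """
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sosa": "http://www.w3.org/ns/sosa/",
        },
        "@id": platform_id,
        "@type": "sosa:Platform",
        "name": name,
        "sosa:hosts": [
            {"@id": sensor_id, "@type": "sosa:Sensor", "name": sensor_name}
            for sensor_id, sensor_name in sensors
        ],
    }


# The instrument ID below is a placeholder invented for illustration.
mms = platform_node(
    "spase://SMWG/Observatory/MMS",
    "MMS",
    [("spase://SMWG/Instrument/EXAMPLE", "Example instrument")],
)
print(json.dumps(mms, indent=2))
```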
@rebeccaringuette This is up to us to define what belongs to the prov:Activity and to the prov:Entity. Your proposal to put everything in the entity is fine.
My comment would be to use the SPASE resource ID in the @id field, and define the identifier a little more specifically:
"@type": ["ResearchProject", "prov:Activity"],
"prov:used":
{
"@id": "spase://SMWG/Observatory/MMS",
"@type": ["ResearchProject", "prov:Entity", "sosa:System"],
"name": "MMS",
"identifier": {
"@type": "PropertyValue",
"propertyID": "SPASE resource ID",
"value": "spase://SMWG/Observatory/MMS",
"url": "https://hpde.io/SMWG/Observatory/MMS.html"
},
About the mapping with SOSA: this is fine, but SPASE also has spase:Observatory (which could easily be mapped to sosa:Platform) and spase:Instrument (which should map to sosa:System). The other components of SOSA (like sosa:Sensor) are finer grained than what we have in the SPASE information model.
I think the use of PROV properties should be avoided unless your instance of schema:Dataset is also an instance of dcat:Dataset, for which the use of PROV is encouraged. The examples above might validate, but I think this is just because the Google structured data tool ignores all non-schema.org properties.
@BaptisteCecconi Thanks. We have been advised to make sure the item in the @id field is a link, not just an identifier. We can definitely change the sosa:System to be sosa:Platform. That does make more sense.
@Kurokio @dr-shorthair Thanks for the guidance, it is very helpful when attempting to navigate a new system. We don't have separate descriptions of deployments in SPASE, just observatories/missions and instruments, but linking at least those will be useful.
@huberrob This is somewhat tricky. SPASE is used by more than the NASA data repository, but the datasets described in SPASE records should also be datasets by dcat standards, generally speaking, so we should be fine. I don't know of any exceptions.
About the content of the @id field: the JSON-LD specification says that @id values are node identifiers and should be IRIs. IRIs are an extension of URIs, whose syntax is well defined (see examples), and our SPASE resource identifiers are URIs.
I really don't want to be too pushy on this, but using the landing page URL (https://hpde.io/SMWG/Observatory/MMS.html) of the resource as a node reference (an @id) instead of its URI spase://SMWG/Observatory/MMS is a bad idea. The SPASE resource ID should not change, whereas the landing page URL in the https://hpde.io/ domain may change, and that URL is just one implementation of a SPASE resource landing page.
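One way to keep both pieces of information is to put the stable SPASE resource ID in @id and carry the landing page separately, as in the PropertyValue example earlier in this thread. The Python sketch below does that. The ID-to-URL mapping is only inferred from the two hpde.io examples in this thread and is an assumption, not a documented resolver; the function names are hypothetical.

```python
SPASE_PREFIX = "spase://"
# One current landing-page host mentioned in this thread; it may change.
HPDE_BASE = "https://hpde.io/"


def landing_page(spase_id, base=HPDE_BASE):
    """Derive a landing page URL from a SPASE resource ID (assumed pattern)."""
    if not spase_id.startswith(SPASE_PREFIX):
        raise ValueError(f"not a SPASE resource ID: {spase_id!r}")
    return f"{base}{spase_id[len(SPASE_PREFIX):]}.html"


def observatory_node(spase_id, name):
    """Build a node whose @id is the stable SPASE ID, with the landing page
    recorded inside the identifier rather than used as the node reference."""
    return {
        "@id": spase_id,
        "@type": ["ResearchProject", "prov:Entity"],
        "name": name,
        "identifier": {
            "@type": "PropertyValue",
            "propertyID": "SPASE resource ID",
            "value": spase_id,
            "url": landing_page(spase_id),
        },
    }


node = observatory_node("spase://SMWG/Observatory/MMS", "MMS")
print(node["@id"], node["identifier"]["url"])
```

If the landing page host changes, only the `base` argument changes; every @id in already published records stays the same.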
I agree with @BaptisteCecconi: the @id should identify a node in a graph. This node is interpreted variously as a thing in the world (some dataset), or an information object in a computer system that is about that thing (e.g. a schema.org JSON-LD object). The expectation is that the URI for the node will dereference to obtain the JSON-LD object. See https://github.com/Cross-Domain-Interoperability-Framework/Discovery/issues/13
Maybe a way to approach the original problem ('reference an observatory in a dataset record') would be to write down the intended information in narrative form, like:
The dataset was produced at observatory (or platform, system..etc) LL using instrument x. Instrument X acquires data using sensor Y. The instrument was calibrated by agent Z using procedure A. Data from the instrument was processed using software X, operated by Person Y. The data were acquired under the auspices of project AA, funded by B, C, and D, with PI person ZZ.
Regarding the @id, it is of course OK to use an IRI as an identifier within the graph (I came here from another thread, #225, which is more Dataset-centric). Isn't there a standard resolver for SPASE identifiers, similar to handle.net or doi.org?