add and update citation and provenance metadata to DOI registrations
When we have datasets that have known provenance and citation relationships, we should update our DataCite Kernel metadata with that information so that it is visible in EventData APIs. Guidance on how to do so is here in the Event Data Guide.
One source of information for these relationships is our provenance graph in the ORE. Another is the set of citation fields in EML (esp. EML 2.2) and ISO metadata. A third would be EML 2.2 semantic annotations. Let's discuss how we could and should draw from all of these.
This task is an extension of #1267, and related to #1083.
Just to give some context on how this came up from the Whole Tale side (and what we'd like to accomplish).
If a Whole Tale user uses a DataONE dataset in their Tale, which they then publish to DataONE, the dataset should get credit for being used (via a citation count increase).
I prefer the EML method of citing another data source over the provenance graph (I'm not sure if the prov graph allows editing this field, but I think it would be best to restrict the user). @taojing2002 we can discuss this over slack, or over the next dev call now that we have issues made for this.
Yeah, it is good to discuss it on the dev meeting. Thanks.
We discussed this feature in the 04/04/19 dev call. The notes regarding this issue can be found here.
Summary (from notes)
We'll provide citation information in the ORE as well as the EML document. The first hurdles are deciding which vocabularies to use in the ORE, and how this will look in the EML. We'll additionally want to consider a user-facing component that gives users access to this feature (possibly in the EML editor).
Additional questions that need to be investigated include
- Do packages inherit citations?
- How do we
Possible Next Steps
- Investigate using cito in the ORE
- Create a sample EML document with a package citation
Great that this is in the ORE and EML, but it also needs to get updated at DataCite with a new DataCite kernel metadata doc as described in #1267.
DataCite considers IsDerivedFrom as one of its supported citation types. Right now, they say whether each of the following relationship types gets counted as a citation or not:
-- Include --
IsCitedBy
Cites
IsSupplementTo
IsSupplementedBy
Describes
IsDescribedBy
HasMetadata
IsMetadataFor
IsReferencedBy
References
IsDocumentedBy
Documents
IsCompiledBy
Compiles
IsReviewedBy
Reviews
IsDerivedFrom
IsSourceOf
IsRequiredBy
Requires
-- Exclude --
IsContinuedBy
Continues
HasVersion
IsVersionOf
IsNewVersionOf
IsPreviousVersionOf
IsPartOf
HasPart
IsVariantFormOf
IsOriginalFormOf
IsIdenticalTo
IsObsoletedBy (added in schema 4.2 this month)
Obsoletes (added in schema 4.2 this month)
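To make the include/exclude split above easier to apply when building metadata, here is a minimal Python sketch of the two lists as a lookup. The names `COUNTED_AS_CITATION`, `NOT_COUNTED`, and `counts_as_citation` are made up for illustration and are not part of Metacat or any DataCite library; the contents are copied from the lists in this comment and would need updating if DataCite changes its rules.

```python
# Relation types DataCite counts as citations (the "Include" list above).
COUNTED_AS_CITATION = frozenset({
    "IsCitedBy", "Cites", "IsSupplementTo", "IsSupplementedBy",
    "Describes", "IsDescribedBy", "HasMetadata", "IsMetadataFor",
    "IsReferencedBy", "References", "IsDocumentedBy", "Documents",
    "IsCompiledBy", "Compiles", "IsReviewedBy", "Reviews",
    "IsDerivedFrom", "IsSourceOf", "IsRequiredBy", "Requires",
})

# Relation types DataCite does not count (the "Exclude" list above).
NOT_COUNTED = frozenset({
    "IsContinuedBy", "Continues", "HasVersion", "IsVersionOf",
    "IsNewVersionOf", "IsPreviousVersionOf", "IsPartOf", "HasPart",
    "IsVariantFormOf", "IsOriginalFormOf", "IsIdenticalTo",
    "IsObsoletedBy", "Obsoletes",
})


def counts_as_citation(relation_type: str) -> bool:
    """Return True if the relation type is in the include list.

    Raises ValueError for a type that appears in neither list, so that a
    typo is not silently treated as "not a citation".
    """
    if relation_type in COUNTED_AS_CITATION:
        return True
    if relation_type in NOT_COUNTED:
        return False
    raise ValueError(f"Unknown relation type: {relation_type!r}")
```

For example, `counts_as_citation("IsDerivedFrom")` is true while `counts_as_citation("HasPart")` is false.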
So, when updating our DOIs, I would argue that we should use the following properties in our DataCite XML:
- IsCitedBy/Cites: if we know of an article or report, etc. that cites the data
- IsDerivedFrom/IsSourceOf: for when one dataset is created with input from another (our classic provenance case)
We can certainly include others, but those two are key and probably shouldn't be conflated. Rushi says that all of the Crossref reported citations use references, but that is less-well defined than the others. Here are the relationship type defs from the DataCite spec:
- IsDerivedFrom
- indicates B is a source upon which A is based
- IsDerivedFrom should be used for a resource that is a derivative of an original resource. In this example, the dataset is derived from a larger dataset and data values have been manipulated from their original state.
- IsCitedBy
- indicates that B includes A in a citation
- IsReferencedBy
- indicates A is used as a source of information by B
See the full set of definitions on page 46 of the DataCite metadata specification.
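For reference, these relationships go into the DataCite kernel as `relatedIdentifier` elements with `relatedIdentifierType` and `relationType` attributes. Below is a rough Python sketch of generating that fragment. It is only an illustration of the shape of the XML, not how DataCiteMetadataFactory does or should build it: the function name `related_identifiers` is hypothetical, the DOIs are placeholders, and the DataCite XML namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET


def related_identifiers(relations):
    """Build a <relatedIdentifiers> element.

    `relations` is an iterable of (relation_type, doi) pairs, e.g.
    ("IsCitedBy", "10.1234/example-article"). Namespace handling is
    intentionally omitted in this sketch.
    """
    parent = ET.Element("relatedIdentifiers")
    for relation_type, doi in relations:
        child = ET.SubElement(
            parent,
            "relatedIdentifier",
            {"relatedIdentifierType": "DOI", "relationType": relation_type},
        )
        child.text = doi
    return parent


# Placeholder DOIs, for illustration only.
example = related_identifiers([
    ("IsCitedBy", "10.1234/example-article"),
    ("IsDerivedFrom", "10.1234/example-source-dataset"),
])
xml_text = ET.tostring(example, encoding="unicode")
```

Keeping IsCitedBy and IsDerivedFrom as separate entries like this preserves the distinction between a publication citing the data and a dataset derived from it.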
After further discussion with @rushirajnenuji , the preferred approach seems to be to query the DataONE metrics service to get a list of known citations for a dataset. Between that and the provenance graph that we already have in the ORE, we should be able to report all of IsCitedBy, IsDerivedFrom, IsSourceOf, and possibly IsReferencedBy. Here's an example query from the metrics service that gets the citation info for the specific PID doi:10.18739/A2CZ3244X:
https://logproc-stage-ucsb-1.test.dataone.org/metrics?metricsRequest={%22metricsPage%22:{%22total%22:0,%22start%22:0,%22count%22:0},%22metrics%22:[%22citations%22],%22filterBy%22:[{%22filterType%22:%22dataset%22,%22values%22:[%22doi:10.18739/A2CZ3244X%22],%22interpretAs%22:%22list%22},{%22filterType%22:%22month%22,%22values%22:[%2201/01/2000%22,%2203/27/2020%22],%22interpretAs%22:%22range%22}],%22groupBy%22:[%22year%22]}
We don't really need the counts or monthly grouping, but they seem to be provided even though I only ask for "citations". If I omit the groupBy or the filterType for month, the query fails. We shouldn't be constraining the months to 2012, so that seems like a problem. @rushirajnenuji could you clarify what a shorter query would be?
This lookup and the provenance lookup are the key addition to DataCiteMetadataFactory to be able to register citations with DataCite.
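As a sketch of the lookup side, this is roughly how the request URL above could be assembled from a PID in Python. The endpoint and the request object are copied from the example query in this comment; the function name `citation_query_url` and its defaults are made up for illustration. The month range is kept only because the query currently fails without it. The response format isn't shown in this thread, so the sketch stops at building the URL and does not attempt to parse a result.

```python
import json
from urllib.parse import quote

# Staging endpoint taken from the example query above.
METRICS_URL = "https://logproc-stage-ucsb-1.test.dataone.org/metrics"


def citation_query_url(pid, start="01/01/2000", end="03/27/2020"):
    """Return a metrics service URL asking for citations of one dataset PID."""
    request = {
        "metricsPage": {"total": 0, "start": 0, "count": 0},
        "metrics": ["citations"],
        "filterBy": [
            {"filterType": "dataset", "values": [pid], "interpretAs": "list"},
            {"filterType": "month", "values": [start, end], "interpretAs": "range"},
        ],
        "groupBy": ["year"],
    }
    # Compact JSON, then percent-encode everything so it is safe in a query string.
    compact = json.dumps(request, separators=(",", ":"))
    return METRICS_URL + "?metricsRequest=" + quote(compact, safe="")


url = citation_query_url("doi:10.18739/A2CZ3244X")
```

This encodes more characters than the hand-written URL does (braces, colons, slashes), but it decodes to the same JSON request.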
Thank you for the write-up @mbjones . Currently, in the Metrics Service, we only support queries based on the MetacatUI use cases. One good thing is that we’ve designed the query object to be extensible. I agree that we do not need the groupBy and filterType objects for this Metacat query. Constraining queries to 2012 is definitely a problem. I'll get that fixed. For the Metrics Service lookup in DataCiteMetadataFactory, this is what I’m proposing:
{
  "metricsPage": {
    "total": 0,
    "start": 0,
    "count": 0
  },
  "metrics": [
    "citations"
  ],
  "filterBy": [
  ],
  "groupBy": [
  ]
}
Since we’re reporting it to DataCite, we can include a filterType object (name: identifier_type ?) to get only DOI citations from our Citations dataset:
"filterBy": [
  {
    "filterType": "identifier_type",
    "values": [
      "DOI"
    ],
    "interpretAs": "list"
  }
],
We'll have to add support for this query - but that should be quick. We also discussed opening up POST to enable MetacatUI (and MetacatUI users) to be able to register citations. Captured thoughts on this ticket: https://github.com/NCEAS/metacatui/issues/1245. I was wondering if we need to add support for provenance lookup in the Metrics Service to display those citations on MetacatUI.
@rushirajnenuji In my example above I used the following filter to limit the set to any PID associated with the dataset:
{"filterType":"dataset","values":["doi:10.18739/A2CZ3244X"],"interpretAs":"list"}
Does that work, or do we really need a new query type for identifier_type?