science-on-schema.org
Identifying the preferred from multiple identifiers
An instance of SO:Dataset (and other schema.org entities) may have more than one identifier.
What is the appropriate mechanism to indicate to consumers which identifier is preferred when referencing the dataset?
For example, the following provides no indication of a preferred identifier:
{
"@context":{
"@vocab":"https://schema.org/"
},
"@type":"Dataset",
"@id":"gid.01",
"identifier":[
{
"@id":"gid.02",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/ark",
"value":"ark:99999/ident/01"
},
{
"@id":"gid.03",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/doi",
"value":"doi:10.9999/ident/01"
}
]
}
Options for indicating preference include:
1. Using the @id of the identifier PropertyValue
Example, with "ark:99999/ident/01" preferred:
{
"@context":{
"@vocab":"https://schema.org/"
},
"@type":"Dataset",
"@id":"gid.01",
"identifier":[
{
"@id":"gid.02-preferred",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/ark",
"value":"ark:99999/ident/01"
},
{
"@id":"gid.03",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/doi",
"value":"doi:10.9999/ident/01"
}
]
}
- +ve Fairly obvious which identifier is preferred
- -ve Imposes semantics on the value of @id for an identifier, which seems a bit cumbersome
- -ve Not language agnostic
2. Using a disambiguatingDescription property in the PropertyValue
Example, with "ark:99999/ident/01" preferred:
{
"@context":{
"@vocab":"https://schema.org/"
},
"@type":"Dataset",
"@id":"gid.01",
"identifier":[
{
"@id":"gid.02",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/ark",
"value":"ark:99999/ident/01",
"disambiguatingDescription":"preferred"
},
{
"@id":"gid.03",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/doi",
"value":"doi:10.9999/ident/01"
}
]
}
- +ve Obvious which identifier is preferred
- -ve Not language agnostic
- -ve No controlled vocabulary
- -ve A bit verbose, adds "weight"
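A minimal sketch of how a consumer might read option 2, assuming the document is handled as plain JSON rather than through a full JSON-LD processor. The helper names (as_list, preferred_by_description) are hypothetical and only for illustration. The exact string comparison also shows why the lack of a controlled vocabulary matters: a producer writing "Preferred" or "préféré" would not match.

```python
def as_list(value):
    """JSON-LD allows a single value or an array; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def preferred_by_description(dataset, marker="preferred"):
    """Return values of identifiers whose disambiguatingDescription equals marker."""
    matches = []
    for ident in as_list(dataset.get("identifier")):
        # Simple string identifiers carry no description, so they never match.
        if isinstance(ident, dict) and ident.get("disambiguatingDescription") == marker:
            matches.append(ident.get("value"))
    return matches


dataset = {
    "@type": "Dataset",
    "identifier": [
        {
            "@type": "PropertyValue",
            "value": "ark:99999/ident/01",
            "disambiguatingDescription": "preferred",
        },
        {"@type": "PropertyValue", "value": "doi:10.9999/ident/01"},
    ],
}

print(preferred_by_description(dataset))
```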
3. Values of sameAs property are not preferred
Example, with "ark:99999/ident/01" preferred:
{
"@context":{
"@vocab":"https://schema.org/"
},
"@type":"Dataset",
"@id":"gid.01",
"identifier":[
{
"@id":"gid.02",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/ark",
"value":"ark:99999/ident/01"
},
{
"@id":"gid.03",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/doi",
"value":"doi:10.9999/ident/01"
}
],
"sameAs":"doi:10.9999/ident/01"
}
- +ve Simple to implement and process
- +ve Works with identifiers expressed as simple strings or PropertyValue
- -ve Potential confusion over which identifier should be the value
- -ve The description of sameAs indicates values should be a URL, which may not align with identifier values
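A rough sketch of the selection rule implied by option 3, again assuming plain-JSON handling; the function names are hypothetical. Any identifier whose value also appears in sameAs is treated as not preferred. With more than two identifiers the result can contain several candidates, which illustrates the ambiguity noted above.

```python
def as_list(value):
    """Normalise a single value, an array, or a missing value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def identifier_value(ident):
    """Accept either a PropertyValue node or a simple string identifier."""
    return ident.get("value") if isinstance(ident, dict) else ident


def preferred_excluding_same_as(dataset):
    """Return identifier values that are not listed under sameAs."""
    same_as = set(as_list(dataset.get("sameAs")))
    values = [identifier_value(i) for i in as_list(dataset.get("identifier"))]
    return [v for v in values if v not in same_as]


dataset = {
    "@type": "Dataset",
    "identifier": [
        {"@type": "PropertyValue", "value": "ark:99999/ident/01"},
        {"@type": "PropertyValue", "value": "doi:10.9999/ident/01"},
    ],
    "sameAs": "doi:10.9999/ident/01",
}

print(preferred_excluding_same_as(dataset))
```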
4. The first identifier in a list is preferred, if the @container is set to @list
{
"@context":{
"@vocab":"https://schema.org/",
"identifier": {
"@container":"@list"
}
},
},
"@type":"Dataset",
"@id":"gid.01",
"identifier":[
{
"@id":"gid.02",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/ark",
"value":"ark:99999/ident/01"
},
{
"@id":"gid.03",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/doi",
"value":"doi:10.9999/ident/01"
}
]
}
or
{
"@context":{
"@vocab":"https://schema.org/",
"identifier": {
"@container":"@list"
}
},
"@type":"Dataset",
"@id":"gid.01",
"identifier":[
"ark:99999/ident/01",
"doi:10.9999/ident/01"
]
}
- +ve Easy to implement and process
- +ve Works well with any style of identifier (simple string or PropertyValue)
- -ve It is arbitrary to suggest that the first entry is preferred
- -ve Requires specifying @list or order may not be preserved in processors
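A small sketch of how a consumer might apply option 4 without a JSON-LD library. It only trusts the order when the inline @context declares identifier with "@container": "@list"; otherwise it returns None, since order is not guaranteed. The function names are hypothetical, and remote or nested contexts are deliberately not handled.

```python
import json

DOC = """
{
  "@context": {
    "@vocab": "https://schema.org/",
    "identifier": {"@container": "@list"}
  },
  "@type": "Dataset",
  "@id": "gid.01",
  "identifier": ["ark:99999/ident/01", "doi:10.9999/ident/01"]
}
"""


def declares_list(context, term="identifier"):
    """True if an inline context defines the term with an @list container."""
    contexts = context if isinstance(context, list) else [context]
    for ctx in contexts:
        if isinstance(ctx, dict):
            definition = ctx.get(term)
            if isinstance(definition, dict) and definition.get("@container") == "@list":
                return True
    return False


def first_identifier(dataset):
    """Return the first identifier value, but only if order is declared significant."""
    if not declares_list(dataset.get("@context", {})):
        return None
    identifiers = dataset.get("identifier") or []
    if not isinstance(identifiers, list):
        identifiers = [identifiers]
    if not identifiers:
        return None
    first = identifiers[0]
    return first.get("value") if isinstance(first, dict) else first


print(first_identifier(json.loads(DOC)))
```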
Fantastic summary, @datadavev -- thanks for writing it up. I agree with all of your pro/cons assessments. None of the options seems particularly compelling to me, as they all have annoying aspects, so I am not sure which of them I would want to support. I guess just picking one is better than leaving things to chance. It would be nice to have a positive indication of the preference. Some more pros/cons:
- Option 1:
  - -ve: Overloads @id and may prevent use of existing LOD URIs that need to be expressed
- Option 3:
  - -ve: Lacks positive indication of the preference
  - -ve: Semantics of preference is ambiguous
Given Dave's pros/cons plus these, I guess I prefer option 4, but could easily be swayed otherwise, and I would still like to find a better solution that is more obvious and a 'clean' implementation. Maybe another alternative 5 will arise here in the issue discussion.
Also, are we targeting milestone 1.2 or 1.3 with this decision? I'm going to assume 1.3 given the timing of the ESIP winter meeting... but feel free to change it if someone wants to get this pushed into 1.2.
We sometimes use a datatype on an identifier literal to assist in disambiguation of multiple identifiers. For example:
{
"@id" : "my:Resource99",
"@type" : "my:Class88",
"dcterms:identifier" : [
{
"@type" : "my:typeA",
"@value" : "id1"
},
{
"@type" : "my:typeB",
"@value" : "id2"
}
],
"@context" : {
... etc ...
}
}
Thanks @dr-shorthair, great suggestion. I note that you used dcterms:identifier and assigned a value type to the node value, while we are discussing how to distinguish instances of schema:identifier. The range of schema:identifier is one of schema:PropertyValue, schema:URL and schema:Text, so we have recommended that people use the node type schema:PropertyValue. But I like your idea of using a @type to indicate which is preferred. So, some more options: we could do that either by 1) adding another node type, or 2) subclassing schema:PropertyValue to ex:PreferredPropertyValue and recommending that. Here are those two options:
5. Using another type for the identifier in addition to PropertyValue
Example, with "ark:99999/ident/01" preferred:
{
"@context":{
"@vocab":"https://schema.org/"
},
"@type":"Dataset",
"@id":"gid.01",
"identifier":[
{
"@id":"gid.02",
"@type": ["PropertyValue", "ex:PreferredIdentifier"],
"propertyID":"https://registry.identifiers.org/registry/ark",
"value":"ark:99999/ident/01"
},
{
"@id":"gid.03",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/doi",
"value":"doi:10.9999/ident/01"
}
]
}
- +ve Obvious which identifier is preferred
- +ve Language agnostic
- +ve Controlled vocabulary
- -ve Uses a class not in schema.org
6. Using a subclass of PropertyValue for the identifier
In this case, assume a vocabulary in which ex:PreferredIdentifier is defined as a subclass of schema:PropertyValue.
Example, with "ark:99999/ident/01" preferred:
{
"@context":{
"@vocab":"https://schema.org/"
},
"@type":"Dataset",
"@id":"gid.01",
"identifier":[
{
"@id":"gid.02",
"@type": ["ex:PreferredIdentifier"],
"propertyID":"https://registry.identifiers.org/registry/ark",
"value":"ark:99999/ident/01"
},
{
"@id":"gid.03",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/doi",
"value":"doi:10.9999/ident/01"
}
]
}
- +ve Obvious which identifier is preferred
- +ve Language agnostic
- +ve Controlled vocabulary
- -ve Uses a class not in schema.org
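Both type-based variants (options 5 and 6) can be read by the same consumer logic. The sketch below assumes plain-JSON handling and compares the compact name "ex:PreferredIdentifier" literally; that class and the ex: prefix are hypothetical, as in the examples above, and a robust consumer would expand the document with a JSON-LD processor before comparing IRIs.

```python
PREFERRED_TYPE = "ex:PreferredIdentifier"  # hypothetical class from the discussion


def as_list(value):
    """Normalise a single value, an array, or a missing value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def preferred_by_type(dataset, preferred_type=PREFERRED_TYPE):
    """Return values of identifier nodes that carry the preferred type."""
    return [
        ident.get("value")
        for ident in as_list(dataset.get("identifier"))
        if isinstance(ident, dict) and preferred_type in as_list(ident.get("@type"))
    ]


# Option 5: the marker type is added alongside PropertyValue.
option5 = {
    "identifier": [
        {"@type": ["PropertyValue", "ex:PreferredIdentifier"], "value": "ark:99999/ident/01"},
        {"@type": "PropertyValue", "value": "doi:10.9999/ident/01"},
    ]
}

# Option 6: the marker type replaces PropertyValue (assumed to be its subclass).
option6 = {
    "identifier": [
        {"@type": ["ex:PreferredIdentifier"], "value": "ark:99999/ident/01"},
        {"@type": "PropertyValue", "value": "doi:10.9999/ident/01"},
    ]
}

print(preferred_by_type(option5), preferred_by_type(option6))
```

One practical difference: under option 6, a consumer that does not know the ex: vocabulary cannot tell that the preferred node is a PropertyValue at all, whereas option 5 keeps the schema.org type visible.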
(Of course it all looks a lot neater in Turtle than it does in JSON:
my:Resource99 a my:Class88 ;
dcterms:identifier "id1"^^my:typeA ;
dcterms:identifier "id2"^^my:typeB ;
.
)
It's an interesting suggestion. Creating a new class, with the management overhead that entails, feels a little heavy for this case. It does raise a couple of questions for me: Is preference a property of the identifier or of the Thing to which it is assigned? Can we enforce cardinality, so there is exactly one instance of ex:PreferredIdentifier? Would an instance of ex:PreferredIdentifier be used elsewhere and possibly create confusion with more than one preferred instance?
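On the cardinality question: schema.org itself does not define cardinality constraints, so any "exactly one preferred" rule would have to be checked by consumers or a separate validation step. A minimal sketch of such a check follows; the function names and the three result labels are hypothetical, and the predicate can be swapped for whichever marking convention is adopted.

```python
def as_list(value):
    """Normalise a single value, an array, or a missing value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def check_preferred_cardinality(dataset, is_preferred):
    """Classify a dataset as 'ok', 'missing', or 'ambiguous' for preferred identifiers."""
    count = sum(
        1
        for ident in as_list(dataset.get("identifier"))
        if isinstance(ident, dict) and is_preferred(ident)
    )
    if count == 1:
        return "ok"
    return "missing" if count == 0 else "ambiguous"


def has_preferred_type(ident):
    """Predicate for the type-based convention (hypothetical ex:PreferredIdentifier)."""
    return "ex:PreferredIdentifier" in as_list(ident.get("@type"))


two_preferred = {
    "identifier": [
        {"@type": ["PropertyValue", "ex:PreferredIdentifier"], "value": "ark:99999/ident/01"},
        {"@type": ["PropertyValue", "ex:PreferredIdentifier"], "value": "doi:10.9999/ident/01"},
    ]
}

print(check_preferred_cardinality(two_preferred, has_preferred_type))
```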
If going down the path of creating new classes or properties, I think my preference would be to add a property to the subject of the identifier (i.e. the Dataset in these examples) to indicate which identifier is preferred when there is more than one choice.
There is a schema.org ChooseAction construct that may be applicable. I think the example might be like the following, but I am not sure an action is even appropriate for this:
{
"@context":{
"@vocab":"https://schema.org/"
},
"@type":"Dataset",
"@id":"gid.01",
"identifier":[
{
"@id":"gid.02",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/ark",
"value":"ark:99999/ident/01"
},
{
"@id":"gid.03",
"@type":"PropertyValue",
"propertyID":"https://registry.identifiers.org/registry/doi",
"value":"doi:10.9999/ident/01"
}
],
"potentialAction":{
"@type":"ChooseAction",
"object":"gid.02",
"actionOption":"identifier"
}
}
Great to see all the lively discussion and input! Would it be helpful to define what we mean by 'preferred'? What triggers a preference, when or what triggers a change in preference, and how will preference be consumed/interpreted? Maybe a couple of competency questions could help us get to the solution faster and with more clarity.
So I can say with some confidence that in the Polar community, the issue is not so much which identifier is preferred; but rather which is the identifier from the authoritative or originating repository (since records are harvested, re-identified, etc. so much so that the repository that really has that data ends up not getting credit for it and its users - instead some other place gets credit because its identifier is the one that is cited and used).
Here is an example (not from schema.org) of the problem - yes it only has one identifier and the wrong identifier to boot; but it was the quickest example of the kind of problem the community needs resolved. We need to prevent this kind of problem with schema.org metadata. So yes the answer to this ticket is very related to #37
Excellent point @rduerr. When I first posed the question, my perspective was quite narrow with the goal being "simply" to determine which identifier should be presented in a user interface as being preferable for referencing the Dataset. Another aspect was consistency over subsequent harvests, since ordering of identifiers presented in a Dataset cannot be relied on without additional information.
These goals and those in #37 certainly seem related. Perhaps in place of "preferred" we should be indicating "authoritative". I'm not sure if it would be necessary to generalize further and specify the purpose or role of identifiers that appear attached to a Thing. "Authoritative" and "preferred" would seem to me at least to be synonymous, since the authoritative identifier should be preferred and promoted. Are there other roles for identifier that would benefit from further clarification? e.g. could / should a "citation" identifier (one used when citing the Dataset in a publication) be different from an "authoritative" identifier?
I definitely think "authoritative" is the term to use and it should be the one used in citations!!!
See also #135, should option 4 above apply.
This issue should be resolved with at least guidance in release 1.3 as it significantly impacts consumers that need to keep track of content containing multiple identifiers and promoting the correct one from a selection.
Here's another possible solution. Use schema:additionalType on the PropertyValue for an identifier:
7. Use additionalType
"identifier": [
{
"@type": "PropertyValue",
"additionalType":"https://soso.org/authoritativeidentifier",
"propertyID": "https://registry.identifiers.org/registry/ark",
"name": "ARK: 13030/c7833mx7t",
"value": "ark:13030/c7833mx7t",
"url": "https://n2t.net/ark:13030/c7833mx7t"
},
{
"@type": "PropertyValue",
"additionalType":"https://soso.org/alternateidentifier",
"propertyID": "https://registry.identifiers.org/registry/pubmed",
"name": "Pubmed ID #16333295",
"value": "pubmed:16333295",
"url": "http://www.ncbi.nlm.nih.gov/pubmed/16333295"
}
]
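A short sketch of how a harvester might consume option 7, assuming plain-JSON handling. The marker URI is copied from the example above and is not assumed to be a published vocabulary term; the function names are hypothetical.

```python
# URI taken from the example above; treated here as an opaque marker.
AUTHORITATIVE = "https://soso.org/authoritativeidentifier"


def as_list(value):
    """Normalise a single value, an array, or a missing value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def authoritative_identifier(identifiers, marker=AUTHORITATIVE):
    """Return the value of the first identifier whose additionalType includes marker."""
    for ident in as_list(identifiers):
        if isinstance(ident, dict) and marker in as_list(ident.get("additionalType")):
            return ident.get("value")
    return None


identifiers = [
    {
        "@type": "PropertyValue",
        "additionalType": "https://soso.org/authoritativeidentifier",
        "propertyID": "https://registry.identifiers.org/registry/ark",
        "value": "ark:13030/c7833mx7t",
    },
    {
        "@type": "PropertyValue",
        "additionalType": "https://soso.org/alternateidentifier",
        "propertyID": "https://registry.identifiers.org/registry/pubmed",
        "value": "pubmed:16333295",
    },
]

print(authoritative_identifier(identifiers))
```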
I note that this conversation of the "preferred" identifier has now converged with issue #37 about indicating the "authoritative" copy of a dataset when it exists in multiple locations (possibly under the same identifier, or, as described here, a different identifier). I think these two issues should be resolved in concert.
Discussion from telecon 2021-06-07. Use prov:hadPrimarySource to indicate the authoritative source. Should this be the propertyID (one of several?), or use some other property?
{
"@type": "PropertyValue",
"additionalType":"https://soso.org/alternateidentifier",
"propertyID": [
"https://registry.identifiers.org/registry/pubmed",
"http://www.w3.org/ns/prov#hadPrimarySource"
],
"name": "Pubmed ID #16333295",
"value": "pubmed:16333295",
"url": "http://www.ncbi.nlm.nih.gov/pubmed/16333295"
}
This is the definition of a primary source that underlies prov:hadPrimarySource:
A primary source for a topic refers to something produced by some agent with direct experience and knowledge
about the topic, at the time of the topic's study, without benefit from hindsight. Because of the directness of
primary sources, they 'speak for themselves' in ways that cannot be captured through the filter of secondary
sources. As such, it is important for secondary sources to reference those primary sources from which they
were derived, so that their reliability can be investigated. A primary source relation is a particular case of
derivation of secondary materials from their primary sources. It is recognized that the determination of
primary sources can be up to interpretation, and should be done according to conventions accepted
within the application's domain.
It is meant to be used as a predicate to relate a secondary source to the primary source from which it was derived, which may not be what we are trying to do here. In our case, the schema:Dataset has one or more identifier properties, and we are trying to state that one of those is authoritative for the purposes of citation and reference. But the Dataset is not really derived from the identifier as a secondary source. I think Steven's example says that the Dataset has an identifier which is a PropertyValue with a propertyID that is somehow pointing at a primary source, implying that the Dataset is a secondary source? I think it would be clearer to say that the Dataset itself has a primary identifier (:dataset1 prov:hadPrimarySource <pubmed:16333295>), but I don't think that makes sense either because it implies :dataset1 is a secondary source. I think we need a better property for this, or to use identifier order (with @list) or some other convention to establish which is authoritative.
To-do (ESIP Summer Meeting):
- Pick one and draft guidelines
- GOAL: in building a citation for a harvested record, the aggregator needs to decide which identifier is provided in that citation.
- Discuss multiple pathways at a cluster telecon
During the 2021 Summer meeting, we discussed how the preferred identifier is chosen to be used in so:citation elements on the DataCite schema.org entry. As promised, I asked DataCite about this, and Kristian Garza replied with:
ok, that’s clearer. so take for example https://api.datacite.org/dois/application/ld+json/10.4121/uuid:3e4aee44-0857-40e3-8492-1cb37ac1e189 Basically the embedded schema.org metadata will include any relatedIdentifier of the relationTypes “References”, “Cites” or “Documents” that were to be included in the same DOI’s metadata. So if the metadata includes no relatedIdentifiers of any of those relationTypes there would be no entry in the so:citation element
Notes from 8/26/2021 meeting:
- Proposing we not delay v1.3 for this issue
- Depends on #135
- Preferred might mean different things to different organizations. Is it the canonical identifier? If you're not the authority, can the harvester trust that the schema.org document maker, in stating a preference, knows the full chain of custody?