science-on-schema.org

Identifying the preferred from multiple identifiers

Open datadavev opened this issue 4 years ago • 19 comments

An instance of SO:Dataset (and other schema.org entities) may have more than one identifier.

What is the appropriate mechanism to indicate to consumers which identifier is preferred when referencing the dataset?

For example, the following provides no indication of a preferred identifier:

{
  "@context":{
    "@vocab":"https://schema.org/"
  },
  "@type":"Dataset",
  "@id":"gid.01",
  "identifier":[
    {
      "@id":"gid.02",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/ark",
      "value":"ark:99999/ident/01"
    },
    {
      "@id":"gid.03",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/doi",
      "value":"doi:10.9999/ident/01"
    }
  ]
}

Options for indicating preference include:

1. Using the @id of the identifier PropertyValue

Example, with "ark:99999/ident/01" preferred:

{
  "@context":{
    "@vocab":"https://schema.org/"
  },
  "@type":"Dataset",
  "@id":"gid.01",
  "identifier":[
    {
      "@id":"gid.02-preferred",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/ark",
      "value":"ark:99999/ident/01"
    },
    {
      "@id":"gid.03",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/doi",
      "value":"doi:10.9999/ident/01"
    }
  ]
}
  • +ve Fairly obvious which identifier is preferred
  • -ve Imposes semantics on the value of @id for an identifier, which seems a bit cumbersome
  • -ve Not language agnostic
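
A consumer could detect this convention with a simple suffix check. The following is only a minimal sketch in Python: the function name `preferred_by_id_suffix` is made up for illustration, the `-preferred` suffix is taken from the example above, and the document is walked as a plain dictionary rather than through a JSON-LD processor.

```python
def preferred_by_id_suffix(dataset, suffix="-preferred"):
    """Return values of identifiers whose @id ends with the given suffix."""
    identifiers = dataset.get("identifier", [])
    if isinstance(identifiers, dict):
        identifiers = [identifiers]
    return [
        ident.get("value")
        for ident in identifiers
        # Plain-string identifiers have no @id, so they can never match.
        if isinstance(ident, dict) and str(ident.get("@id", "")).endswith(suffix)
    ]


dataset = {
    "@type": "Dataset",
    "@id": "gid.01",
    "identifier": [
        {"@id": "gid.02-preferred", "@type": "PropertyValue",
         "value": "ark:99999/ident/01"},
        {"@id": "gid.03", "@type": "PropertyValue",
         "value": "doi:10.9999/ident/01"},
    ],
}

print(preferred_by_id_suffix(dataset))  # ['ark:99999/ident/01']
```

The check depends entirely on the spelling of the suffix, which is the "imposes semantics on @id" drawback in practice.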

2. Using a disambiguatingDescription property in the PropertyValue

Example, with "ark:99999/ident/01" preferred:

{
  "@context":{
    "@vocab":"https://schema.org/"
  },
  "@type":"Dataset",
  "@id":"gid.01",
  "identifier":[
    {
      "@id":"gid.02",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/ark",
      "value":"ark:99999/ident/01",
      "disambiguatingDescription":"preferred"
    },
    {
      "@id":"gid.03",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/doi",
      "value":"doi:10.9999/ident/01"
    }
  ]
}
  • +ve Obvious which identifier is preferred
  • -ve Not language agnostic
  • -ve No controlled vocabulary
  • -ve A bit verbose, adds "weight"
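
On the consumer side this might look like the sketch below. It assumes the marker string is exactly "preferred" (compared case-insensitively after trimming, which is my own choice and not something the option specifies), and `preferred_by_description` is a hypothetical helper name.

```python
def preferred_by_description(dataset, marker="preferred"):
    """Return values of identifiers whose disambiguatingDescription matches marker."""
    identifiers = dataset.get("identifier", [])
    if isinstance(identifiers, (str, dict)):
        identifiers = [identifiers]
    matches = []
    for ident in identifiers:
        if not isinstance(ident, dict):
            continue  # simple string identifiers cannot carry a description
        description = ident.get("disambiguatingDescription", "")
        # Normalising case and whitespace is an assumption; there is no
        # controlled vocabulary that says how the marker must be written.
        if isinstance(description, str) and description.strip().lower() == marker:
            matches.append(ident.get("value"))
    return matches


dataset = {
    "@type": "Dataset",
    "@id": "gid.01",
    "identifier": [
        {"@id": "gid.02", "@type": "PropertyValue",
         "value": "ark:99999/ident/01",
         "disambiguatingDescription": "preferred"},
        {"@id": "gid.03", "@type": "PropertyValue",
         "value": "doi:10.9999/ident/01"},
    ],
}

print(preferred_by_description(dataset))  # ['ark:99999/ident/01']
```

A publisher writing "bevorzugt" or "primary" would not match, which illustrates the language and vocabulary concerns listed above.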

3. Values of sameAs property are not preferred

Example, with "ark:99999/ident/01" preferred:

{
  "@context":{
    "@vocab":"https://schema.org/"
  },
  "@type":"Dataset",
  "@id":"gid.01",
  "identifier":[
    {
      "@id":"gid.02",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/ark",
      "value":"ark:99999/ident/01"
    },
    {
      "@id":"gid.03",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/doi",
      "value":"doi:10.9999/ident/01"
    }
  ],
  "sameAs":"doi:10.9999/ident/01"
}
  • +ve Simple to implement and process
  • +ve Works with identifiers expressed as simple strings or PropertyValue
  • -ve Potential confusion over which identifier should be the value
  • -ve The description of sameAs indicates values should be a URL, which may not align with identifier values
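
A rough Python sketch of how a consumer might apply this rule follows. The helper names are hypothetical, and the code assumes `sameAs` values are plain strings that match identifier values exactly, as in the example above.

```python
def _as_list(value):
    """Normalise a missing, single, or repeated JSON value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _identifier_value(identifier):
    """Accept either a PropertyValue-style dict or a simple string."""
    return identifier.get("value") if isinstance(identifier, dict) else identifier


def preferred_not_in_same_as(dataset):
    """Return identifier values that are NOT listed under sameAs."""
    demoted = {s for s in _as_list(dataset.get("sameAs")) if isinstance(s, str)}
    values = [_identifier_value(i) for i in _as_list(dataset.get("identifier"))]
    return [v for v in values if v not in demoted]


dataset = {
    "@type": "Dataset",
    "@id": "gid.01",
    "identifier": [
        {"@id": "gid.02", "@type": "PropertyValue",
         "value": "ark:99999/ident/01"},
        {"@id": "gid.03", "@type": "PropertyValue",
         "value": "doi:10.9999/ident/01"},
    ],
    "sameAs": "doi:10.9999/ident/01",
}

print(preferred_not_in_same_as(dataset))  # ['ark:99999/ident/01']
```

With three or more identifiers and only one listed in `sameAs`, the function returns several candidates, so the rule only gives a single answer when every non-preferred identifier is repeated in `sameAs`.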

4. The first identifier in a list is preferred, if the @container is set to @list

{
  "@context":{
    "@vocab":"https://schema.org/",
    "identifier": {
      "@container":"@list"
    }
  },
  "@type":"Dataset",
  "@id":"gid.01",
  "identifier":[
    {
      "@id":"gid.02",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/ark",
      "value":"ark:99999/ident/01"
    },
    {
      "@id":"gid.03",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/doi",
      "value":"doi:10.9999/ident/01"
    }
  ]
}

or

{
  "@context":{
    "@vocab":"https://schema.org/",
    "identifier": {
      "@container":"@list"
    }
  },
  "@type":"Dataset",
  "@id":"gid.01",
  "identifier":[
      "ark:99999/ident/01",
      "doi:10.9999/ident/01"
  ]
}
  • +ve Easy to implement and process
  • +ve Works well with any style of identifier (simple string or PropertyValue)
  • -ve It is arbitrary to suggest that the first entry is preferred
  • -ve Requires specifying @list or order may not be preserved in processors
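
To make the last point concrete: a JSON parser keeps array order, but a JSON-LD processor treats a plain array as an unordered set unless the term is declared with `"@container":"@list"`. The sketch below is only an illustration of a cautious consumer. It assumes an inline, dictionary-style `@context`, does not resolve remote or array contexts, and uses hypothetical function names.

```python
import json

DOCUMENT = """
{
  "@context": {
    "@vocab": "https://schema.org/",
    "identifier": {"@container": "@list"}
  },
  "@type": "Dataset",
  "@id": "gid.01",
  "identifier": ["ark:99999/ident/01", "doi:10.9999/ident/01"]
}
"""


def declares_ordered_identifiers(doc):
    """True if the inline context declares identifier as an ordered @list."""
    context = doc.get("@context")
    if not isinstance(context, dict):
        return False  # remote or array contexts are out of scope for this sketch
    term = context.get("identifier")
    return isinstance(term, dict) and term.get("@container") == "@list"


def first_identifier(doc):
    """Return the first identifier only when the order is declared meaningful."""
    if not declares_ordered_identifiers(doc):
        return None
    identifiers = doc.get("identifier") or []
    if not isinstance(identifiers, list) or not identifiers:
        return None
    first = identifiers[0]
    return first.get("value") if isinstance(first, dict) else first


doc = json.loads(DOCUMENT)
print(first_identifier(doc))  # ark:99999/ident/01
```

Returning `None` when `@list` is absent is a design choice made here so that an accidental ordering is never read as a stated preference.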

datadavev avatar Jan 06 '21 17:01 datadavev

Fantastic summary, @datadavev -- thanks for writing it up. I agree with all of your pro/cons assessments. None of the options seems particularly compelling to me, as they all have annoying aspects, so I am not sure which of them I would want to support. I guess just picking one is better than leaving things to chance. It would be nice to have a positive indication of the preference. Some more pros/cons:

  • Option 1:
    • -ve: Overloads @id and may prevent use of existing LOD URIs that need to be expressed
  • Option 3:
    • -ve: Lacks positive indication of the preference
    • -ve: Semantics of preference is ambiguous

Given Dave's pros/cons plus these, I guess I prefer option 4, but could easily be swayed otherwise, and I would still like to find a better solution that is more obvious and a 'clean' implementation. Maybe another alternative 5 will arise here in the issue discussion.

mbjones avatar Jan 06 '21 17:01 mbjones

Also, are we targeting milestone 1.2 or 1.3 with this decision? I'm going to assume 1.3 given the timing of the ESIP winter meeting... but feel free to change it if someone wants to get this pushed into 1.2.

mbjones avatar Jan 06 '21 18:01 mbjones

We sometimes use a datatype on an identifier literal to assist in disambiguation of multiple identifiers. For example

{
  "@id" : "my:Resource99",
  "@type" : "my:Class88",
  "dcterms:identifier" : [
    {
      "@type" : "my:typeA",
      "@value" : "id1"
    },
    {
      "@type" : "my:typeB",
      "@value" : "id2"
    }
  ],
  "@context" : {
... etc ...
  }
}

dr-shorthair avatar Jan 06 '21 23:01 dr-shorthair

Thanks @dr-shorthair, great suggestion. I note that you used dcterms:identifier and assigned a value type to the node value, while we are discussing how to distinguish instances of schema:identifier. The range of schema:identifier is one of schema:PropertyValue, schema:URL and schema:Text, so we have recommended that people use the node type schema:PropertyValue. But I like your idea of using a @type to indicate which is preferred. So, some more options... we could do that either by 1) adding another node type, or by 2) subclassing schema:PropertyValue as ex:PreferredIdentifier and recommending that. Here are those two options:

5. Using another type for the identifier in addition to PropertyValue

Example, with "ark:99999/ident/01" preferred:

{
  "@context":{
    "@vocab":"https://schema.org/"
  },
  "@type":"Dataset",
  "@id":"gid.01",
  "identifier":[
    {
      "@id":"gid.02",
      "@type": ["PropertyValue", "ex:PreferredIdentifier"],
      "propertyID":"https://registry.identifiers.org/registry/ark",
      "value":"ark:99999/ident/01"
    },
    {
      "@id":"gid.03",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/doi",
      "value":"doi:10.9999/ident/01"
    }
  ]
}
  • +ve Obvious which identifier is preferred
  • +ve Language agnostic
  • +ve Controlled vocabulary
  • -ve Uses a class not in schema.org

6. Using a subclass of PropertyValue for the identifier

In this case, assume a vocabulary in which ex:PreferredIdentifier is defined as a subclass of schema:PropertyValue.

Example, with "ark:99999/ident/01" preferred:

{
  "@context":{
    "@vocab":"https://schema.org/"
  },
  "@type":"Dataset",
  "@id":"gid.01",
  "identifier":[
    {
      "@id":"gid.02",
      "@type": ["ex:PreferredIdentifier"],
      "propertyID":"https://registry.identifiers.org/registry/ark",
      "value":"ark:99999/ident/01"
    },
    {
      "@id":"gid.03",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/doi",
      "value":"doi:10.9999/ident/01"
    }
  ]
}
  • +ve Obvious which identifier is preferred
  • +ve Language agnostic
  • +ve Controlled vocabulary
  • -ve Uses a class not in schema.org
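
For either of options 5 and 6, a consumer check could look roughly like the sketch below. `ex:PreferredIdentifier` is the hypothetical class from the examples and is compared as a literal string, with no prefix expansion or subclass inference. Raising on more than one match is my own assumption about how to handle cardinality.

```python
PREFERRED_TYPE = "ex:PreferredIdentifier"  # hypothetical class, not in schema.org


def has_type(node, type_name):
    """True if type_name appears in the node's @type (string or list)."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return type_name in types


def preferred_by_type(dataset):
    """Return the value of the single identifier typed as preferred, if any."""
    identifiers = dataset.get("identifier", [])
    if isinstance(identifiers, dict):
        identifiers = [identifiers]
    preferred = [
        ident for ident in identifiers
        if isinstance(ident, dict) and has_type(ident, PREFERRED_TYPE)
    ]
    if len(preferred) > 1:
        raise ValueError("more than one identifier is marked as preferred")
    return preferred[0].get("value") if preferred else None


dataset = {
    "@type": "Dataset",
    "@id": "gid.01",
    "identifier": [
        {"@id": "gid.02",
         "@type": ["PropertyValue", "ex:PreferredIdentifier"],
         "value": "ark:99999/ident/01"},
        {"@id": "gid.03", "@type": "PropertyValue",
         "value": "doi:10.9999/ident/01"},
    ],
}

print(preferred_by_type(dataset))  # ark:99999/ident/01
```

The same function handles option 6, where `@type` holds only the subclass, because it never requires `PropertyValue` to be present.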

mbjones avatar Jan 07 '21 03:01 mbjones

(Of course it all looks a lot neater in Turtle than it does in JSON

my:Resource99 a my:Class88 ;
  dcterms:identifier "id1"^^my:typeA ;
  dcterms:identifier "id2"^^my:typeB ;
.

)

dr-shorthair avatar Jan 07 '21 03:01 dr-shorthair

It's an interesting suggestion. Creating a new class, with the management burden that comes with it, feels a little heavy for this case. It does raise a couple of questions for me: Is preference a property of the identifier or of the Thing to which it is assigned? Can we enforce cardinality, so there is exactly one instance of ex:PreferredIdentifier? Would an instance of ex:PreferredIdentifier be used elsewhere and possibly create confusion with more than one preferred instance?

If going down the path of creating new classes or properties, I think my preference would be to add a property to the subject of the identifier (i.e. the Dataset in these examples) to indicate which identifier is preferred when there is more than one choice.
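
As a sketch of what that could mean for a consumer, suppose the Dataset carried a property pointing at the `@id` of one of its identifiers. Everything named here is hypothetical: `ex:preferredIdentifier` is not an existing schema.org property, and the resolution logic is just one way it might work.

```python
def resolve_preferred(dataset, prop="ex:preferredIdentifier"):
    """Follow a (hypothetical) dataset-level pointer to one of its identifiers."""
    ref = dataset.get(prop)
    if ref is None:
        return None
    if isinstance(ref, list):
        # Putting the property on the Dataset makes cardinality easy to check.
        if len(ref) != 1:
            raise ValueError("expected exactly one preferred identifier")
        ref = ref[0]
    target = ref["@id"] if isinstance(ref, dict) else ref
    for ident in dataset.get("identifier", []):
        if isinstance(ident, dict) and ident.get("@id") == target:
            return ident.get("value")
    raise ValueError(f"{prop} points at {target!r}, which is not among the identifiers")


dataset = {
    "@type": "Dataset",
    "@id": "gid.01",
    "identifier": [
        {"@id": "gid.02", "@type": "PropertyValue",
         "value": "ark:99999/ident/01"},
        {"@id": "gid.03", "@type": "PropertyValue",
         "value": "doi:10.9999/ident/01"},
    ],
    "ex:preferredIdentifier": {"@id": "gid.02"},  # hypothetical property
}

print(resolve_preferred(dataset))  # ark:99999/ident/01
```

Because the statement lives on the Dataset, the same PropertyValue node could be reused elsewhere without carrying a "preferred" flag along with it.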

datadavev avatar Jan 07 '21 14:01 datadavev

There is a schema.org ChooseAction construct that may be applicable. I think the example might look like the following, but I am not sure an action is even appropriate for this:

{
  "@context":{
    "@vocab":"https://schema.org/"
  },
  "@type":"Dataset",
  "@id":"gid.01",
  "identifier":[
    {
      "@id":"gid.02",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/ark",
      "value":"ark:99999/ident/01"
    },
    {
      "@id":"gid.03",
      "@type":"PropertyValue",
      "propertyID":"https://registry.identifiers.org/registry/doi",
      "value":"doi:10.9999/ident/01"
    }
  ],
  "potentialAction":{
    "@type":"ChooseAction",
    "object":"gid.02",
    "actionOption":"identifier"
  }
}

datadavev avatar Jan 07 '21 14:01 datadavev

Great to see all the lively discussion and input! Would it be helpful to define what we mean by 'preferred'? What triggers a preference, when or what triggers a change in preference, and how will preference be consumed/interpreted? Maybe a couple of competency questions could help us get to the solution faster and with more clarity?

ashepherd avatar Jan 07 '21 14:01 ashepherd

So I can say with some confidence that in the Polar community, the issue is not so much which identifier is preferred, but rather which identifier comes from the authoritative or originating repository. Records are harvested, re-identified, etc., so much so that the repository that really has the data ends up not getting credit for it and its users; instead some other place gets credit because its identifier is the one that is cited and used.

Here is an example (not from schema.org) of the problem - yes it only has one identifier and the wrong identifier to boot; but it was the quickest example of the kind of problem the community needs resolved. We need to prevent this kind of problem with schema.org metadata. So yes the answer to this ticket is very related to #37

image

rduerr avatar Jan 07 '21 18:01 rduerr

Excellent point @rduerr. When I first posed the question, my perspective was quite narrow, with the goal being "simply" to determine which identifier should be presented in a user interface as being preferable for referencing the Dataset. Another aspect was consistency over subsequent harvests, since the ordering of identifiers presented in a Dataset cannot be relied on without additional information.

These goals and those in #37 certainly seem related. Perhaps in place of "preferred" we should be indicating "authoritative". I'm not sure if it would be necessary to generalize further and specify the purpose or role of identifiers that appear attached to a Thing. "Authoritative" and "preferred" would seem to me at least to be synonymous, since the authoritative identifier should be preferred and promoted. Are there other roles for identifier that would benefit from further clarification? e.g. could / should a "citation" identifier (one used when citing the Dataset in a publication) be different from an "authoritative" identifier?

datadavev avatar Jan 07 '21 19:01 datadavev

I definitely think "authoritative" is the term to use and it should be the one used in citations!!!

rduerr avatar Jan 07 '21 20:01 rduerr

See also #135, should option 4 above apply.

This issue should be resolved with at least guidance in release 1.3 as it significantly impacts consumers that need to keep track of content containing multiple identifiers and promoting the correct one from a selection.

datadavev avatar Feb 01 '21 23:02 datadavev

Here's another possible solution: use schema:additionalType on the PropertyValue for an identifier.

7. Using additionalType

    "identifier": [
        {
            "@type": "PropertyValue",
            "additionalType":"https://soso.org/authoritativeidentifier",
            "propertyID": "https://registry.identifiers.org/registry/ark",
            "name": "ARK: 13030/c7833mx7t",
            "value": "ark:13030/c7833mx7t",
            "url": "https://n2t.net/ark:13030/c7833mx7t"
        },
        {
            "@type": "PropertyValue",
            "additionalType":"https://soso.org/alternateidentifier",
            "propertyID": "https://registry.identifiers.org/registry/pubmed",
            "name": "Pubmed ID #16333295",
            "value": "pubmed:16333295",
            "url": "http://www.ncbi.nlm.nih.gov/pubmed/16333295"
        }
]
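
A consumer could then filter on the role URL. The sketch below assumes the two `https://soso.org/...` URLs from the example are the agreed vocabulary (they are placeholders in this discussion, not published terms), and the function name is made up for illustration.

```python
AUTHORITATIVE = "https://soso.org/authoritativeidentifier"  # placeholder URL from the example


def identifiers_with_role(dataset, role=AUTHORITATIVE):
    """Return values of identifiers whose additionalType includes the given role."""
    identifiers = dataset.get("identifier", [])
    if isinstance(identifiers, dict):
        identifiers = [identifiers]
    result = []
    for ident in identifiers:
        if not isinstance(ident, dict):
            continue  # plain-string identifiers cannot state a role
        roles = ident.get("additionalType", [])
        if isinstance(roles, str):
            roles = [roles]
        if role in roles:
            result.append(ident.get("value"))
    return result


dataset = {
    "@type": "Dataset",
    "identifier": [
        {"@type": "PropertyValue",
         "additionalType": "https://soso.org/authoritativeidentifier",
         "propertyID": "https://registry.identifiers.org/registry/ark",
         "value": "ark:13030/c7833mx7t"},
        {"@type": "PropertyValue",
         "additionalType": "https://soso.org/alternateidentifier",
         "propertyID": "https://registry.identifiers.org/registry/pubmed",
         "value": "pubmed:16333295"},
    ],
}

print(identifiers_with_role(dataset))  # ['ark:13030/c7833mx7t']
```

Passing `role="https://soso.org/alternateidentifier"` returns the other identifier, so the same function covers any further roles (for example a citation role) if they were added to the vocabulary.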

smrgeoinfo avatar Feb 15 '21 20:02 smrgeoinfo

I note that this conversation of the "preferred" identifier has now converged with issue #37 about indicating the "authoritative" copy of a dataset when it exists in multiple locations (possibly under the same identifier, or, as described here, a different identifier). I think these two issues should be resolved in concert.

mbjones avatar May 12 '21 02:05 mbjones

Discussion from telecon 2021-06-07. Use prov:hadPrimarySource to indicate the authoritative source. Should this be the propertyID (one of several?), or use some other property?

    {
        "@type": "PropertyValue",
        "additionalType": "https://soso.org/alternateidentifier",
        "propertyID": [
            "https://registry.identifiers.org/registry/pubmed",
            "http://www.w3.org/ns/prov#hadPrimarySource"
        ],
        "name": "Pubmed ID #16333295",
        "value": "pubmed:16333295",
        "url": "http://www.ncbi.nlm.nih.gov/pubmed/16333295"
    }

smrgeoinfo avatar Jun 07 '21 22:06 smrgeoinfo

This is the definition of prov:hadPrimarySource:

A primary source for a topic refers to something produced by some agent with direct experience and knowledge 
about the topic, at the time of the topic's study, without benefit from hindsight. Because of the directness of 
primary sources, they 'speak for themselves' in ways that cannot be captured through the filter of secondary 
sources. As such, it is important for secondary sources to reference those primary sources from which they 
were derived, so that their reliability can be investigated. A primary source relation is a particular case of 
derivation of secondary materials from their primary sources. It is recognized that the determination of 
primary sources can be up to interpretation, and should be done according to conventions accepted 
within the application's domain.

It is meant to be used as a predicate to relate a secondary source to the primary source from which it was derived, which may not be what we are trying to do here. In our case, the schema:Dataset has one or more identifier properties, and we are trying to state that one of those is authoritative for the purposes of citation and reference. But the Dataset is not really derived from the identifier as a secondary source. I think Steven's example says that the Dataset has an identifier which is a PropertyValue with a propertyID that is somehow pointing at a primary source -- implying that the Dataset is a secondary source? I think it would be clearer to say that the Dataset itself has a primary identifier (:dataset1 prov:hadPrimarySource <pubmed:16333295>), but I don't think that makes sense either because it implies :dataset1 is a secondary source. I think we need a better property for this, or to use identifier order (with @list) or some other convention to establish which is authoritative.

mbjones avatar Jun 08 '21 01:06 mbjones

To-do (ESIP Summer Meeting):

  1. Pick one and draft guidelines

  2. GOAL: in building a citation for a harvested record, the aggregator needs to decide which identifier is provided in that citation.

  3. Discuss multiple pathways at a cluster telecon

ashepherd avatar Jul 23 '21 20:07 ashepherd

During the 2021 Summer meeting, we discussed how the preferred identifier is chosen to be used in so:citation elements on the DataCite schema.org entry. As promised, I asked DataCite about this, and Kristian Garza replied with:

ok, that’s clearer. so take for example https://api.datacite.org/dois/application/ld+json/10.4121/uuid:3e4aee44-0857-40e3-8492-1cb37ac1e189 Basically the embedded schema.org metadata will include any relatedIdentifier of the relationTypes “References”, “Cites” or “Documents” that were to be included in the same DOI’s metadata. So if the metadata includes no relatedIdentifiers of any of those relationTypes there would be no entry in the so:citation element.

mbjones avatar Jul 28 '21 16:07 mbjones

Notes from 8/26/2021 meeting:

  • Proposing we not delay v1.3 for this issue
  • Depends on #135
  • Preferred might mean different things to different organizations. Is it the canonical identifier? If you're not the authority, can the harvester trust that the maker of the schema.org document knows the full chain of custody when stating a preference?

ashepherd avatar Aug 28 '21 01:08 ashepherd