meta name="dc.identifier" content="https://dx.doi.org/..." yields unexpected equivalence result
Here is a set of equivalences created using `<meta name="dc.identifier" content="10.1000/ee9">` and `<meta name="dc.identifier" content="doi:10.1000/ee9">`
And here is the set created using `<meta name="dc.identifier" content="https://dx.doi.org/10.1000/ee9">`
In both cases, the dc.identifier value matches this server-side pattern:
https://github.com/hypothesis/h/blob/682764d8bbf46c9d8045162493b777484069fe57/h/util/document_claims.py#L28.
But the DOI-style URI generated in the second case doesn't match the one generated in the first case, and we end up with two disjoint sets of annotations.
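To make the mismatch concrete, here is a minimal sketch in Python. It is not the code in `document_claims.py`. `naive_doi_uri` is a hypothetical stand-in that only reproduces the behaviour described in this issue (prefix the value with `doi:` unless it already has one), and `normalized_doi_uri` is one possible normalization, assuming we would want the resolver-URL form to collapse to the bare DOI.

```python
import re

# Hypothetical pattern (an assumption, not taken from h): matches the
# http(s)://dx.doi.org/ or http(s)://doi.org/ resolver prefix.
_RESOLVER_PREFIX = re.compile(r"^https?://(?:dx\.)?doi\.org/", re.IGNORECASE)


def naive_doi_uri(value):
    """Stand-in for the behaviour described above: add "doi:" if missing."""
    value = value.strip()
    return value if value.startswith("doi:") else "doi:" + value


def normalized_doi_uri(value):
    """Sketch of a normalization that strips the scheme and resolver prefix."""
    value = value.strip()
    # Drop an existing "doi:" scheme so it is not duplicated later.
    if value.lower().startswith("doi:"):
        value = value[len("doi:"):]
    # Drop the resolver URL prefix, leaving the bare DOI.
    value = _RESOLVER_PREFIX.sub("", value)
    return "doi:" + value


values = [
    "10.1000/ee9",
    "doi:10.1000/ee9",
    "https://dx.doi.org/10.1000/ee9",
]
naive = {naive_doi_uri(v) for v in values}            # two distinct URIs
normalized = {normalized_doi_uri(v) for v in values}  # one URI
```

With the naive form, the three tags produce two different URIs, which is the split described above. With the normalized form they all collapse to a single `doi:` URI. Whether existing `document_uri` rows should be migrated to match is a separate question.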
We currently have ~15K `document_uri` records like `doi:10.1000/...` and ~2K like `doi:http(s)://dx.doi.org/10.1000/...`.
This likely isn't much of a problem because most publishers asserting DOIs use both the Highwire and DC syntaxes. But it's something to be aware of.