prevent ongoing document equivalence explosions
The following URL was assigned in a course:
https://www.npr.org/sections/health-shots/2010/08/11/129127432/paul-longmore-historian-and-advocate-for-disabled-dies
The course dashboard displayed this list of annotated documents:
These are all the "same" document, with document_id 22734. Most users won't (yet) be seeing this dashboard view, but for everyone our client will load and try to anchor ~2500 annotations, the vast majority of which do not apply to this article.
This happens because:
- The docs share this common metadata declaration: {href: "https://feeds.npr.org/feeds/103537970/feed.json", rel: "alternate", type: "application/json"}
- We formerly used rel="alternate" to form equivalences, until we realized that this pattern -- a feed URL that's common to a whole class of documents on a site -- was causing an explosion of unwanted equivalence.
If nobody had ever annotated an NPR health article containing https://feeds.npr.org/feeds/103537970/feed.json as rel="alternate", I don't think we'd be seeing this problem, because the linking records in document_uri would not exist. But they do, and rewriting the database to untangle things isn't an option because we merge equivalences.
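To make the mechanism concrete, here is a minimal Python sketch of how one shared URI can collapse unrelated pages into a single document. It is a toy model, not the actual merge logic in h: the group_documents function, the shape of its input, and the example.com article URLs are all assumptions made for illustration.

```python
from collections import defaultdict


def group_documents(pages):
    """Group page URLs into equivalence classes (toy model, hypothetical names).

    `pages` maps a page URL to the list of URIs it claims via link elements.
    Two pages land in the same group when they share any claimed URI.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for url, claimed in pages.items():
        find(url)  # register pages that claim nothing
        for uri in claimed:
            union(url, uri)

    groups = defaultdict(set)
    for url in pages:
        groups[find(url)].add(url)
    return sorted(sorted(group) for group in groups.values())


FEED = "https://feeds.npr.org/feeds/103537970/feed.json"
pages = {
    "https://example.com/article-a": [FEED],  # hypothetical article URLs
    "https://example.com/article-b": [FEED],
    "https://example.com/article-c": [],
}
groups = group_documents(pages)
print(groups)
```

In this model, article-a and article-b end up in one group purely because both declare the feed URL, while article-c stays on its own. Every further article that declares the same feed would join the first group.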
But can we at least prevent creating new false equivalences, so that the next previously-unannotated NPR health article doesn't land in the same bucket?
One way to do that would be to stop the client from sending link elements except for rel="canonical".
Currently, the client allows alternate, canonical, bookmark, and shortlink.
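The client-side change amounts to narrowing an allowlist. The client itself is JavaScript, so the following is only a Python sketch of the idea. The filter_links helper, the constant names, and the example.com canonical URL are hypothetical; the link dictionaries follow the shape of the metadata declaration quoted above.

```python
# The four rel values allowed today, and the proposed narrower list.
ALLOWED_RELS_CURRENT = ["alternate", "canonical", "bookmark", "shortlink"]
ALLOWED_RELS_PROPOSED = ["canonical"]


def filter_links(links, allowed_rels):
    """Keep only the link entries whose rel is in the allowlist."""
    return [link for link in links if link.get("rel") in allowed_rels]


# Links as they might be scraped from a page (the canonical URL is made up).
links = [
    {
        "href": "https://feeds.npr.org/feeds/103537970/feed.json",
        "rel": "alternate",
        "type": "application/json",
    },
    {"href": "https://example.com/article", "rel": "canonical"},
]

sent_now = filter_links(links, ALLOWED_RELS_CURRENT)
sent_proposed = filter_links(links, ALLOWED_RELS_PROPOSED)
print(len(sent_now), len(sent_proposed))
```

With the narrower list, the shared feed URL never leaves the client, so it cannot be used to link the new page to the existing bucket.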
Consider two previously unannotated articles that use <link rel="alternate" type="application/rss+xml" title="Health" href="https://feeds.npr.org/1128/rss.xml">:
1 https://www.npr.org/2020/09/11/911828384/trump-says-he-downplayed-coronavirus-threat-in-u-s-to-avert-panic
2 https://www.npr.org/sections/health-shots/2020/09/11/911885577/as-covid-19-vaccine-trials-move-at-warp-speed-recruiting-black-volunteers-takes-
Annotating 1 lands it in document_id 22734, merged with all the others.
Now I'll annotate 2 using a client that shortens the array of allowed link elements from ['alternate', 'canonical', 'bookmark', 'shortlink'] to ['canonical']. We get a fresh document_id, 1125647.
An alternative solution would be to keep sending these links and stop the equivalence on the back end. The client-side solution is easiest for me to demo, so that's what I'm illustrating here.
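For comparison, the back-end variant might look roughly like the Python sketch below: every link the client sends is still accepted, but only some rel types are allowed to contribute to equivalence. This is an assumption about how such a rule could be written, not h's actual code. EQUIVALENCE_RELS, equivalence_uris, and the example.com URLs are hypothetical names chosen for the illustration.

```python
# Hypothetical rule: only these rel types may link a page to other documents.
EQUIVALENCE_RELS = {"canonical"}


def equivalence_uris(page_url, links):
    """Return the URIs that may be used to find or merge documents for a page.

    All links can still be stored elsewhere; this only decides which of them
    count when looking for an existing document to join.
    """
    uris = {page_url}
    for link in links:
        if link.get("rel") in EQUIVALENCE_RELS:
            uris.add(link["href"])
    return uris


feed = {
    "href": "https://feeds.npr.org/1128/rss.xml",
    "rel": "alternate",
    "type": "application/rss+xml",
}

# Two made-up pages that both declare the same feed as rel="alternate".
uris_a = equivalence_uris("https://example.com/article-a", [feed])
uris_b = equivalence_uris("https://example.com/article-b", [feed])
print(uris_a & uris_b)
```

Because the feed URL is excluded from both sets, the two pages share no URI and would not be merged under this rule.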