prevent ongoing document equivalence explosions

Open judell opened this issue 4 years ago • 0 comments

The following URL was assigned in a course:

https://www.npr.org/sections/health-shots/2010/08/11/129127432/paul-longmore-historian-and-advocate-for-disabled-dies

The course dashboard displayed this list of annotated documents:

These are all the "same" document, with document_id 22734. Most users won't (yet) be seeing this dashboard view, but for everyone our client will load and try to anchor ~2500 annotations, the vast majority of which do not apply to this article.

This happens because:

The docs share this common metadata declaration: {href: "https://feeds.npr.org/feeds/103537970/feed.json", rel: "alternate", type: "application/json"}
We formerly used rel="alternate" to form equivalences, until we realized that this pattern -- a feed URL that's common to a whole class of documents on a site -- was causing an explosion of unwanted equivalence.

If nobody had ever annotated an NPR health article containing https://feeds.npr.org/feeds/103537970/feed.json as rel="alternate", I don't think we'd be seeing this problem, because the linking records in document_uri would not exist. But they do, and rewriting the database to untangle things isn't an option because we merge equivalences.
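To make the mechanism concrete, here is a minimal sketch of how one shared feed URL can collapse unrelated pages into a single document. It assumes a simplified model in which any two pages that claim a common URI are merged. The function name `group_documents`, the in-memory union-find, and the example.com page URLs are all illustrative assumptions, not the actual back-end code or schema.

```python
from collections import defaultdict


def group_documents(claims):
    """Group page URLs into documents (hypothetical, simplified model).

    `claims` maps a page URL to the set of URIs it claims: itself plus any
    link hrefs. Pages that share any claimed URI land in the same group.
    """
    parent = {}

    def find(x):
        # Register unseen URIs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for page, uris in claims.items():
        find(page)
        for uri in uris:
            union(page, uri)

    groups = defaultdict(set)
    for page in claims:
        groups[find(page)].add(page)
    return sorted(sorted(group) for group in groups.values())


# Feed URL from the issue; page URLs are placeholders.
FEED = "https://feeds.npr.org/feeds/103537970/feed.json"
claims = {
    "https://example.com/article-1": {"https://example.com/article-1", FEED},
    "https://example.com/article-2": {"https://example.com/article-2", FEED},
    "https://example.com/article-3": {"https://example.com/article-3"},
}
groups = group_documents(claims)
```

In this sketch, article-1 and article-2 end up in the same group only because both claim the feed URL, while article-3, which does not claim it, stays separate.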

But can we at least prevent creating new false equivalences, so that the next previously-unannotated NPR health article doesn't land in the same bucket?

One way to do that would be to stop the client from sending link elements except for rel="canonical".

Currently, here, we allow alternate, canonical, bookmark, and shortlink.

Consider two previously unannotated articles that use <link rel="alternate" type="application/rss+xml" title="Health" href="https://feeds.npr.org/1128/rss.xml">:

1 https://www.npr.org/2020/09/11/911828384/trump-says-he-downplayed-coronavirus-threat-in-u-s-to-avert-panic

2 https://www.npr.org/sections/health-shots/2020/09/11/911885577/as-covid-19-vaccine-trials-move-at-warp-speed-recruiting-black-volunteers-takes-

Annotating 1 creates document_id 22,734 and merges it with all the others.

Now I'll annotate 2 using a client that shortens the array of allowed link elements from ['alternate', 'canonical', 'bookmark', 'shortlink'] to ['canonical']. We get a fresh document_id, 1,125,647.
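The change amounts to filtering link metadata by rel before it is sent. The sketch below illustrates the idea in Python rather than reproducing the client's own code. The function name `filter_links` and the canonical href are hypothetical, and the link shape follows the {href, rel, type} declaration quoted earlier in this issue.

```python
def filter_links(links, allowed_rels=("canonical",)):
    """Keep only link entries whose rel is in `allowed_rels`.

    Hypothetical helper: assumes each link is a dict with a single-token
    "rel" value, as in the metadata quoted in this issue.
    """
    allowed = set(allowed_rels)
    return [link for link in links if link.get("rel") in allowed]


# The feed link is the one named in this issue; the canonical href is a placeholder.
links = [
    {"href": "https://feeds.npr.org/1128/rss.xml", "rel": "alternate",
     "type": "application/rss+xml"},
    {"href": "https://example.com/article", "rel": "canonical"},
]

# Current allow-list: the shared feed link is still sent.
current = filter_links(links, ["alternate", "canonical", "bookmark", "shortlink"])

# Proposed allow-list: only the canonical link is sent.
proposed = filter_links(links)
```

With the shorter allow-list the feed URL never reaches the server, so it cannot be used to link the new article to existing documents.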

An alternate solution would be to keep sending these links and stop the equivalence on the back end. The client-side solution is easiest for me to demo, so that's what I'm illustrating here.

Sep 11 '20 19:09 judell