
Migrate incorrect Via 3 PDF document URLs

Open robertknight opened this issue 4 years ago • 3 comments

As a result of https://github.com/hypothesis/via3/issues/372, all existing annotations created on PDFs using Via3 before 2021-04-06 have an incorrect target_uri field. Instead of being the original URL of the PDF, it is the Via3 URL which proxies the PDF. In other words, rather than https://example.com/file.pdf, the target_uri is https://via3.hypothes.is/proxy/static/https://example.com/file.pdf or similar. We went through several different domains while working on a replacement for Via, including at least via3.hypothes.is and via4.hypothes.is.
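To make the shape of the fix concrete, here is a minimal sketch of how a proxied URL could be mapped back to the original PDF URL. It is not the code used by h or Via3. The function name `strip_via_proxy` is hypothetical, the host list contains only the two domains named above, and the `/proxy/static/` prefix is taken from the single example in this issue, so real data may need more prefixes.

```python
from urllib.parse import urlsplit

# Hosts named in this issue. Other Via domains may exist, so treat this
# set as an assumption rather than a complete list.
VIA_HOSTS = {"via3.hypothes.is", "via4.hypothes.is"}

# Assumed path prefix, taken from the one example URL in this issue.
# Real URLs may use other prefixes ("or similar").
PROXY_PREFIXES = ("/proxy/static/",)


def strip_via_proxy(uri: str) -> str:
    """Return the proxied URL if `uri` is a Via3 proxy URL, else `uri` unchanged."""
    parts = urlsplit(uri)
    if parts.hostname not in VIA_HOSTS:
        return uri
    for prefix in PROXY_PREFIXES:
        if parts.path.startswith(prefix):
            # Slice the raw string instead of rebuilding it from `parts`,
            # so the original URL's query string and encoding are preserved.
            origin_length = len(parts.scheme) + len("://") + len(parts.netloc)
            return uri[origin_length + len(prefix):]
    # Via host, but a path shape this sketch does not recognise: leave as-is.
    return uri
```

For example, `strip_via_proxy("https://via3.hypothes.is/proxy/static/https://example.com/file.pdf")` returns `https://example.com/file.pdf`, while a non-Via URL is returned untouched. Anything after a `?` is kept as part of the original URL, which is an assumption about how the proxy URLs were built.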

The underlying bug is addressed by https://github.com/hypothesis/via3/pull/427, but existing annotations will still need to be updated via a data migration or some other mechanism.

Fixing this issue will resolve a problem where the document details show up incorrectly in the Notebook and activity pages (https://hypothes.is/search) for annotations created with Via 3:

[Screenshot: Notebook showing the wrong PDF URL]

Be aware that it is not just the annotation.target_uri field which matters here. There are corresponding fields in the document_uri and document tables as well.

robertknight avatar Mar 31 '21 10:03 robertknight

CC @esanzgar as we talked about this on Slack recently.

robertknight avatar Mar 31 '21 10:03 robertknight

One way that you might think we could fix this is by updating the URI normalization algorithm, which already does de-Via-fication for legacy Via URLs. However, changing the algorithm doesn't update existing document URIs, and more generally we don't have a procedure to do that. See https://github.com/hypothesis/h/issues/6552.

robertknight avatar Mar 31 '21 11:03 robertknight

Querying a random sample of the annotation table in production results in an estimate of ~4M annotations that have URLs referencing via3.hypothes.is or via4.hypothes.is.

-- SYSTEM (1) samples roughly 1% of the table, hence the factor of 100.
SELECT 100 * count(*)
FROM annotation TABLESAMPLE SYSTEM (1)
WHERE target_uri LIKE 'https://via3.hypothes.is/%'
   OR target_uri LIKE 'https://via4.hypothes.is/%';
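The arithmetic behind the estimate can be sketched as follows: the query counts matches in a roughly 1% sample and scales the count by 100. The helper name `estimate_total` is hypothetical, and the sample count in the usage note is an illustrative figure chosen to match the ~4M estimate, not a number taken from production.

```python
def estimate_total(sample_count: int, sample_percent: float) -> int:
    """Scale a count taken from a sample up to an estimate for the whole table."""
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in the range (0, 100]")
    # A 1% sample is scaled by 100, a 5% sample by 20, and so on.
    return round(sample_count * 100 / sample_percent)
```

With a hypothetical 40,000 matching rows in a 1% sample, `estimate_total(40_000, 1)` gives 4,000,000. Because the SYSTEM method samples whole table pages rather than individual rows, the result is only a rough estimate.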

robertknight avatar Apr 01 '21 16:04 robertknight