Unable to annotate pages with non-Unicode encoded URLs

Open Chenastron opened this issue 2 years ago • 3 comments

I've encountered some web pages that cannot be annotated using Hypothesis, such as https://seesaawiki.jp/w/livedoor3432/d/%b2%c6%a4%b3%a4%bd%a5%c0%a5%a4%a5%a8%a5%c3%a5%c8%a1%aa . The "Search annotation" icon keeps loading and I cannot annotate. I cannot figure out the reason; please help.

My operating system is Windows 11 22H2 and my Edge version is 109.0.1518.69. Is the problem caused by URLs that contain percent-encoded non-English characters in a non-Unicode encoding, as in similar pages such as "https://seesaawiki.jp/w/livedoor3432/d/%a5%c6%a5%b9%a5%c8%a4%c7%a4%b9"?

Chenastron avatar Jan 29 '23 10:01 Chenastron

Looks like the issue is to do with https://github.com/hypothesis/client/blob/290d22a8640446fb0d2e1bed93b8ce9a36bf891b/src/annotator/integrations/html-metadata.js#L63 and character encoding.

On this particular URL, decodeURIComponent(document.location.href) throws a URIError. decodeURIComponent expects each percent-encoded sequence in the URL to be a byte written as %HH, where "HH" is a hexadecimal code, and the resulting bytes must decode as valid UTF-8 (reference). The decoded bytes here are not valid UTF-8, so the call fails.
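
To make the failure concrete, here is a minimal sketch that decodes two inputs: the path segment from the second URL in the report, and a UTF-8 percent-encoded string for comparison. The helper name `describeDecode` is hypothetical, and the note about the page's legacy encoding is an assumption, not something confirmed in this thread.

```javascript
// Path segment taken from the URL in the report. Its first byte, 0xA5, cannot
// start a UTF-8 sequence. (Assumption: the site uses a legacy Japanese
// encoding such as EUC-JP.)
const legacyEncoded = '%a5%c6%a5%b9%a5%c8%a4%c7%a4%b9';

// encodeURIComponent always emits UTF-8 bytes, so this round-trips cleanly.
const utf8Encoded = encodeURIComponent('テスト');

// Hypothetical helper: report whether decoding succeeded instead of throwing.
function describeDecode(input) {
  try {
    return { ok: true, value: decodeURIComponent(input) };
  } catch (err) {
    return { ok: false, errorName: err.name };
  }
}

console.log(describeDecode(utf8Encoded));   // decodes successfully
console.log(describeDecode(legacyEncoded)); // reports a URIError
```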

I'm not clear on why exactly decodeURIComponent is used here, but it dates back nearly 10 years (5be63e0eaf1734541ea8551e641f4e037ad274ba). My guess is that it was an attempt to normalize URIs.

We will have to be a bit careful when changing this to make sure that it gives the same result for all URLs that were previously decoded successfully, to avoid annotations going missing.

robertknight avatar Jan 29 '23 16:01 robertknight

See also https://github.com/hypothesis/client/issues/531.

robertknight avatar Apr 25 '23 14:04 robertknight

Would it be an acceptable solution to wrap the decoding in a try-catch block and, on a URIError, fall back to the original encoded URI? This way, URIs that previously decoded successfully are not affected and their annotations do not go missing.

Are there any issues we would run into by not decoding the URI? In my limited testing it seems okay; I made some test annotations on example.com.
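
The try-catch fallback proposed above could look roughly like the sketch below. The function name `decodeURIOrOriginal` is hypothetical and this is not the actual Hypothesis client code; it only illustrates the idea that URIs which already decode keep their old result, while undecodable ones are returned unchanged.

```javascript
// Hypothetical helper, not the real client implementation.
function decodeURIOrOriginal(uri) {
  try {
    // Same behavior as before for URIs whose escapes are valid UTF-8.
    return decodeURIComponent(uri);
  } catch (err) {
    if (err instanceof URIError) {
      // Escapes are not valid UTF-8: keep the encoded form as-is.
      return uri;
    }
    // Anything else is unexpected, so do not swallow it.
    throw err;
  }
}

console.log(decodeURIOrOriginal('https://example.com/a%20b'));
console.log(decodeURIOrOriginal('https://seesaawiki.jp/w/livedoor3432/d/%a5%c6%a5%b9%a5%c8%a4%c7%a4%b9'));
```

Rethrowing anything other than URIError keeps unrelated failures visible rather than silently returning the input.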

tom-pj avatar Jan 27 '24 13:01 tom-pj

This should now be working. Thanks @tom-pj!

acelaya avatar Feb 23 '24 09:02 acelaya

Happy to help!

tom-pj avatar Feb 23 '24 09:02 tom-pj