
Strip anchor tags from provenance URLs / canonicalize URLs

Open andymatuschak opened this issue 3 years ago • 3 comments

We store provenance information when users collect prompts. We use this for analytics and will eventually use this to group "same-page" prompts in the interface. If the author provides a canonical URL, we'll use that. But if they don't, we just look at the current page URL. Ideally we want to make this as canonical as possible. If someone visits http://foo.com/essay#some-header, we should store http://foo.com/essay as the URL, stripping the anchor.
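
As a rough illustration of the anchor-stripping step, here is a minimal sketch using the standard WHATWG `URL` class. The function name `stripFragment` is hypothetical and not part of the Orbit codebase; it assumes the input is an absolute URL.

```typescript
// Hypothetical helper (not from the Orbit codebase): remove the fragment
// ("#some-header") from an absolute URL before storing it as provenance.
function stripFragment(rawURL: string): string {
  const url = new URL(rawURL); // throws a TypeError if rawURL is not an absolute URL
  url.hash = ""; // assigning an empty string drops the fragment and the "#"
  return url.toString();
}

// Usage: the example from the issue description.
const stored = stripFragment("http://foo.com/essay#some-header");
// stored === "http://foo.com/essay"
```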

Are there general libraries / algorithms / heuristics for fuzzily canonicalizing URLs?

andymatuschak, Jan 27 '21 02:01

Also this kind of nonsense is showing up:

https://andymatuschak.org/prompts/?ck_subscriber_id=1121236996&utm_source=convertkit&utm_medium=email&utm_campaign=Creating+Habits+%F0%9F%A7%A4%20-%205117179

andymatuschak, Feb 03 '21 22:02

Not quite sure how to deal with that. In some cases, the query nonsense is meaningful. Bluh.
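
One conservative option, sketched here under assumptions: remove only parameters on a known tracking denylist and leave every other query parameter alone, since some are meaningful. The denylist contents and the name `stripTrackingParams` are illustrative guesses, not an established standard or anything from the Orbit codebase.

```typescript
// Assumed denylist of tracking-only parameters; extend as new ones show up.
const TRACKING_PARAMS = new Set<string>(["ck_subscriber_id", "fbclid", "gclid"]);

// Hypothetical helper: drop "utm_*" and denylisted parameters, keep the rest.
function stripTrackingParams(rawURL: string): string {
  const url = new URL(rawURL);
  // Copy the keys first so deleting does not disturb the iteration.
  Array.from(url.searchParams.keys()).forEach((key) => {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      url.searchParams.delete(key);
    }
  });
  return url.toString();
}

// Usage: a meaningful parameter ("v") survives, the tracking one does not.
const cleaned = stripTrackingParams("https://example.com/watch?v=abc&utm_source=x");
// cleaned === "https://example.com/watch?v=abc"
```

A denylist errs on the side of keeping too much rather than merging distinct pages. Note that editing `searchParams` re-serializes the remaining query string, so its percent-encoding may change slightly (for example, `%20` becomes `+`).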

andymatuschak, Feb 03 '21 22:02

There is some work being done on URL normalization here for a Chrome extension. The maintainer is planning to make it a library. See the write-up.

c1-g, May 17 '22 06:05