Linkify punycode-encodes em-dash

Open alexwennerberg opened this issue 3 years ago • 5 comments

Hi! Thanks for this library -- I use it in my new mailing list software to detect links in emails. Someone brought what appears to be a bug to my attention: https://lists.flounder.online/crabmail/threads/1beaffd2384b.html

Here's my code: https://git.alexwennerberg.com/crabmail/file/src/utils.rs.html#l22

I think that this could unambiguously be parsed, but I'm not 100% sure. What do you think?

alexwennerberg avatar Dec 26 '21 16:12 alexwennerberg

> Someone brought what appears to be a bug to my attention: https://lists.flounder.online/crabmail/threads/1beaffd2384b.html

Hmm that link doesn't load for me, could you provide a copy here?

robinst avatar Feb 11 '22 05:02 robinst

Ah, sorry -- I shuffled things around a bit on my site. Here's the fixed link:

https://lists.flounder.online/crabmail/threads/[email protected]

alexwennerberg avatar Feb 11 '22 05:02 alexwennerberg

I see. That's an interesting case, because an em-dash (—) can currently be part of a URL, e.g. like this:

https://www.example.com/—

In that case, the whole text including em-dash would get linked.

Also note that GitHub behaves the same way here:

https://www.example.com—

https://www.example.com/—

We could fix the case where it's part of the domain, see also #29 which has some discussion around that. But what would you expect with the case where it's part of the path?

robinst avatar Feb 11 '22 06:02 robinst

I think that if it's part of the path, it should be treated as such. I guess this raises a broader question: should this library reject invalid TLDs? For example:

https://lists.flounder.online/test/threads/[email protected]

I think that the tradeoffs that you've made with the library as written are reasonable though

alexwennerberg avatar Feb 11 '22 06:02 alexwennerberg

I've reworked domain parsing in 0.9.0 (see https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11), but I haven't addressed this yet.

I think we could now do this by rejecting TLDs that contain non-alphanumeric Unicode characters. Note that there are TLDs that contain non-ASCII characters, see examples here (but they would be alphanumeric): https://en.wikipedia.org/wiki/Internationalized_country_code_top-level_domain

robinst avatar Jul 11 '22 05:07 robinst
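
For readers following along, the TLD check proposed in the last comment could look roughly like the sketch below. This is not linkify's code: `tld_is_plausible` is a hypothetical helper written only against the Rust standard library. It also assumes an ASCII hyphen should be accepted so that Punycode TLDs of the form `xn--...` still pass, which goes slightly beyond "alphanumeric only" as worded above.

```rust
/// Hypothetical helper, not part of linkify's API.
/// Returns true when the last label of `domain` (the TLD) is non-empty and
/// contains only Unicode alphanumeric characters or ASCII hyphens.
/// Allowing '-' is an assumption made here so that Punycode TLDs pass.
fn tld_is_plausible(domain: &str) -> bool {
    // Ignore a single trailing dot, as in the fully qualified "example.com.".
    let domain = domain.strip_suffix('.').unwrap_or(domain);
    match domain.rsplit('.').next() {
        Some(tld) => !tld.is_empty() && tld.chars().all(|c| c.is_alphanumeric() || c == '-'),
        None => false,
    }
}

fn main() {
    // Plain ASCII TLDs pass.
    assert!(tld_is_plausible("www.example.com"));
    assert!(tld_is_plausible("lists.flounder.online"));
    // A non-ASCII but alphanumeric TLD passes (Cyrillic letters).
    assert!(tld_is_plausible("example.\u{0440}\u{0444}"));
    // A TLD followed directly by an em-dash (U+2014) is rejected.
    assert!(!tld_is_plausible("www.example.com\u{2014}"));
    println!("all checks passed");
}
```

A check like this only covers the case where the em-dash directly follows the domain. An em-dash in the path, as in `https://www.example.com/—`, is unaffected and would still be linked.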