linkify
linkify copied to clipboard
Fix URL parser for parentheses and Chinese Characters
Currently, URLs like http://foo.com)你好 will be interpreted as is. Instead, it should be separated into http://foo.com and )你好 (according to URL conventions).
Although parentheses are by definition allowed in URLs, they usually appear encoded. So parentheses that appear at the end of a URL are, for most of the time, not part of it.
Chinese characters are simply not allowed in URLs.
This PR fixes those issues.
May I suggest adding a test to reproduce that? Just to be sure it never breaks in the future.
Sorry, I'm quite new to GitHub and it looks like I'm not quite aware of how PRs and tests work.
The last 5 "experimental" commits were not meant for merging. I was trying to find the best Regex that meets the need of my forum (where users may post links). Although I've found the best Regex for my app, it may not be the best for general purposes and thus might not be suitable for merging.
Since this is only a simple Regex change, I suggest closing this PR and manually "merge" this commit, if needed.
As for tests, since I'm not familiar with how it works..., could you please provide instructions on how to add one? Sorry for troubling you...
Hey, thanks for the contribution! I could do that, or if you wish I can walk you through bring up your PR to something we can merge together (and you learn!). Let me know!
Thank you very much for your kind offer :) Where should we begin?
After experimenting with the URL parser, I came up with several ideas:
Unexpected Unicode Characters
Currently, an URL element will end only when there is a whitespace character matched by \s. This will cause the URL parser to match unexpected characters (like Chinese ones).
According to RFC3986, encoded URLs may only contain the following characters:
%:/?#[]@!$&'()*+,;=-._~
plus ALPHA and DIGIT
which translates to [%:/\?#\[\]@!\$&'\(\)\*\+,;=-\.~\w] in Regex.
So the _urlRegex constant can be safely updated to
final _urlRegex = RegExp(
r'^(.*?)((?:https?:\/\/|www\.)[^\s/$.?#].[%:/\?#\[\]@!\$&'\(\)\*\+,;=\-~.\w]*)',
caseSensitive: false,
dotAll: true,
);
(not sure why but it looks like the syntax is a little different in Dart, why does - needs escaping while . does not?)
This will prevent unexpected unicode characters from being matched into our LinkElement.
Parentheses
Parentheses appear as sub-delimiters in RFC3986, so they are technically allowed as a part of the URL. But in practical context they are often confused with the surrounding text (e.g. in Markdown, where links always ends with )). Previously I have suggested removing them from the URL Regex, but that's not ideal and only serves as a workaround. There should be better ways to handle this. For example, since parentheses usually appear in pairs, maybe we can count the number of ( before the LinkElement and match it against the ) at end of LinkElement?