URL regex is not considering punctuation

Open erdnaxe opened this issue 4 years ago • 4 comments

const urlRegexp = /https?:\/\/[-a-zA-Z0-9@:%/._\\+~#&()=?]+[-a-zA-Z0-9@:%/_\\+~#&()=]/g;

This regex does not always seem to work. For example, this link is correctly recognized by the GitHub Markdown parser, but not by Galène:

  • https://example.com/LettreÀÉlise

We need a fairly complex regex, since we don't want to include trailing dots, <> characters, and so on. If I find a better URL regex, I will post it here.
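To make the failure concrete, here is a minimal sketch that runs the regex quoted above against the example link. The surrounding sentence and the variable names `text` and `matches` are only for illustration; the non-ASCII letters are written as escapes so the result does not depend on file encoding.

```javascript
// The regex quoted in this issue, copied unchanged.
const urlRegexp = /https?:\/\/[-a-zA-Z0-9@:%/._\\+~#&()=?]+[-a-zA-Z0-9@:%/_\\+~#&()=]/g;

// \u00C0 is "À" and \u00C9 is "É".
const text = 'See https://example.com/Lettre\u00C0\u00C9lise for details';

// The character classes only list ASCII letters, so the match stops
// right before the first non-ASCII letter.
const matches = text.match(urlRegexp);
console.log(matches[0]); // https://example.com/Lettre
```

The link is cut after "Lettre" because "À" is not in either character class.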

erdnaxe avatar Apr 14 '21 09:04 erdnaxe

It turns out that the problem might not come from the regex itself but from the fact that the regex is applied to the non-encoded URL.

This is correctly parsed by Galène: https://example.com/Lettre%C3%80%C3%89lise

This is not correctly parsed: https://example.com/LettreÀÉlise
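The difference between the two forms can be checked with a short sketch using the standard `encodeURI` function. This only illustrates the observation; it is not a proposed fix, since calling `encodeURI` on a whole chat message would also escape spaces and other text outside the URL. The names `raw` and `encoded` are illustrative.

```javascript
// The regex quoted in this issue, copied unchanged.
const urlRegexp = /https?:\/\/[-a-zA-Z0-9@:%/._\\+~#&()=?]+[-a-zA-Z0-9@:%/_\\+~#&()=]/g;

// \u00C0 is "À" and \u00C9 is "É".
const raw = 'https://example.com/Lettre\u00C0\u00C9lise';

// encodeURI percent-encodes the UTF-8 bytes of non-ASCII characters
// and leaves ":" and "/" untouched.
const encoded = encodeURI(raw);
console.log(encoded); // https://example.com/Lettre%C3%80%C3%89lise

// "%" and hex digits are in the character classes, so the encoded
// form matches in full, while the raw form is truncated.
console.log(encoded.match(urlRegexp)[0] === encoded); // true
console.log(raw.match(urlRegexp)[0] === raw);         // false
```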

erdnaxe avatar Apr 14 '21 10:04 erdnaxe

There's the encoding issue, which is due to the fact that I don't know how to write Unicode regexps in JavaScript. There's also the issue of punctuation: punctuation at the end of a URL usually belongs to the surrounding sentence, but sometimes it is part of the URL and must be preserved:

I'd like you to check https://galene.org. As mentioned on https://galene.org, Pion is great. Pion (see https://pion.ly) is great.

But

Please see https://en.wikipedia.org/wiki/Silver_Streak_(film)

I need help with this.
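One possible direction, sketched under assumptions: JavaScript supports Unicode property escapes such as `\p{L}` when the regex has the `u` flag (ES2018), and trailing punctuation can be trimmed in a second pass that keeps a closing parenthesis only when it balances an opening one inside the URL. The names `urlCandidate`, `countChar`, `trimUrl` and `findUrls` are hypothetical and not part of Galène, and the set of trimmed characters is a guess.

```javascript
// Same character class as the original regex, with Unicode letters and
// digits added. \p{...} requires the `u` flag.
const urlCandidate = /https?:\/\/[-\p{L}\p{N}@:%\/._\\+~#&()=?]+/gu;

// Count occurrences of a single character in a string.
function countChar(s, c) {
    return s.split(c).length - 1;
}

// Drop trailing characters that most likely belong to the sentence.
function trimUrl(candidate) {
    let url = candidate;
    for (;;) {
        const last = url[url.length - 1];
        if ('.?:'.includes(last)) {
            // Sentence punctuation right after the URL.
            url = url.slice(0, -1);
        } else if (last === ')' && countChar(url, ')') > countChar(url, '(')) {
            // Unbalanced ")" closes a parenthesis opened outside the URL.
            url = url.slice(0, -1);
        } else {
            return url;
        }
    }
}

// Find URL candidates in a message, then trim each one.
function findUrls(text) {
    return Array.from(text.matchAll(urlCandidate), m => trimUrl(m[0]));
}

console.log(findUrls('Pion (see https://pion.ly) is great.'));
console.log(findUrls('Please see https://en.wikipedia.org/wiki/Silver_Streak_(film)'));
```

With the examples above, the trailing dot and the closing parenthesis after https://pion.ly are dropped, while the parenthesis in the Wikipedia link is kept because it is balanced. This is a heuristic: a URL that really ends in a dot or in an unbalanced parenthesis would be cut short.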

jech avatar Apr 22 '21 14:04 jech

Found this StackOverflow post with links to some interesting libraries: https://stackoverflow.com/questions/37684/how-to-replace-plain-urls-with-links/21925491#21925491

We could use a library such as anchorme.js, which seems to be rather accurate, but it adds a lot of code. Maybe we would rather want something smaller but with lower accuracy? For example, do we need to check URLs against the IANA list? Do we need to have the list of all existing TLDs (https://github.com/alexcorvi/anchorme.js/blob/gh-pages/src/tlds.ts)?

For Unicode support, this lib seems to do this: https://github.com/alexcorvi/anchorme.js/blob/gh-pages/src/dictionary.ts#L29

If we don't need all this extra verification, I might try to do a stripped-down, simpler fork of anchorme.js for Galène, as the code seems rather clean.

erdnaxe avatar Apr 24 '21 08:04 erdnaxe

I just noticed that my terminal emulator (Alacritty) matches URLs quite well. Looking at the code, it's using https://github.com/chrisduerr/rfind_url/, which consists of one Rust file to match URLs. It does not look that complex, but it's definitely more than just a simple regex.

erdnaxe avatar May 01 '21 13:05 erdnaxe