terminal
terminal copied to clipboard
Evaluate different regular expressions we can use for URL detection
Environment
Windows build number: 10.0.20180.1000
Windows Terminal version (if applicable): 1.5.3142.0
Any other software? experienced in weechat running in screen on remote server connected by mosh running in WSL2 Ubuntu, though it shouldn't matter
Steps to reproduce
- Encounter
https://www.x-kom.pl/g-5/c/158-zasilacze-do-komputera.html?f[1775][52858]=1
string - Hold
ctrl
key while pointing it
Another example would be https://en.wikipedia.org/wiki/Tibia_(video_game)
Expected behavior
Whole link should be detected, underlined and followed
Actual behavior
https://www.x-kom.pl/g-5/c/158-zasilacze-do-komputera.html?f
is detected, underlined and followed
for the second example https://en.wikipedia.org/wiki/Tibia
gets detected
Ah, this is likely because of the regular expression we've used to detect URLs.
I already mentioned in feature request that it's really complex (if not impossible) issue to do this right every time, but I hope it can be made better.
Do you mind if I add other links that are not detected properly? for now I'll post it here in comment but will move to the issue later if that's fine
https://en.wikipedia.org/wiki/Tibia_(video_game)
is recognized as https://en.wikipedia.org/wiki/Tibia
and this one is especially tricky as most existing link detectors recognize it as https://en.wikipedia.org/wiki/Tibia_(video_game
More detailed discussion about parentheses, braces and so on happened under the feature request if I remember correctly
You can definitely put more mis-detected URLs here. Good finds. :smile:
I'm going to rename this to "evaluate different regexes we can use for URL detection" :smile:
Changing the expression to (\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|$!:,.;()\[\]]*)
would match these but I'm fairly sure there will be cases when it's too greedy
Overall I don't think there's a 100% bulletproof solution, there's always a chance of a typo after all, but I see at least a room for improvement in case of balanced parentheses in cases like (here's the link you asked for: https://example.com)
the closing parenthesis shouldn't be matched more often than it should and that may be the point to stop trying to make it perfect. There's also the temptation to match closing parenthesis only if it's balanced withing the link itself, though there is https://en.wikipedia.org/wiki/Emoticon_;)
, but once again: this can't be made perfect, so this approach may be more fitting and likely more performant, and we avoid the issue of parentheses being recognized as unbalanced when the opening one is out of buffer (think ([very long text here] here's the link you asked for: https://example.com)
)
Another case from #8907:
https://⌘v.ircd.xyz/o.gif
I'm not sure if our pattern is matching unicode characters super well (or if it even can)
Noticed https://duckduckgo.com/?q=[github]&t=ffab&ia=web is encoded to https://duckduckgo.com/?q=%5Bgithub%5D&t=ffab&ia=web on paste but https://⌘v.ircd.xyz/o.gif is not encoded as https://xn--v-cqo.ircd.xyz/o.gif as it does on a browser.
coming back with another case:
https://pl.wikipedia.org/wiki/Izolator_Boguchwała apparently can appear in text and will be handled properly by browsers but is parsed as https://pl.wikipedia.org/wiki/Izolator_Boguchwa by WT
(\b(https?|ftp|file)://[-\p{L}\p{M}*+0-9+&@#/%?=~_|$!:,.;()\[\]]*)
would work in Golang, but once again, it should probably be modified for C++ and evaluated against existing test cases
The TLDR:
There can be problems with this feature involving who is doing what and when. The underlining is essentially a hint from Terminal. The hinting can't be that an Internet Resource Identifier (URI, including URL) is intended, only that there is a resemblance that is carefully touched. It is important not to have this feature be over-presumptive or to do too much work in that respect absent any act on the part of an user.
SOME CONTEXT (longer)
There are a couple of things to be dealt with here. First, URI's are defined entirely using ISO-646 (ASCII) 7-bit codes. These days, that means Unicode page U+0000 to U+007F, not all of which are usable. The generic term is URI, which subsumes URL, and I will use that now,
The idea was that URIs could be communicated carefully in email and text files. A text video case also applies. The specific rules currently in effect (with some supplements) are found in RFC3986.
There are interfaces, these days, that hide the %-encoding by which other codes (such as UTF8 octets greater than 0x7f
) and also the ones that are used in URI structure can be passed through such that the URI parsing remains correct. It is not exactly appropriate for Terminal to deal with that as some sort of automatic presumption. It needs to show the text that it is presented. Whether it does %-encoding and other conversions when the CTRL-Click is done is probably appropriate. But not before. This depends on how "press Control-Click to follow ... " is implemented.
There are scheme issues also. http: and https: are well-known URI schemes. When file: is used as a scheme we don't exactly have the http: scheme rules anymore. Ditto mailto: and others, urn:, etc.
These are all reasons for Terminal to do underlining carefully and essentially as a hint. There is not context to determine whether a specific usable URI is intended or not For example, It is not going to figure out, when a compiler error stream is sent through it, how those filename and line-number diagnostics be links into the app that produced the stream, so an IDE can provide views of the files the compiler execution processed (not to mention with regard to pre-processing as well).
In any case, there are more layers here than tend to be observable and it is easy to confuse what some application accepts as a "URI/URL" and what it actually does in conforming to the applicable URI specification. So what has been provided as an user-ciovenience becomes an abstraction pitfall when talking about requirements on a service such as Terminal.
There is some useful information about the technical issues starting in section 2.3 at RFC3305. (Notice how this is clickable or not depending on how this comment is being viewed. RFC7595 indicates some things about generic cases and even registration of additional scheme.
There are, as I said, ways to ensure that not too much of the text is taken to be the URI, and that is by using surround space (spaces not allowed in the URI directly) or using something like parentheses, ( ... )
to bracket the URI, the latter being how that is exactly handled in Markdown, including the raw Markdown of this comment.
When file: is used as a scheme we don't exactly have the http: scheme rules anymore.
Especially on Windows, ironically. Because Windows uses \
as its path separator, it's not currently recognized fully. For example with file:///C:\Users\Yay295
only file:///C
is underlined. On the other hand if I change the slashes, the entirety of file:///C:/Users/Yay295
is recognized.
Spaces also break the URI detection. This would would be harder to deal with, but I don't know if it would be possible for file:
URI's at least to check if the path exists first and if it doesn't, see if matching more text finds a path? I imagine that could be difficult with network paths though, and it's definitely more than just using a regex.
can an url contain whitespace? if not, then everything from http to first whitespace should be treated as url. or, until a more sophisticated solution is tuned to catch every case, there should be an option in settings: "[x] treat everything from http to first whitespace as url, I accept this is a risk and not all strings might be a correct url."
that may be a reasonable idea actually, fast and basically covers everything that can be covered with more complex options, even if it can sometimes get a bit too eager that should be fine
URL detection regex should consider the protocol prefix (http://
, https://
) optional, and should default to prepending https://
when protocol is missing, or perhaps allowing default protocol to be configured. A lot of terraform/terragrunt code uses custom module source URL syntax that strips off the proto, e.g.:
source = "github.com/lgallard//terraform-aws-secrets-manager?ref=0.6.1"
..and it'd be nice to be able to click this to read the docs on the module.
As described here the above module source URL syntax is a "convenient alias" for the general Git repository address scheme, which, if used, would yield a clickable URL in MS Terminal:
source = "git::https://github.com/lgallard//terraform-aws-secrets-manager?ref=0.6.1"
I just wanted to mention the original problem URL syntax since the users of terraform code often don't own the code, and aren't in control of the syntax they use.
Would be great if the regex was configurable.
In an URL detection regex that I've been using on IRC for years, I do:
- Start with a word boundary
- Look for
https?://
orwww.
- At the end of the URL, greedily consume phrases followed by parentheses, e.g.
hello(world)
, and after, non-greedily other suitable characters - Using lookahead, confirm that the URL ends with a space or EOL, with some possible punctuation before that, e.g.
[\]>,.)!?:'"”@]*
This works well enough when the URL is inside quotes, parentheses, and also handles the Wikipedia URLs that has quotes within them
I'm trying to get clickable links to work with the file:
scheme and umlauts (non-ASCII) characters in the file name. Unfortunately, as explained above, it doesn't work as hoped for. Two different solutions would solve my problem:
- Make the Regex configurable or at least offer regex'es that are not as limited as the current one. Currently, the URL
file://C:/ümlaut.txt
is only used up to the last/
, which is not what I need. - Alternatively, support percent-encoded UTF-8 file names. WT currently opens exactly those file names as are on the screen. So if I encode the URL as
file://C:/tmp/%C3%BCmlaut.txt
, then the regex is happy, but WT tries to openC:\tmp\%C3%BCmlaut.txt
(with the percentages intact) and that obviously doesn't work.
Not sure if this warrants its own issue, but it appears to be directly related to this issue:
Note in the image that the hyperlink is surrounded by quotation marks on the host screen, but clicking the link results in the browser opening a url formatted like so <url>"
incorrectly parsing one of the quotation marks surrounding the URL as a part of it.