wezterm URL with port specifier not recognized as URL

Describe the bug

Some URLs are not recognized as URLs (e.g. not clickable).

Environment (please complete the following information):

OS: Linux Wayland (Wezterm running under XWayland)
Version: 20210502-154244-3f7122cb
The active keyboard layout name: US

To Reproduce

From the output of jupyter lab --no-browser (over SSH):

$ jupyter lab --port=9000 --no-browser
[I 2021-07-07 16:44:06.893 ServerApp] jupyterlab | extension was successfully linked.
[I 2021-07-07 16:44:07.231 ServerApp] jupyter_nbextensions_configurator | extension was found and enabled by nbclassic. Consider moving the extension to Jupyter Server's extension paths.
[I 2021-07-07 16:44:07.231 ServerApp] jupyter_nbextensions_configurator | extension was successfully linked.
[W 2021-07-07 16:44:07.231 ServerApp] The module 'ipyparallel.nbextension' could not be found. Are you sure the extension is installed?
[I 2021-07-07 16:44:07.231 ServerApp] nbclassic | extension was successfully linked.
[W 2021-07-07 16:44:07.267 ServerApp] jupyter_nbextensions_configurator | extension failed loading with message: 'nbextensions_path'
[I 2021-07-07 16:44:07.268 LabApp] JupyterLab extension loaded from /users/tiziano/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jupyterlab
[I 2021-07-07 16:44:07.268 LabApp] JupyterLab application directory is /users/tiziano/.pyenv/versions/3.9.5/share/jupyter/lab
[I 2021-07-07 16:44:07.271 ServerApp] jupyterlab | extension was successfully loaded.
[I 2021-07-07 16:44:07.279 ServerApp] nbclassic | extension was successfully loaded.
[I 2021-07-07 16:44:07.281 ServerApp] Serving notebooks from local directory: /users/tiziano
[I 2021-07-07 16:44:07.281 ServerApp] Jupyter Server 1.7.0 is running at:
[I 2021-07-07 16:44:07.281 ServerApp] http://localhost:9000/lab?token=a7a5f779e0a4f5097e4dfe4b7d9e4a6a794a30ad8ba0b6ff
[I 2021-07-07 16:44:07.281 ServerApp]     http://127.0.0.1:9000/lab?token=a7a5f779e0a4f5097e4dfe4b7d9e4a6a794a30ad8ba0b6ff
[I 2021-07-07 16:44:07.281 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2021-07-07 16:44:07.287 ServerApp]

    To access the server, open this file in a browser:
        file:///users/tiziano/.local/share/jupyter/runtime/jpserver-3401-open.html    Or copy and paste one of these URLs:
        http://localhost:9000/lab?token=a7a5f779e0a4f5097e4dfe4b7d9e4a6a794a30ad8ba0b6ff
        http://127.0.0.1:9000/lab?token=a7a5f779e0a4f5097e4dfe4b7d9e4a6a794a30ad8ba0b6ff
[I 2021-07-07 16:44:27.845 LabApp] Build is up to date

The file:/// URL is correctly recognized as such and clickable, the http:// URLs are not (none of them).

Jul 07 '21 15:07 dev-zero

If the built-in rules aren't matching, you can define your own: https://wezfurlong.org/wezterm/hyperlinks.html#implicit-hyperlinks If you can nail down what's not matching, please submit a pull request to fix it: this is the default in the code: https://github.com/wez/wezterm/blob/main/config/src/lib.rs#L1742

Jul 07 '21 15:07 wez

Ok, I think I see why it is not matching. How general vs costly should the general regex be? According to this SO answer the full universal RFC compatible regex is a bit "extensive".

Jul 07 '21 15:07 dev-zero

Cheaper is better! I don't think this regex needs to be perfect, just to match the most common/useful things out of the box

Jul 07 '21 17:07 wez

I was surprised when output from my local development server with a local address wasn't clickable. It is indeed because the built in URL matcher doesn't work with ports. I added this to my wezterm config file:

 hyperlink_rules = {
        -- Linkify things that look like URLs
        -- This is actually the default if you don't specify any hyperlink_rules
        {
          regex = "\\b\\w+://(?:[\\w.-]+)(?:(:?:\\.[a-z]{2,15}\\S*)|(?::\\d{1,5}))\\b",
          format = "$0",
        },

        -- linkify email addresses
        {
          regex = "\\b\\w+@[\\w-]+(\\.[\\w-]+)+\\b",
          format = "mailto:$0",
        },

        -- file:// URI
        {
          regex = "\\bfile://\\S*\\b",
          format = "$0",
        },
  }

The regex was taken from @matyklug18 's patch above.

Jun 10 '22 22:06 Fingel

In addition to the issues described by this issue and its referenced issues, URLs ending with ) e.g. https://en.wikipedia.org/wiki/Class_(set_theory) will not include the trailing closing ) in the clicked result. Likewise for URLs ending with - as in, e.g., https://www.onlinegdb.com/fork/Jd2x8Yvh-. Some websites require the closing / on URLs, e.g. some cgit repositories.

After a lot of tweaking and experimentation, I've settled on these two regexes for URLs. I specifically tried to make the regexes as simple as possible (using \w and \S) so they would be more performant. Note that \w+:// will match http://, https://, file://, ftp://, nntp://, gopher://, irc://, etc, all with one regex without additional file://-specific, etc, regexes.

    hyperlink_rules = {
        -- First handle URLs wrapped with punctuation (i.e. brackets)
        -- e.g. [http://foo] (http://foo) <http://foo> etc
        -- the punctuation will be underlined but excluded when clicked
        {
            regex = '[[:punct:]](\\w+://\\S+)[[:punct:]]',
            format = '$1',
        },

        -- Then handle URLs not wrapped in brackets
        -- and include terminating ), / or - characters, if any
        -- these seem to be the most common trailing characters that are part of URLs
        -- there may be additional common ones. . . 
        {
            regex = '\\b\\w+://\\S+[)/a-zA-Z0-9-]+',
            format = '$0',
        },
    },

I have tested these two regexes extensively for the last week or so on busy IRC channels and they seem to work much, much better than the default regexes. With the defaults, around 20% to 40% of hyperlinks would need manual intervention to add a missing trailing /, ) or -, or to highlight and copy the hyperlink because the default regex stopped prematurely at an earlier word-boundary. With the above two regexes, 99% of hyperlinks are clickable. As for the rare 1% of cases, it unfortunately doesn't flawlessly handle messages like:

(test if your terminal supports sixels before anything: https://github.com/csdvrx/sixel-testsuite)

It ends up making https://github.com/csdvrx/sixel-testsuite) the clickable URL, with the ) attached to the end. :-( Hopefully someone has an idea to improve this situation. Fortunately, this case is relatively rare because most people know to put a space after URLs so trailing punctuation doesn't become clickable (or to wrap the URL with brackets). MOST people! :-)

Additionally, with first regex for handling wrapped punctuation/brackets, only the captured $1 region will be sent to the browser, but wezterm will still underline the outside punctuation/brackets anyway. I'm hoping that wezterm can be patched to underline only the captured region(s) from the capture group(s).

Feb 18 '23 23:02 pragma-

I think the reason why not only the captured region is captured is because of this line: https://github.com/wez/wezterm/blob/23211fc8a5df8ee3c491450f0fab54655c431ca9/termwiz/src/hyperlink.rs#L238 Capture 0 gets the entire match regardless of capture regions.

Feb 20 '23 03:02 Funami580

re: punct version, what do you think about splitting it up into explicit regexes for the various pairs of brackets?

        {
            regex = '\\((\\w+://\\S+)\\)',
            format = '$1',
        },
        {
            regex = '\\[(\\w+://\\S+)\\]',
            format = '$1',
        },
        {
            regex = '<(\\w+://\\S+)>',
            format = '$1',
        },

Mar 17 '23 01:03 wez

I've made some updates to incorporate your suggested rules, and added a way to select which capture group is highlighted. https://wezfurlong.org/wezterm/config/lua/config/hyperlink_rules/ has more information about this.

It typically takes about an hour before commits are available as nightly builds for all platforms. Linux builds are the fastest to build and are often available within about 20 minutes. Windows and macOS builds take a bit longer.

Please take a few moments to try out the fix and let me know how that works out. You can find the nightly downloads for your system in the wezterm installation docs.

If you prefer to use packages provided by your distribution or package manager of choice and don't want to replace that with a nightly download, keep in mind that you can download portable packages (eg: a .dmg file on macOS, a .zip file on Windows and an .AppImage file on Linux) that can be run without permanently installing or replacing an existing package, and can then simply be deleted once you no longer need them.

If you are eager and can build from source then you may be able to try this out more quickly.

Mar 17 '23 04:03 wez

re: punct version, what do you think about splitting it up into explicit regexes for the various pairs of brackets?

Good idea. I'd actually been considering replacing [[:punct:]] with opening/closing character classes like [(\[<] and [)\]>] because [[:punct:]] could potentially be very large, especially if it includes all of Unicode punctuations.

What I'm wondering now is whether a single regex using character classes for the three kinds of brackets, like [(\[<](\w+://\S+)[)\]>], would be more performant than three separate regexes.

If using the character classes, I suppose there is the valid concern about the opening bracket not lining up with the closing bracket. For what it's worth, for the last month or so that I've been using the [[:punct:]] variant, I haven't encountered any annoyances with the brackets. The character class variant may be accurate enough while being much more performant than three distinctive regexes, perhaps.

I'm not sure what kind of optimizations the regex crate provides. I did notice the RegexSet feature in the documentation: https://docs.rs/regex/latest/regex/#example-match-multiple-regular-expressions-simultaneously -- I'm unsure if this just iterates over each regex or if it actually optimizes/merges the regexes into one. It does say "in a single scan" ...

Mar 17 '23 18:03 pragma-

regex set can match multiple at once, however, when doing substitutions and extracting captures, we still need to run the individual regexes. Until regex shows up near the top of a profile, I'm not concerned about the cost of a couple of more regexes.

Mar 17 '23 18:03 wez

Just got around to trying out the nightly. The highlight option seems to solve underlining the desired captured group. Thanks!

Mar 19 '23 03:03 pragma-

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

Apr 21 '23 03:04 github-actions[bot]

wezterm wezterm copied to clipboard

URL with port specifier not recognized as URL

Describe the bug

Environment (please complete the following information):

To Reproduce

wezterm
wezterm copied to clipboard