julia icon indicating copy to clipboard operation
julia copied to clipboard

Add filesystem func to transform a path to a URI

Open tecosaur opened this issue 1 year ago • 13 comments

In a few places across Base and the stdlib, we emit paths that we like people to be able to click on in their terminal and editor. Up to this point, we have relied on auto-filepath detection, but this does not allow for alternative link text, such as contracted paths.

Doing so (via OSC 8 terminal links for example) requires filepath URI encoding.

This functionality was previously part of a PR modifying stacktrace printing (#51816), but after that became held up for unrelated reasons and another PR appeared that would benefit from this utility (#55335), I've split out this functionality so it can be used before the stacktrace printing PR is resolved.

tecosaur avatar Aug 11 '24 05:08 tecosaur

This seems good on *nix machines, I could do with a little help working out what should happen with windows drive path components though.

Edit: I think I've worked it out.

tecosaur avatar Aug 11 '24 07:08 tecosaur

Should this call expanduser first?

I'm not sure on this. I wish otherwise but get the impression that most of the Base.Filesystem functions don't auto-expand ~, and so it may be most consistent to not.

tecosaur avatar Aug 12 '24 03:08 tecosaur

Other than that, it just looks like I need to fix the Windows tests. It looks like uripath doing the right thing though:

  Expression: Base.Filesystem.uripath("$(absdrive)some$(sep)file.txt") == "file://$(host)/$(drive)some/file.txt"
   Evaluated: "file://win2k22-amdci6-4/C%3a/some/file.txt" == "file://win2k22-amdci6-4/C:\\some/file.txt"
               ^ looks good to me                              ^ needs updating

tecosaur avatar Aug 12 '24 03:08 tecosaur

I don't think it should add the hostname into the URI if the user provided path doesn't have one. That is certainly true on Windows, but I also don't think it should be done on Unix/Mac.

On Windows, there are local paths and UNC paths. A URI with a hostname converts to a UNC path. So, the current implementation doesn't round trip: if one starts with a local path, converts to a URI and then back to path, one ends up with a UNC path. Plus, such a UNC path will be invalid, one can't just construct a valid UNC path by concatenating a hostname with a local file path (UNC paths are based on shares).

The percent encoding should use upper-case letters (https://www.rfc-editor.org/rfc/rfc3986#section-2.1).

I would just mimic the VS Code implementation https://github.com/microsoft/vscode-uri/blob/6dec22d7dcc6c63c30343d3a8d56050d0078cb6a/src/uri.ts#L302 :)

It might also be nice to have an optional flag where one can specify whether the path is a windows path or a unix path, so that the platform specific behavior is the default, but one also has access to the other form.

If the hostname is added (which I think it shouldn't be) then it should probably also be encoded?

I don't understand why the encoding is done per path segment? Why not just call encode on the whole path component as a whole?

davidanthoff avatar Aug 13 '24 18:08 davidanthoff

I don't think it should add the hostname into the URI if the user provided path doesn't have one. That is certainly true on Windows, but I also don't think it should be done on Unix/Mac.

That is contrary to the impression I have from reading the referenced resources, for example the hostname component is mandatory in RFC1738. Yes, file:/// works on many systems but that's a shorthand in the Freedesktop File URI spec. However, I've seen this recommended against on the basis that both (1) it's a common extension of the RFCs, but not in keeping with them and (2) you can't just assume a link is always displayed on the same machine that has the file, people do use SSH sessions.

The percent encoding should use upper-case letters (https://www.rfc-editor.org/rfc/rfc3986#section-2.1).

Ta.

It might also be nice to have an optional flag where one can specify whether the path is a windows path or a unix path, so that the platform specific behavior is the default, but one also has access to the other form.

I'd be inclined to wait till a use case for this appeared. This is a private API currently, so we're not locking ourselves into anything.

If the hostname is added (which I think it shouldn't be) then it should probably also be encoded?

Ah yep, that was an oversight. Thanks.

I don't understand why the encoding is done per path segment? Why not just call encode on the whole path component as a whole?

Because that's how I thought to do it, that also seems like a nice approach, I'll just add / to the regex ignore list.

tecosaur avatar Aug 14 '24 04:08 tecosaur

the hostname component is mandatory in RFC1738

No, that is not correct :) https://www.rfc-editor.org/rfc/rfc1738#section-3.10 is pretty clear that <host> can be the empty string, and then you end up with file:/// according to the spec.

The key thing is that on Windows things will just not work if you put in the hostname when you start with a regular local path. If you start with a path C:\foo\bar.txt on a computer with hostname MYCOMPUTER and use the code in this PR to create a URI you get file://MYCOMPUTER/C%3A/foo/bar.txt. The correct way to turn that into a path then gets you \\MYCOMPUTER\C:\foo\bar.txt, which is a completely different path from what we started with (in fact, it appears like a UNC path but is not a valid one). The right way to do this (on Windows) is to check whether your original path is a local path or a UNC path. If local, leave the hostname empty, if it is a UNC path, parse it into the hostname and the rest of the path and then construct a URI with a hostname and path element. If Julia were to output URIs with a hostname for local paths, none of the links that the existing terminals in say VS Code or the default Windows Terminal create for that will work.

I'm less familiar with Linux/Mac, but it seems to me that given that the URI spec allows for representation of paths with and without a hostname, we should just faithfully convert paths that start without a hostname to a uri without a hostname, and paths with a hostname to a uri with a hostname. That certainly seems to be the way other languages do it.

Oh, and another thing: the set of characters that need to be percent encoded is different for different URI components. I don't remember whether the authority and the path component have the same set, but it probably would be good to check because one might not be able to use exactly the same algorithm for the two parts.

davidanthoff avatar Aug 14 '24 07:08 davidanthoff

No, that is not correct :) https://www.rfc-editor.org/rfc/rfc1738#section-3.10 is pretty clear that can be the empty string, and then you end up with file:/// according to the spec.

Ah, so it is! I'd focused on the BNF in section 5, which has:

host           = hostname | hostnumber
fileurl        = "file://" [ host | "localhost" ] "/" fpath

As an aside, I've just noticed we probably want to reference https://www.rfc-editor.org/rfc/rfc8089 instead of RFC1738 (it lists an empty host uri as an example in Appendix B, FYI).

The key thing is that on Windows things will just not work if you put in the hostname when you start with a regular local path. If you start with a path C:\foo\bar.txt on a computer with hostname MYCOMPUTER and use the code in this PR to create a URI you get file://MYCOMPUTER/C%3A/foo/bar.txt. The correct way to turn that into a path then gets you \\MYCOMPUTER\C:\foo\bar.txt, which is a completely different path from what we started with (in fact, it appears like a UNC path but is not a valid one).

Ok, sounds like we might want to treat Windows specially based on what you're saying, but there are two aspects to this I'd like to raise first:

  1. Regardless of UNC-path-ness, the file://MYCOMPUTER/... url seems completely valid to me based on the relevant specs. Or is there something I'm missing?
  2. From a glance at https://learn.microsoft.com/en-us/dotnet/standard/io/file-path-formats#unc-paths, it seems as though \\MYCOMPUTER\C$\foo\bar.txt would be a valid UNC path (as long as gethostname() provides the NetBIOS machine name).

If Julia is supposed to accept UNC paths, it seems like we'd need to be able to split a windows path into three components: server, share, path.

If Julia were to output URIs with a hostname for local paths, none of the links that the existing terminals in say VS Code or the default Windows Terminal create for that will work.

I'll happily take your word on this, but I'd be interested in hearing how testing this goes if you haven't already.

I'm less familiar with Linux/Mac, but it seems to me that given that the URI spec allows for representation of paths with and without a hostname, we should just faithfully convert paths that start without a hostname to a uri without a hostname, and paths with a hostname to a uri with a hostname.

Unix paths don't have a host component the way that Windows UNC paths seem to.

That certainly seems to be the way other languages do it.

Other languages can have libraries that are able to infer whether a hostname component is appropriate or not? I'd be interested to see that.

Anyway, while it sounds like Windows might be a bit more particular, the *nix machines I've used have been perfectly happy with hostname-form file:// URIs. At least for *nix machines then, I'm inclined to take the approach that it's better to have the hostname component and not need it, that need it and not have it.

Oh, and another thing: the set of characters that need to be percent encoded is different for different URI components. I don't remember whether the authority and the path component have the same set, but it probably would be good to check because one might not be able to use exactly the same algorithm for the two parts.

From my reading of the last paragraph of https://www.rfc-editor.org/rfc/rfc3986#section-3.2.2 and https://www.rfc-editor.org/rfc/rfc3986#section-3.3 it seems like percent-encoded octets is appropriate for both, skipping the characters listed in https://www.rfc-editor.org/rfc/rfc3986#section-2.3.

tecosaur avatar Aug 14 '24 08:08 tecosaur

file://MYCOMPUTER/... url seems completely valid to me based on the relevant specs

Yes, it is a valid URI that decodes to a UNC path.

it seems as though \\MYCOMPUTER\C$\foo\bar.txt would be a valid UNC path

Yes, those are so called administrative shares. They exist by default if network sharing has been enabled on a machine, and only administrators can access them.

If Julia were to output URIs with a hostname for local paths, none of the links that the existing terminals in say VS Code or the default Windows Terminal create for that will work.

I'll happily take your word on this, but I'd be interested in hearing how testing this goes if you haven't already.

I just tried it in Windows Terminal. It handles file URIs without an authority just fine, and for URIs with an authority it says that those link types aren't supported.

In the VS Code terminal many more things work. file://DANTHOFF-HOME/C$/Users will open the network share, i.e. it turns that into an UNC path and then opens that. file://DANTHOFF-HOME/C:/Users (surprisingly) also works, and it actually does not turn that into a UNC path, but instead seems to recognize that the hostname matches the local hostname, and then just access things via the local file path. To me that seems a bit weird, to be perfectly honest, I think they must have some logic when to turn an URI with a hostname into a local file path vs UNC path that is not entirely clear to me.

I just stumbled over another thing: I think some of the RFCs seem to suggest that the hostname must be a fully qualified domain name or an IP address, not just a hostname?

davidanthoff avatar Aug 14 '24 20:08 davidanthoff

Yes, those are so called administrative shares. They exist by default if network sharing has been enabled on a machine, and only administrators can access them.

:+1: good to know

I just tried it in Windows Terminal. It handles file URIs without an authority just fine, and for URIs with an authority it says that those link types aren't supported. In the VS Code terminal many more things work [...] I think they must have some logic when to turn an URI with a hostname into a local file path vs UNC path that is not entirely clear to me.

This is very useful to hear.

I just stumbled over another thing: I think some of the RFCs seem to suggest that the hostname must be a fully qualified domain name or an IP address, not just a hostname?

This is a bit muddy. RFC8089 indeed describes it as a fully qualified domain name or IP, but then in Appendix E.3 on UNC paths, talks about using the UNC hostname in place of a fully qualified domain name.

That said, when including the hostname I paid particular attention to the Freedesktop URI spec, which includes:

Hostnames
---------

When generating a file: uri the hostname part, if nonempty, should be
whatever is returned from gethostname(). This means that the name is
canonical for all users on the same machine, so that you can easily
see if the referenced file is on the current machine. Note that
"localhost" or an empty hostname needs to be handled specially, always
meaning the host the uri is being interpreted on.

In some cases gethostname() will happen to be a fully qualified domain name, e.g. on the CI runner one of the tests produces file://amdci4.julia.csail.mit.edu/some%20file%20with%20%5Eodd%25%20chars.

So, the current approach I've taken (see recent force pushes) is to:

  • Use gethostname() as the hostname on *nix machines
  • Extract the hostname component from UNC paths on Windows
  • Use an empty hostname for local paths on Windows

From our conversation and the specs, this seems to me like a good enough approach to try initially.

tecosaur avatar Aug 15 '24 05:08 tecosaur

From our conversation and the specs, this seems to me like a good enough approach to try initially.

Agreed!

I stumbled over another interesting question: what to do about WSL (Windows Subsystem for Linux), i.e. if a user runs Julia in WSL on Windows. They will be running a Linux build of Julia, and Julia will essentially run in a Linux VM, but the terminal is running on Windows. The links generated right now won't work in that Windows environment. Windows Terminal does allow for two specific hostnames in file URIs, namely wsl$ and wsl.localhost (https://github.com/microsoft/terminal/pull/14993). In an ideal world we would detect that case, and then output URIs that use one of those hostnames. There is another wrinkle, namely the URI would also have to include the distribution name. From within WSL one can run wslpath -m <PATH> to generate a path that works from Windows to access the WSL file system, and so if one turns that into a file link, everything should work well. In theory ;)

davidanthoff avatar Aug 15 '24 19:08 davidanthoff

Hmmm, that is an annoying wrinkle. From a quick read online it seems like checking for the existence of /proc/sys/fs/binfmt_misc/WSLInterop might be the way we should check for WSL. It's what snapd uses, and seems to be used widely beyond it.

There is another wrinkle, namely the URI would also have to include the distribution name. From within WSL one can run wslpath -m <PATH> to generate a path that works from Windows to access the WSL file system, and so if one turns that into a file link, everything should work well. In theory ;)

If you can give me more detail on what this looks like in practice, then I'll see about doing this :slightly_smiling_face:

tecosaur avatar Aug 24 '24 06:08 tecosaur

Actually, I think I might have figured it out.

tecosaur avatar Aug 24 '24 07:08 tecosaur

Running wslpath -m / inside WSL returns a path that includes the hostname and the distribution name. So I guess one way would be to run that once, cache the results and use that? Or maybe find the source for that and see what they are doing...

davidanthoff avatar Aug 25 '24 16:08 davidanthoff

When this is ready it would be good to use it for the generic branch here https://github.com/JuliaLang/julia/blob/bada969c14fd8f58804b662f5cca44808bb52433/stdlib/Profile/src/Profile.jl#L827 that just went in in https://github.com/JuliaLang/julia/pull/55335

Also, if there are any improvements to be made to the IDE-specific formats above that might have been garnered from this PR work.

IanButterworth avatar Sep 05 '24 00:09 IanButterworth

I propose we merge this. It seems to me handling WSL edge cases can be future improvements?

IanButterworth avatar Sep 24 '24 17:09 IanButterworth

From the docs I've read, the code I've written should handle WSL, but there's nothing like somebody actually testing it.

It can always be improved in follow-up PRs :slightly_smiling_face:

tecosaur avatar Sep 24 '24 17:09 tecosaur