Fix symlink creation on windows and add a test
The problem was that jj tried to set the symlink target to a path containing forward slashes.
Fixes #6934
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
View this failed invocation of the CLA check for more information.
For the most up to date status, view the checks section at the bottom of the pull request.
the commit message should have the topic file_util: to conform to our guidelines outlined here: https://github.com/jj-vcs/jj/blob/main/docs/contributing.md#commit-guidelines otherwise LGTM
Updated - you might want to call out those docs more prominently in the contribution section of the readme.
you might want to call out those docs more prominently in the contribution section of the readme.
It links to the aforementioned doc already where it is the third point (thanks to @jgilchrist making it that a few months ago). IMO, it'd be better if we had a CONTRIBUTING.md which duplicated the information.
OT: To me it just seems like people don't want to read the guidelines or only the first few sentences.
It links to the aforementioned doc already
Yeah but the reason I suggested making it more prominent is that it's not obvious that the link is to a larger "contributing" doc at all, because the text just says "a few policies and suggestions". I specifically went looking for docs on contributing and didn't follow that link because of that text, and the summary underneath further implies that there's nothing particularly relevant at the link...
It makes no sense to have great docs on contibuting and then hide them away like "oh you don't need to read that, here's the TLDR".
Thank you very much! At a glance, this LGTM. I'll wait a day or so to give a chance for other people more familiar with Windows to take a look and comment.
Re the contributing guidelines, I think Diggsey has a good point that, as long as we provide a TLDR, people will be likely to not read beyond it.
IMO, this is fine, but it seems that it sometimes does not work out great.
To explain the status quo, in my mind, the guidelines exist for reference first of all. IMO, pointing new contributors to one section of them or another should be expected. We shouldn't feel bad about it, and we certainly don't want the contributors to feel bad about it. The original reason for the TLDRs was to encourage people who might not feel like reading a long doc to contribute.
We could open an issue about this, or even a PR if there are concrete ideas for how to improve the situation.
I've updated the PR based on the feedback. For symmetry I moved the path translation logic from file_util.rs to local_working_copy.rs.
I think it might make sense in future to move away from using String to represent symlink targets internally - perhaps using BString, RepoPath or a custom type instead, so that it's less likely to accidentally treat one as a native file-system path.
I think it might make sense in future to move away from using
Stringto represent symlink targets internally - perhaps using BString, RepoPath or a custom type instead, so that it's less likely to accidentally treat one as a native file-system path.
I don't like BString because we want a type that implies a particular encoding (like utf-8) so we don't risk interpreting it as whatever encoding the file system uses. I also don't think we can use RepoPath because that only allows paths relative to the workspace root (and we allow paths outside the workspace). We could add a simple newtype around String.
I don't like BString because we want a type that implies a particular encoding (like utf-8)
I don't think git limits symlink targets to utf-8 though does it?
and we allow paths outside the workspace
Isn't this a security risk? 👀
I don't think git limits symlink targets to utf-8 though does it?
I'm pretty sure it doesn't. It also doesn't restrict paths to be utf-8 like we do. The problem with both of them being in an unspecified encoding is that it makes no sense to share such repos between users on different platforms. If I commit some files as bytes in whatever encoding my local file system uses and then I share the repo with you, the paths may not make any sense on your file system. See e.g. https://wiki.mercurial-scm.org/EncodingStrategy#The_.22makefile_problem.22.
Isn't this a security risk? 👀
As long as we don't follow links outside the repo, I think it's fine.
@yuja slash_path has input and output type of Path which I don't think makes sense for this use-case.
The way slash_path is currently used is when writing the git_target metadata file, and it's a bit of a hack for WSL, since Path is not inherently portable.
All current paths used by jujitsu fall into one of these encodings (eg. RepoPath is a type of StringPath):
Path - OS-specific encoding, non-portable, can represent all paths StringPath - UTF-8 path, portable, cannot represent all paths BytePath - byte path, portable, cannot represent all paths (on windows)
The conversions between Path and BytePath is unusual because it only supports utf-8 on windows but supports all paths on unix.
slash_path is not really a good abstraction on its own, it only makes sense when paired with path_to_bytes (which it currently is). We then have path_to_bytes(slash_path(path)) being a roundabout way of implementing the Path -> BytePath conversion for the purpose of serializing it to disk.
If I was going to spend more time on this I would replace all the ad-hoc path conversions with a strict set of conversions between these primitive path types, but I think that would make this PR grow beyond the original scope.
@yuja
slash_pathhas input and output type ofPathwhich I don't think makes sense for this use-case.
Why? target is a PathBuf, so there wouldn't be much difference between slash_path(&target).to_str() and slash_path(target.to_str()). I know symlink content isn't actually a path, but since we transform it as a path, I don't think there would be a strong reason to avoid using Path/PathBuf type.
slash_pathis not really a good abstraction on its own, it only makes sense when paired withpath_to_bytes(which it currently is). We then havepath_to_bytes(slash_path(path))being a roundabout way of implementing the Path -> BytePath conversion for the purpose of serializing it to disk.
I kind of agree that slash_path is a bit of hack, but I disagree with you about the path_to_bytes connection. slash_path can process any platform Path types because Path can be concatenated with UTF-8 string. If we're strict, the return type might have to be OsStr, though.
I think we'll have to use slash_path() to implement https://github.com/jj-vcs/jj/issues/6592.
Path - OS-specific encoding, non-portable, can represent all paths StringPath - UTF-8 path, portable, cannot represent all paths BytePath - byte path, portable, cannot represent all paths (on windows)
I'm not sure what you mean by BytePath (is that Git's tree path, or std::path::Path converted to bytes based on a certain encoding assumption?), but in any case, the scope where BytePath is used would be very restricted, so I don't think we'll need the BytePath abstraction. We'd better off using more specific types (e.g. GitTreePath) if needed.
I think we'll have to use slash_path() to implement https://github.com/jj-vcs/jj/issues/6592.
I think jj should avoid using the platform-specific Path types internally, and only use them at the boundary where it needs to interact with the OS. Internally it should use portable path representations that don't change across platforms. The reasons for this are:
- It solves #6592 .
- It reduces the amount of platform-specific edge cases that need to be handled.
- These internal paths need to be serialized in a portable way (eg. when written to the
git_targetfile, or when sent over the network.)
slash_path is a hack because it creates platform-specific Paths that nonetheless don't use the expected path separator (leading to errors like this symlink issue) because there are no good ways to "unslash" a platform-specific path, and because these paths cannot be serialized in a portable way.
I'm not sure what you mean by BytePath
A BytePath is a path encoded as a sequence of bytes without a specific text encoding and with path segments separated by forward slashes (ie. a unix file path or git tree path). The contents of the git_target file in jujutsu is currently a BytePath, since it is not required to be valid utf-8, and it does no "transcoding" of unix paths.
One might choose a BytePath-like representation over a StringPath if you need to store absolute paths, since a parent path segment may contain invalid-UTF8.
TLDR: I'm advocating using stronger abstractions around paths, to avoid the need for "fix-up" style functions which change the value of a path without changing the type of the path, since this is inherently error prone.
I think jj should avoid using the platform-specific Path types internally, and only use them at the boundary where it needs to interact with the OS.
I agree, and I think the current implementation basically does that. Internal paths are RepoPath, and converted to/from std::path::Path. A symlink content isn't RepoPath because it is an arbitrary user data (like file content.)
ui.slash handling will be implemented at the boundary, and I think the ui.slash feature itself is a hack. As you said, the native separator on Windows shouldn't be "/", but some users want it because it's more convenient in some circumstances.
I'm not sure what you mean by BytePath
A BytePath is a path encoded as a sequence of bytes without a specific text encoding and with path segments separated by forward slashes (ie. a unix file path or git tree path). The contents of the
git_targetfile in jujutsu is currently a BytePath, since it is not required to be valid utf-8, and it does no "transcoding" of unix paths.
That seems wrong. git_target is platform-dependent data. It's bytes on Unix, and UTF-8 on Windows. Paths in Git tree objects are happened to be the same (or to be more accurate, we simply expect Git paths are UTF-8), though.
ui.slash handling will be implemented at the boundary
When I have an output like:
M lib\src\local_working_copy.rs
Why not just print the RepoPath instead:
M lib/src/local_working_copy.rs
No need to be printing platform-dependent paths at all.
That seems wrong. git_target is platform-dependent data. It's bytes on Unix, and UTF-8 on Windows
Except you've swapped the slashes to make it more compatible with WSL, so why not just say it's a platform-independent path? Writing from one platform and reading from another is already a supported use-case.
or to be more accurate, we simply expect Git paths are UTF-8
There's no translation on unix between git and fs paths, so really there's an assumption that all paths are super-sets of UTF-8.
No need to be printing platform-dependent paths at all.
unless the user relies on copy/pasting it to another command or program
unless the user relies on copy/pasting it to another command or program
Not sure what you mean - I'm saying that the "ui.slash" option being proposed should semantically be an option for changing between displaying internal "repo paths" vs external "platform dependent paths". Whether or not you agree that these are the correct semantics the behaviour seen by the user would be the same.
ui.slash handling will be implemented at the boundary
When I have an output like:
M lib\src\local_working_copy.rsWhy not just print the RepoPath instead:
M lib/src/local_working_copy.rs
The output is relative to cwd, not to the workspace root.
That seems wrong. git_target is platform-dependent data. It's bytes on Unix, and UTF-8 on Windows
Except you've swapped the slashes to make it more compatible with WSL, so why not just say it's a platform-independent path? Writing from one platform and reading from another is already a supported use-case.
Because the encoding might be different. It's usually UTF-8 these days, though.
or to be more accurate, we simply expect Git paths are UTF-8
There's no translation on unix between git and fs paths, so really there's an assumption that all paths are super-sets of UTF-8.
Yes. Our assumption is that paths in Git trees are UTF-8, but that's not guaranteed. If there were non-UTF-8 paths, jj would fail to check out the commit.
Issue #6934 is biting me hard on Windows - are there any plans to land this in the near future, or are y'all looking for more help to land this? If the latter happy to assist in any way possible to add better support for those infernal symlinks on Windows.
~~Hmm scratch that, fix in this PR does not fix the issue for me it seems.~~ I had a rebase already partially gone bad hence why fix in this PR didn't seem to work. I can confirm though that it indeed does fix rebasing between two bookmarks that are out-of-sync and there were potentially changes in one of the bookmark that moved symlinks around in a remote git repo.