bazel icon indicating copy to clipboard operation
bazel copied to clipboard

Allow spaces in runfile source and target paths

Open fmeum opened this issue 1 year ago • 4 comments

Adds support for spaces in source and target paths of runfiles symlinks to build-runfiles as well as support for spaces in source paths to the Bash, C++, and Java runfiles libraries (the Python runfiles library has moved out of the repo).

If a symlink has spaces in its source path, it is represented in the manifest with a prefix of " <length> " (without the quotes), where <length> is the length in bytes of the source path. This scheme has been chosen as it has the following properties:

  1. There is no change to the format of manifest lines for entries whose source and target paths don't contain a space. This ensures compatibility with existing runfiles libraries.
  2. There is even no change to the format of manifest lines for entries whose target path contains spaces but whose source path does not. These manifests previously failed in build-runfiles, but were usable on Windows and supported by the official runfiles libraries. This also ensures that the initialization snippet for the Bash runfiles library doesn't need to change, even when used on Unix with a space in the output base path.
  3. The scheme is fully reversible and only depends on the source path, which gives runfiles libraries a choice between reversing the escaping when parsing the manifest (C++, Java) or applying the escaping when searching the manifest (Bash).

Newlines remain unsupported with this PR, but partial support could be added in a backwards-compatible way as a follow up: Assuming that neither the output base nor the workspace have paths containing newlines and ignoring custom (root) symlinks, the number of newline characters in the source path of a manifest entry equals the number of such characters in the target path. When combined with the length encoding introduced by this PR, this could be used to unambiguously parse entries with newlines out of the manifest. It is unclear whether this would be worth the increase in complexity for runfiles libraries and build-runfiles.

RELNOTES: Bazel now supports spaces in the rlocation and target paths of runfiles and can be run from workspaces with a space in their full path.

Fixes #4327

fmeum avatar Aug 19 '24 11:08 fmeum

@lberki I assigned you as reviewer for your general knowledge of runfiles, but feel free to redirect if needed.

fmeum avatar Aug 19 '24 20:08 fmeum

CC @meteorcloudy for Windows

fmeum avatar Aug 19 '24 21:08 fmeum

Hi @meteorcloudy, Since i can see that this PR has been approved, Please let us know whether gTech team should proceed with importing it. Thanks!

sgowroji avatar Aug 21 '24 05:08 sgowroji

@sgowroji I would prefer to wait for a second opinion, for example by @lberki.

fmeum avatar Aug 21 '24 06:08 fmeum

@fmeum: is the aim to get this into Bazel 8.0.0?

matthewjh avatar Sep 19 '24 13:09 matthewjh

In principle, adding the functionality to encode spaces in runfiles manifests is fine. I'm not sure, though, how I feel about this particular one because it's tailored to spaces and only spaces. How about implementing a scheme that could be more easily extended?

Spitballing, something like this: if a line in the runfiles manifest starts with a space, the algorithm to parse the line is as follows: the separator is between the two paths is a space, end of the line is a newline, any other character is a file name. If a file name contains a space, a newline or a backslash, these need to be escaped as \s, \n or \b, respectively, any other backslash sequences are reserved for future extension.

This is easy to implement, provably supports everything is just as incompatible with the existing system and looks more conventional than the encoding proposed in this pull request.

@fmeum WDYT? It looks like if we go through the pain of adding syntax to runfiles manifests, we could as well make it future-proof. I could also be convinced to make them some subset of JSON so that parsing them is possible with standard libraries but also with random logic so that one doesn't need to add additional dependencies.

lberki avatar Sep 19 '24 14:09 lberki

This is easy to implement, provably supports everything is just as incompatible with the existing system and looks more conventional than the encoding proposed in this pull request.

It lacks one important property, but we can also fix that: With this scheme in its current form, the Bash runfiles setup snippet as well as third-party runfiles libraries would need to be modified to continue to support manifest entries in which the rlocationpath contains no spaces, but the target path does (common on Windows).

We can fix that by not escaping spaces in the target paths and declaring that the first space on a line that doesn't start with a space is the separator. We would then only add a space at the beginning of a line if either the rlocationpath contains a space or newline or if the target path contains a newline.

At that point it's similar to the scheme in this PR, it just uses escape characters instead of length encoding. I agree that escaping is more conventional and probably not much harder to implement in most languages (it does change the length of the strings vs. just taking a substring, but that seems okay).

@lberki What do you think, should I go ahead and implement the modified version of your scheme?

fmeum avatar Sep 20 '24 08:09 fmeum

I implemented the scheme without support for newlines for now. They could always be supported in a follow-up, which is much simpler with the new scheme (just introduce a new escape character everywhere).

fmeum avatar Sep 20 '24 12:09 fmeum

The modified scheme looks reasonable. I'm travelling today so I can't review the pull request but will do so as soon as I can.

The main question in my mind is whether we do some sort of custom encoding (like this one) or go with a standard one. Of the standard ones, JSON seems to be the most palatable given the use case (simple list of key/value pairs intended to be parseable without any dependencies), but I'm not sure if it's worth the hassle and if it is, we are probably better off having an additional file called MANIFEST.json or something like that

lberki avatar Sep 23 '24 08:09 lberki

The main question in my mind is whether we do some sort of custom encoding (like this one) or go with a standard one. Of the standard ones, JSON seems to be the most palatable given the use case (simple list of key/value pairs intended to be parseable without any dependencies), but I'm not sure if it's worth the hassle and if it is, we are probably better off having an additional file called MANIFEST.json or something like that

Based on the churn that making runfiles work with Bzlmod caused, I'm pretty sure that this would not be worth it: We can't just replace the current MANIFEST format in an incompatible way, so we would be bound to make the procedure that finds dir or manifest even more complicated. At the same time, we wouldn't gain much as spaces in target paths are already supported, spaces in source paths are rare and newlines are so fringe that I don't think anybody cares about them.

As much as dislike a new custom format, it does allow runfiles lib authors to adopt support for spaces in source paths whenever they feel like it and without any pressure by an incoming incompatible change.

(I am planning to eventually simplify the way runfile dirs and manifests are discovered, but this needs more thought)

fmeum avatar Sep 23 '24 08:09 fmeum

@bazel-io fork 7.4.0

fmeum avatar Sep 23 '24 12:09 fmeum

@lberki This is ready for review again, I fixed all test cases.

fmeum avatar Sep 27 '24 16:09 fmeum

This is a bit of a doozy -- Google disallows the usage of #include <regex>, requiring the usage of re2 instead. Given the usage of regexes in this PR seems simple enough, could you replace them with some helper methods instead?

Wyverald avatar Oct 03 '24 22:10 Wyverald

@Wyverald Done

fmeum avatar Oct 04 '24 08:10 fmeum