bazel-lib
feat: vis-encode all path characters except visible ASCII
This allows archive input files to include Unicode characters. It should also support the more esoteric corners of the ASCII set (e.g. backslashes, newlines, control codes), if the user can find a way to convince Bazel to use such a file. Actually using such files probably isn't a good idea, but it is nice to have confidence that this ruleset won't choke if it comes up.
Generating an archive with non-ASCII characters reports a diagnostic. However, this does not appear to correspond to a real issue: the file is still placed into the archive with the expected path and contents.
INFO: From Tar lib/tests/tar/7.tar:
tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8
Encoding of 7-bit ASCII could be performed in Bazel. However, Starlark does not provide
access to a string's bytes, only its codepoints, at least until the spec and implementation work
to get a bytes type lands, so non-ASCII characters cannot be encoded there.
So a second, out-of-process build-time pass is needed. Since a separate pass
is needed anyway, we move most of the encoding responsibility to this external script
(except for the characters it cannot handle without interfering with the mtree file format: newline, space, and backslash).
We use gawk for this second pass. A prior attempt
at this feature did a similar external pass with the vis utility, but it was
not accepted because the vis tool is not expected to be available everywhere.
gawk is (TODO: will be) provided by a toolchain from this repo.
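
For readers unfamiliar with vis-style encoding, here is a minimal Python sketch of the transformation described above. It is not the code in this PR: the PR splits the work between Starlark (newline, space, backslash) and gawk (everything else), whereas this sketch collapses both passes into one hypothetical function, `vis_encode_path`. It also assumes the three-digit octal `\ooo` escape form is what the mtree consumer expects.

```python
def vis_encode_path(path: str) -> str:
    """Escape every byte that is not visible ASCII as a \\ooo octal sequence.

    Hypothetical helper for illustration only. Assumes UTF-8 paths and the
    three-digit octal escape form.
    """
    out = []
    for byte in path.encode("utf-8"):
        # Visible ASCII is 0x21 ('!') through 0x7E ('~'). The backslash
        # (0x5C) is the escape character itself, so it is escaped as well.
        if 0x21 <= byte <= 0x7E and byte != 0x5C:
            out.append(chr(byte))
        else:
            out.append("\\%03o" % byte)
    return "".join(out)


if __name__ == "__main__":
    print(vis_encode_path("srcdir/Unicode® support"))
```

Working on bytes rather than codepoints is the key point: `®` is a single codepoint but two UTF-8 bytes, and each byte gets its own escape.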
We also take this opportunity to canonicalize the paths derived from user-provided mtree specs when computing the unused inputs set. This should ensure that the comparison is exact, which substantially reduces the risks associated with this feature and should enable it to become a default.
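
To illustrate why canonicalization makes the comparison exact, here is a hedged Python sketch, not the implementation in this PR. It assumes canonicalization means decoding any `\ooo` octal escapes back to bytes and then re-encoding with a single fixed rule; the names `decode_octal_escapes` and `canonical_path` are hypothetical.

```python
import re

# Matches a backslash followed by a three-digit octal value in 000..377.
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def decode_octal_escapes(encoded: str) -> bytes:
    """Turn \\ooo escapes back into raw bytes (hypothetical helper)."""
    raw = encoded.encode("utf-8")
    return _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)


def canonical_path(encoded: str) -> str:
    """Decode, then re-encode everything except visible ASCII as \\ooo."""
    out = []
    for byte in decode_octal_escapes(encoded):
        if 0x21 <= byte <= 0x7E and byte != 0x5C:
            out.append(chr(byte))
        else:
            out.append("\\%03o" % byte)
    return "".join(out)


if __name__ == "__main__":
    # Two spellings of the same path, as a user-provided spec might contain.
    spellings = ["dir/Unicode\u00ae", "dir/Unicode\\302\\256", "dir/\\125nicode\u00ae"]
    print({canonical_path(s) for s in spellings})
```

Because every spelling of a path collapses to one string, set membership tests against the declared inputs no longer depend on how the user happened to escape the path.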
Fixes https://github.com/bazel-contrib/bazel-lib/issues/794
FYI, this patch is fairly substantive, but I believe there are separable elements. I've filed it as one PR because this is my desired end-state. The str.maketrans/str.translate feature could be its own PR if we wanted to expose it as a public API. And just the 7-bit encoding could be a separate, intermediate improvement. Please let me know if you want to land this in phases; either the ones I've outlined or some other partitioning scheme.
Moved to https://github.com/bazel-contrib/tar.bzl as https://github.com/bazel-contrib/tar.bzl/pull/3 to follow the tar rule to its new home.
thanks @plobsing - sorry for the churn