bazel-lib icon indicating copy to clipboard operation
bazel-lib copied to clipboard

Refactor mtree_spec to vis encode filenames

Open pcj opened this issue 1 year ago • 6 comments

This PR refactors the mtree_spec rule:

Instead of emitting the .spec output file directly in it's final formatted form, it has been changed to emit a jsonl equivalent. This intermediate file is then processed line-by-line in a shell script with jq to assist with parsing. The filenames (and content) are encoded with vis prior to final mtree line formatting. A golden test has been added to e2e/smoke.

NOTE: this PR assumes the vis tool is present on the host machine (this is generally true for macos and linux). A future improvement could bundle the vis implementation from go-mtree.

Fixes #794

pcj avatar Mar 20 '24 04:03 pcj

We are trying hard to make our bsdtar more hermetic (likely something like https://github.com/bazelbuild/bazel-central-registry/pull/1662 with @dzbarsky ) so I'm pretty sad to regress that by assuming vis is on the host machine. We have a lot of users under nix or in a distroless/wolfi image that has a stripped down distro.

Sure. I intended this PR as a workable example that allows me to move forward, and possibly as a base for someone from your team to continue with. Maybe include go-mtree as a dependency and build/distribute pre-built binaries? Alternatively, a simple but dedicated C/C++ program using libarchive that gets built for the target architecture?

I'm also not clear why it's beneficial to encode into jsonl as a transport between starlark and the multi-line run_shell command - can't the latter read data from the struct directly?

Yeah there certainly many possible ways to pass the information into a separate subprocess that generates the final mtree output. I also wasn't totally clear on the code that generates the lines (first iterates through ctx.files.srcs, then the DefaultInfo of each one in ctx.attr.srcs (it is not redundant?)). In any case, I didn't want to mess with that so I made a minimal change to generate something (JSON) that I had a parser for (jq) without including additional tool dependencies (and without slurping everything into memory at once since these files can be huge). As long as the final output is correct, I of course don't care how it's done, feel free to revamp it.

pcj avatar Mar 22 '24 17:03 pcj

Thanks Paul - I'm not sure when I'll get time to work on this, but I imagine you can already patch in this change in the spot you need it, so you're not blocked on us merging and releasing?

alexeagle avatar Mar 25 '24 16:03 alexeagle

Tried this out locally on my linux machine (stock Ubuntu 22.04.4 LTS), and I don't have vis installed. So I think we need a hermetic binary we can fetch and run. @dzbarsky do you have some clever ideas where we should get it?

alexeagle avatar Apr 24 '24 16:04 alexeagle

I just ran into a nodejs dependency where the directory name included a space in it that is not covered by #835. I am using bazel_dep(name = "aspect_bazel_lib", version = "2.7.8")

The line in my mtree spec is

src/typescript/tours-api/tours-api.sh.runfiles/npm/node_modules/thread-stream/test/dir with spaces uid=0 gid=0 time=1672560000 mode=0755 type=dir

I had claude generate this genrule to cover it. So far tested on mac, need to test on linux still but its a simple awk script so i'm hopeful shelling to the host awk wont be a problem

    native.genrule(
        name = name + ".mf-sanitized",
        srcs = [name + ".mf"],
        outs = [name + ".mf-sanitized.spec"],
        cmd = """
awk '
    BEGIN {FS="uid="; OFS=""}
    {
        gsub(/ /, "\\\\040", $$1)
        sub(/\\\\040$$/, " ", $$1)
        print $$1 "uid=" $$2
    }
' "$(location """ + name + """.mf)" > "$@"
""",
    )

tkinz27 avatar Jul 17 '24 16:07 tkinz27

I just ran into a nodejs dependency where the directory name included a space in it that is not covered by #835

The line in my mtree spec is

src/typescript/tours-api/tours-api.sh.runfiles/npm/node_modules/thread-stream/test/dir with spaces uid=0 gid=0 time=1672560000 mode=0755 type=dir

I had claude generate this genrule to cover it. So far tested on mac, need to test on linux still but its a simple awk script so i'm hopeful shelling to the host awk wont be a problem

    native.genrule(
        name = name + ".mf-sanitized",
        srcs = [name + ".mf"],
        outs = [name + ".mf-sanitized.spec"],
        cmd = """
awk '
    BEGIN {FS="uid="; OFS=""}
    {
        gsub(/ /, "\\\\040", $$1)
        sub(/\\\\040$$/, " ", $$1)
        print $$1 "uid=" $$2
    }
' "$(location """ + name + """.mf)" > "$@"
""",
    )

tkinz27 avatar Jul 17 '24 16:07 tkinz27