rules_pkg icon indicating copy to clipboard operation
rules_pkg copied to clipboard

Packaging rules should support --stamp flag

Open nacl opened this issue 3 years ago • 17 comments

#285 requested a feature to preserve mtime's in tar files. @aiuto brought up the below WRT how this might be assembled:

Originally posted by @aiuto in https://github.com/bazelbuild/rules_pkg/issues/285#issuecomment-770153075

Let's discuss solutions in an issue first. I would like to to have as many people as possible weigh in on what is useful. There are a lot of questions

  • do we change the stamp of every file?
  • does that include files which came from pre-made archives that have a particular timestamp of their own?
  • some people want to put the timestamp into the output filename. While I find that dubious, any stamping solution should consider it.
  • Why is stamp_mtime an attribute? Why not just support the --stamp flag. That seems to be the API of least surprise.
  • and.... no solutions that just to .tar. We need do do the appropriate thing for each of pkg_zip, pkg_deb and pkg_rpm. But that makes getting the API right before doing any code even more important.

This issue tracks the design of a "stamping" mechanism for packages prior to implementation, especially with regards to fixing/clamping file modifications. Whether the --stamp argument should control this is also a question to consider.

nacl avatar Feb 03 '21 17:02 nacl

AFAIK, my particular use case does not have a preference WRT preserving mtimes, but some things that immediately come to mind are:

  • Some rules support SOURCE_DATE_EPOCH (e.g. pkg_rpm), others do not. It also applies to more than mtimes, so that may be out of scope here.

  • --stamp feels like the tool of choice here, but BUILD_TIMESTAMP may also not be the right value to use. Bazel provides it for free.

nacl avatar Feb 03 '21 17:02 nacl

The least surprising behavior is that all the top level pkg rules implement stamp as done in cc_binary. I consider that a minimal requirement.

stamp

  • Integer; optional; default is -1
  • Enable link stamping. Whether to encode build information into the binary. Possible values:
    • stamp = 1: Stamp the build information into the binary. Stamped binaries are only rebuilt when their dependencies change. Use this if there are tests that depend on the build information.
    • stamp = 0: Always replace build information by constant values. This gives good build result caching.
    • stamp = -1: Embedding of build information is controlled by the --[no]stamp flag.

#288 demonstrates how the default behavior could be done. That is, I can get the build time for stamping if --stamp is set on the command line. I didn't do the 0 and 1 cases, nor implement for every rule.

An orthogonal feature to that is if we change the time stamps of files expanded out of archives (if I am reading #265 right). That is it's own debate. The request I have seen in the past is along the lines of

  • I have a bundle of static files that will be served by a web service
  • they have a set of creation dates.
  • I bundle them into a zip file and check into my source tree
  • I want to include them in a .deb package, but individually not as one zip
  • I want the original time to be preserved.

I think we need provision for that kind of behavior. I don't have a design in mind right now. I don't think it is a blocker for --stamp support, as long as we design with the idea in mind that individual files might need to say they are immune to stamping @nacl is rpm's use of SOURCE_DATE_EPOCH similar to this?

Then there is the question of using the time stamp in the archive name This is difficult for many reasons and we should not block basic stamping on that.

aiuto avatar Feb 04 '21 04:02 aiuto

Thanks for following this up and your comments.

My main motivation is to set a more dynamically mtime for the input files. At the moment either the default behavior with portable or a static mtime can be used, but I'm looking for a "self-adjusting" mtime based on the input files. The idea was to use a new volatile status variable which is for example set by a bash script based on the input files' mtime or git command to get a less changing variable than the BUILD_TIMESTAMP.

stamp = 1: Stamp the build information into the binary. Stamped binaries are only rebuilt when their dependencies change. Use this if there are tests that depend on the build information.

If I understand this correctly with "stamp = 1" the rebuild is only triggered if the input of the specific rule change, i.e. the BUILD_TIMESTAMP is relatively close to the recent mtime of the input files. The only drawbacks are fresh builds, e.g. if no cache is available. #285 had an other intention, but if the stamp behavior is closer to the build-in rules and less complex I guess this would even fit my needs somehow.

I will have a closer look into #288.

dennis3620 avatar Feb 04 '21 19:02 dennis3620

Maybe we should start from needs for stamping, and then work back to features to achieve that.

  1. We certainly need the default to be a repeatable constant value. This gives us easily testabl repeatability. That should be the default behavior for all rules.
  2. On the day we want to cut a release, we probably want the time of build to be used in the binary. Here I think the default Bazel stamp behavior works well. Builds use a constant mtime by default, so during development and testing you don't overbuild. When you use --stamp, you want the package to use time of build
  3. Some targets are "extra-volatile". This is the stamp=1 case. Maybe we have a process where we want to do something like bazel run :push, where push depends on a .deb pacakge we built. Should multiple runs in a row rebuild the .deb each time, even if no other sources changed? I think the current Bazel idea is right here too. Do not rebuild the binary package, but do run the action which depends on it.
  4. Is there a need for users to specify a build time which is neither the epoch constant or the current time? That is the case I think you are trying to provide for.

Let's say there is a user for case 4. A hypothetical is that I am cutting a product release on Feb. 4, and I want the stamps to be 2020-02-04 33:22:11. Maybe I used the date as is, but encode something in the hh:mm:ss. We obviously don't want to have to set that in each rule, so the value has to come from the command line in some way. We would either pass the value (which requires a blaze change) or use workspace status commant (which does not).

Question 1: Can workspace status just replace BUILD_TIMESTAMP? If so, then that seems to be the way to do this. Question 2: Is there a need for workspace status to emit 2 or more distinct time values and have different rules pick which of those they want? I strongly doubt that. Question 3: How ofetn do we need anything beyond stamping with the current build time?

aiuto avatar Feb 05 '21 04:02 aiuto

Status update: #288 is ready to go

  • it implements the --stamp flag and stamp attribute behavior of Bazel's *_binary rules
  • It gets the time stamp from the BUILD_TIMESTAMP from the volatile file.

If we need to stamp with arbitrary times other than build time, the technique could

  • be expanded to allow a timestamp to come from the command line.
  • be expanded to allow a timestamp to come any user defined file.

aiuto avatar May 06 '21 15:05 aiuto

So most of the comments here are about using stamping to determine mtimes for the files within a package. What about support though to use values from the workspace status command as part of variable inputs to the packaging rules?

I'm mostly looking for a simple way to generate this version string for our debian package

git tag "v0.0.$(git show -s --format=%ct HEAD)-$(git rev-parse --short HEAD)"

Pretty easy to output this as part of the workspace status command. The challenge I'm having is getting that into a place the pkg_deb rule can use it. I'm fairly new to bazel but it seems like each rule has to implement support for reading the ctx.version_file and ctx.info_file directly.

I am currently using a custom rule to generate a version file and use that. While the version correctly gets injected into the control file of the debian package, it is not used as part of the package name itself. I'm not sure if this is just a bug or a specific choice.

tkinz27 avatar Jun 20 '22 18:06 tkinz27

It's a specific choice, but more on Bazel itself rather than rules_pkg. We need to predict file names during analysis, but we don't let you read the info file containing the variable until execution phase. To have this work in the easiest to understand way, we would need Bazel to be able to have the equivalent of a workspace status command that created variables that got put directly into a configuration fragment. This would be useful for a handful of people (doing packaging) and incredibly dangerous foot-gun where you could ruin performance by having everything rebuild all the time.

Take a look at the examples in: https://github.com/bazelbuild/rules_pkg/tree/main/examples/naming_package_files Don't just read the doc, look into the BUILD file and see some of the things there. Some things you might try:

  1. Don't use workspace status command. Instead, write a wrapper script around Bazel that does the work you want, then calls command with a command line flag to pass in the git version.
  2. My hunch is that you could turn the git-revision into an self configuring toolchain. The hash of the head would become a toolchain variable, and then you could pull it into your package name.

aiuto avatar Jun 20 '22 20:06 aiuto

Thanks for the feedback. Setting up a toolchain seems like an interesting idea. I did look at the naming_package_files but couldn't figure out how to generate the PackageVariablesInfo provider since rules cannot read files or (at least from what I can see) get the stdout of a command they run. From that example I didn't think to try to create a custom toolchain just to get the git revision, I will have to try that next.

One thing that was confusing was the version_file attr in pkg_deb not being used in the package name but is embedded in the debian control file. I'm not sure from your answer if you are saying that is also a specific choice or if you were just talking about not being able to use the variables from workspace status. I currently do have a custom genrule that generates that version file its just was really surprising when it was not used as part of the package name. I actually thought my genrule was broken and not being picked up but only after looking at the source for pkg_deb did I realize only the version attr was used in the package name.

Lastly, it seems like other bazel rules do seem to support utilizing the workspace status variables IFF the --stamp option is provided. For instance here is support in rules_python, rules_go, rules_nodejs, rules_docker. Any chance rules_pkg would be interested in similar support?

Again appreciate the feedback and explanation. It helps me understand bazel a little bit more.

tkinz27 avatar Jun 21 '22 16:06 tkinz27

The fundamental bazel behavior is that no file reads other than ..bzl and BUILD files can impact the analysis phase, and that all action input and output files are defined during the analysis phase. Thus it is fundamentally impossible to read a version file and have it change the name of an output file.

Lastly, it seems like other bazel rules do seem to support utilizing the workspace status variables IFF the --stamp option is provided.

I think they hit the same wall. You can't use the variables from the workspace file in an output file name.

aiuto avatar Jun 22 '22 02:06 aiuto

You can't use the variables from the workspace file in an output file name.

This is generally not a blocker. It's not hard to have a CI system do something like

bazel build //:my_output.tar //:my_dest.txt
cp bazel-bin/my_output.tar "${ARCHIVE_DIRECTORY}/$(<bazel-bin/my_dest.txt)"

where the build for my_dest.txt does take the stamp information into account.

One case that is problematic for this is if you want the name of a file within an archive to depend on stamp information. We have the option for a file providing a prefix for every file in the archive, but it would be useful to have the option to e.g. provide a json file (which could be built by a thing that uses stamp information) with a set of renames, rather than the current situation which is that renames need to all be specified in the BUILD file.

adam-azarchs avatar Oct 24 '23 01:10 adam-azarchs

You can make the renames in a .bzl file that you generate before the bazel build, then load that from BUILD files where you want to rename things.

aiuto avatar Oct 25 '23 13:10 aiuto

Are you suggesting running a bazel build that modifies the source tree to generate a bzl file using stamping information, and then running a second build that uses that? That sounds error prone and difficult to enforce that the generator step is in sync with what the rest of the build is seeing (even worse if the bzl file was being generated by a tool that wasn't part of the bazel build configuration) as well as being slow and inconvenient compared to a solution that only requires a single bazel invocation. It also is problematic because we're talking about a map between bazel File targets and the names we want those files to have in the archive. We don't necessarily have an easy way to know the list of files that need to be renamed without first doing a build, or else we end up with a maintenance headache of needing to collect the list of transitively included files in more than one place.

Fixing this more cleanly wouldn't be terribly difficult; just allow a user to supply an executable target which can modify the manifest file before it's handed to the tarball packaging tool.

adam-azarchs avatar Oct 26 '23 00:10 adam-azarchs

No. I meant that a --workspace_status_command script would generate status data and a .bzl file (in your source tree) that can be loaded by the BUILD file. Then you turn that into PackageVariables which can be injected as needed.

That does not completely help you, but I can easily imagine extending pkg_files so you can take inputs and regex rename them to new things, using the variable read from the .bzl file.

aiuto avatar Oct 26 '23 09:10 aiuto

That sounds rather dangerous to me, as I'm not aware of any documented contract that says that the workspace status command will always be evaluated before the loading phase, or in fact that it would be evaluated at all if no other build actions requested its output.

adam-azarchs avatar Oct 30 '23 03:10 adam-azarchs

Sure. But what other choices do you have? Bazel doesn't make the time stamp accessible directly to Starlark at analysis time, so I can't get it into the file name. The same for the git hash, or workspace name, or any variety of transient values.

aiuto avatar Oct 31 '23 02:10 aiuto

Your statement of the problem makes the solution clear - starlark does not have access to the required information, which means that the renames can't be done in starlark. But that's ok - starlark produces a manifest file that's consumed by the tarball builder action; there's no reason another build action couldn't be interposed between the starlark and the tarball builder, to either rewrite the starlark-generated manifest file or supply a separate renames file that the tarball builder would be able to merge with the one written out by the starlark. Unlike starlark, that action can depend on the workspace status files.

That is, something like

def _pkg_tar_impl(ctx):
    ...
    write_manifest(ctx, manifest_file, content_map)
    if ctx.executables.manifest_transformer:
        modified_manifest = ctx.actions.declare_file(ctx.label.name + ".mod.manifest")
        ctx.actions.run(
            outputs = [modifed_manifest],
            inputs = [manifest_file, ctx.version_file, ctx.info_file],
            executable = ctx.executables.manifest_transformer,
            arguments = [manifest_file.path, ctx.version_file.path, ctx.info_file.path, modified_manifest.path],
        )
        manifest_file = modifed_manifest
    ...

adam-azarchs avatar Oct 31 '23 04:10 adam-azarchs

OK. I see how what you are proposing could be possible. That would solve the case of changing file names to include time stamps or commit, or whatever you want to write into the info file with the workspace status command.

aiuto avatar Nov 03 '23 01:11 aiuto