Make `out/` folder contents (more) reproducible and filesystem layout agnostic

Open lihaoyi opened this issue 1 year ago • 11 comments

The goal of this ticket is to make the out/ folder contents more reproducible, such that it contains the same bytes and hashes regardless of the user's filesystem layout outside of that folder. This would allow re-using the out/ folder as a build cache between different machines that may have the checkout in different places (e.g. /Users/alice/my-repository vs /Users/charlie/my-repository), both coarse grained (e.g. by sending over a zip file) and fine grained (via the bazel remote cache protocol).

The main thing that needs to happen is that every os.Path and mill.api.PathRef that is serialized within a "known" directory needs to be normalized to a path relative to an abstract reference to that known directory. e.g.

  • /Users/alice/my-repository/out/foo/bar.dest/qux should be serialized as $WORKSPACE/out/foo/bar.dest/qux
  • /Users/lihaoyi/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.13.14/scala-library-2.13.14.jar should be serialized as $COURSIER_CACHE/v1/https/repo1.maven.org/maven2/org/scala-lang/scala-library/2.13.14/scala-library-2.13.14.jar
  • /Users/alice/thing-outside-repository should be serialized as $HOME/thing-outside-repository
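The mapping in the examples above could be sketched roughly as follows. This is only an illustration built on java.nio.file, not Mill's actual serialization code: `PathNormalizer`, `normalize`, and the `roots` parameter are hypothetical names. The sketch assumes the most specific (longest) matching root should win, since the workspace and the Coursier cache may themselves live under the home directory.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper, not part of Mill's API: maps absolute paths to
// strings that are relative to a named, abstract root.
object PathNormalizer {
  // Rewrites `path` as "$NAME/relative/part" using the longest matching root.
  // Paths outside every known root are returned unchanged.
  def normalize(path: Path, roots: Seq[(String, Path)]): String = {
    val abs = path.toAbsolutePath.normalize()
    roots
      .map { case (name, root) => (name, root.toAbsolutePath.normalize()) }
      .filter { case (_, root) => abs.startsWith(root) }
      // Prefer the deepest root, e.g. $WORKSPACE over $HOME
      .sortBy { case (_, root) => -root.getNameCount }
      .headOption
      .map { case (name, root) =>
        val rel = root.relativize(abs).toString.replace('\\', '/')
        if (rel.isEmpty) "$" + name else "$" + name + "/" + rel
      }
      .getOrElse(abs.toString)
  }
}
```

With roots `HOME -> /Users/alice` and `WORKSPACE -> /Users/alice/my-repository`, a path inside the checkout is reported against `$WORKSPACE` and a sibling of the checkout against `$HOME`.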

AFAIK the necessary known roots should all be available globally (e.g. mill.api.workspace.WorkspaceRoot.workspaceRoot, os.home, sys.env("COURSIER_CACHE")). It should be easy enough to add to the serialization logic:

  • mill.api.PathRef serialization https://github.com/com-lihaoyi/mill/blob/e0a2c93bfbc7bf68a71ddbc0c52afbb14e73e6f2/main/api/src/mill/api/PathRef.scala#L175-L197
  • os.Path serialization https://github.com/com-lihaoyi/mill/blob/e0a2c93bfbc7bf68a71ddbc0c52afbb14e73e6f2/main/api/src/mill/api/JsonFormatters.scala#L27-L31

Apart from PathRef and Path, we will also need to deal with:

  • Files in out/ which are naturally non-deterministic: mill-profile.json, mill-chrome-profile.json, mill-server/* and mill-no-server/*, etc.

  • Modified times are also expected to vary. These may need to be zeroed out in the process of making zip and jar files such that they do not affect the byte contents, and ignored as part of any equivalence comparison

  • Any foo.json files belonging to workers can also be expected to differ since they contain the toString of the worker, and may need to be renamed to foo.worker.json or similar to make them identifiable.

  • There will also be inherent differences between files generated on different platforms (e.g. native binaries). This is fine for now, and likely unavoidable.

  • There may be other files that need to be made reproducible that are not listed here
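For the modified-time point in the list above, a minimal sketch of writing a zip whose bytes do not depend on file mtimes or entry order might look like this. It uses only java.util.zip and is not how Mill currently builds its jars; `DeterministicZip` and the fixed timestamp are assumptions made for illustration. `setTimeLocal` (Java 9+) is used so that the stored DOS timestamp does not depend on the machine's time zone.

```scala
import java.io.ByteArrayOutputStream
import java.time.LocalDateTime
import java.util.zip.{ZipEntry, ZipOutputStream}

// Hypothetical helper: builds a zip archive with stable entry order and timestamps.
object DeterministicZip {
  // Arbitrary fixed timestamp (an assumption); any constant inside the DOS range works.
  private val FixedTime = LocalDateTime.of(1980, 2, 1, 0, 0, 0)

  def zip(files: Map[String, Array[Byte]]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ZipOutputStream(bytes)
    try {
      // Sort entries by name so that iteration order cannot affect the output
      for ((name, content) <- files.toSeq.sortBy(_._1)) {
        val entry = new ZipEntry(name)
        entry.setTimeLocal(FixedTime) // replaces the default "now" timestamp
        out.putNextEntry(entry)
        out.write(content)
        out.closeEntry()
      }
    } finally out.close()
    bytes.toByteArray
  }
}
```

Two calls with the same contents should then yield identical bytes on the same JDK; compressed output can still differ between JDK or zlib versions, which this sketch does not address.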

The success criteria would be a test in integration/feature/ that:

  • Copies the code in example/scalalib/web/5-webapp-scalajs-shared into two separate subfolders.
    • The choice of example/scalalib/web/5-webapp-scalajs-shared is somewhat arbitrary, but should give us good coverage of a variety of Mill module and task types, exercising a wide range of code paths
  • Runs ./mill runBackground && ./mill clean runBackground && ./mill jar && ./mill assembly in each folder
    • (one with a custom COURSIER_CACHE and -Duser.home passed in),
  • Does a file-by-file and byte-for-byte comparison of the two out/ folders with some normalization criteria (ignoring the expected-to-differ files and ignoring mtimes), to assert that the out/ folders are byte-for-byte identical
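The comparison step could be sketched along these lines. This is a plain java.nio illustration rather than code from Mill's integration test suite: `OutDirCompare`, `snapshot`, `differences`, and the `ignored` predicate are hypothetical names. Only file contents are read, so mtimes are ignored implicitly.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical helper for comparing two out/ folders file by file.
object OutDirCompare {
  // Maps each non-ignored regular file (by relative path) to its bytes.
  def snapshot(root: Path, ignored: String => Boolean): Map[String, Seq[Byte]] = {
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .map(p => root.relativize(p).toString.replace('\\', '/') -> p)
        .filterNot { case (rel, _) => ignored(rel) }
        .map { case (rel, p) => rel -> Files.readAllBytes(p).toSeq }
        .toMap
    } finally stream.close()
  }

  // Relative paths that are missing on one side or whose bytes differ.
  def differences(a: Path, b: Path, ignored: String => Boolean): Seq[String] = {
    val (sa, sb) = (snapshot(a, ignored), snapshot(b, ignored))
    (sa.keySet ++ sb.keySet).toSeq.sorted.filter(k => sa.get(k) != sb.get(k))
  }
}
```

A test would then assert that `differences(outA, outB, ignored)` is empty, with `ignored` matching files such as mill-profile.json. Loading every file into memory is acceptable for a sketch; a real test would likely compare hashes or stream the contents.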

Related issues with prior discussion:

  • https://github.com/com-lihaoyi/mill/issues/2101
  • https://github.com/com-lihaoyi/mill/issues/2153
  • https://github.com/com-lihaoyi/mill/issues/1886

lihaoyi avatar Oct 04 '24 03:10 lihaoyi

Trying this issue, it looks interesting...

rahat2134 avatar Oct 13 '24 04:10 rahat2134

@rahat2134 go for it! Feel free to ask here if you have any questions

lihaoyi avatar Oct 13 '24 07:10 lihaoyi

Does reproducibility have to be a property? What if it is defined as a transformation?

A filesystem agnostic image can be created by replacing (string) values with "known env vars" in task output file copies. At the point of ingestion, a reverse substitution would be applied to recreate the files for a different environment.
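A very rough sketch of that substitution idea, assuming plain string replacement over file contents, might look like the following. `PlaceholderSubstitution`, `abstractPaths`, and `concretize` are hypothetical names, and naive substring matching like this can produce false positives (for example, one root being a textual prefix of an unrelated path), so it is an illustration only.

```scala
// Hypothetical helper: swaps absolute root paths for "$NAME" placeholders and back.
object PlaceholderSubstitution {
  // Replaces each root path with its placeholder, longest root first so that
  // nested roots (workspace inside home) are matched before their parents.
  def abstractPaths(text: String, roots: Seq[(String, String)]): String =
    roots.sortBy { case (_, root) => -root.length }
      .foldLeft(text) { case (acc, (name, root)) => acc.replace(root, "$" + name) }

  // Reverse step at ingestion: replaces placeholders with this machine's roots,
  // longest name first so "$HOME" cannot clobber a longer placeholder.
  def concretize(text: String, roots: Seq[(String, String)]): String =
    roots.sortBy { case (name, _) => -name.length }
      .foldLeft(text) { case (acc, (name, root)) => acc.replace("$" + name, root) }
}
```

Output abstracted on one machine with its roots could then be concretized on another machine with that machine's roots.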

ajaychandran avatar Mar 16 '25 14:03 ajaychandran

I would be interested in working on this if the issue is still open?

Edit: Upon further reading, I am requesting a lock on the bounty for 1 week. If I don't have anything mostly done by then, you can free it. Seeing as how it's from October, I don't think that is an unreasonable amount of time. Thank you.

Update 2:

I have studied the project for the 5 days since, and have a working prototype. In 2 days I should hopefully have a somewhat working prototype, and the time after will be left for writing tests :)

Update 3:

https://github.com/albassort/mill

I predict it will mostly be done by Wednesday, and maybe the test will be done on Wednesday too

albassort avatar Mar 26 '25 23:03 albassort

Our large Scala projects work around the reproducibility issues with this hacky tool: https://github.com/Avimitin/mill-ivy-fetcher, which finally made our mill-based solutions reproducible. It has been fully integrated into chipsalliance/t1.

I think we may help the out dir reproducibility, and wanna hear more suggestions from haoyi.

sequencer avatar Mar 27 '25 04:03 sequencer

https://github.com/com-lihaoyi/mill/pull/4870

PR Submitted

albassort avatar Apr 05 '25 08:04 albassort

@lihaoyi can I work on this issue ?

rishi-jat avatar Oct 23 '25 11:10 rishi-jat

@rishi-jat I think this ticket is probably too difficult to be a viable bounty, as various attempts in the past have never worked out. I'll remove the bounty for now so people don't waste their time

lihaoyi avatar Oct 23 '25 11:10 lihaoyi

I tackled the path-mapping for PathRef and os.Path for some selected roots: $WORKSPACE, $HOME and $MILL_OUT in PR https://github.com/com-lihaoyi/mill/pull/6031

This does not fix the issue that other tasks may contain absolute paths too, e.g. scalacOptions, but it's a step forward in making the out dir relocatable.

lefou avatar Oct 28 '25 07:10 lefou

I wrote up a small proposal to introduce a new Args type as a replacement for Seq[String] in tasks like javacOptions.

  • https://github.com/com-lihaoyi/mill/discussions/6057

lefou avatar Oct 30 '25 13:10 lefou

Discovered another source of non-relocatability:

Generated code by MillBuildRootModule.generatedScriptSources contains absolute paths. https://github.com/com-lihaoyi/mill/blob/0576e43a86577f886f8f70e49e519074a17af1f5/runner/meta/src/mill/meta/CodeGen.scala#L470-L472

Just making them relative isn't enough, since a relative output path (output0) such as ./out2 would also change the task cache hash, even though it should not.

Instead, those should be read or given at runtime.

lefou avatar Nov 26 '25 14:11 lefou