mill icon indicating copy to clipboard operation
mill copied to clipboard

Dynamically map known root paths in cache files

Open lefou opened this issue 1 month ago • 8 comments

Another attempt to make Mill cache storage relocate-able.

The idea is to replace absolute paths with paths relative to pre-defined known root paths.

Fix parts of https://github.com/com-lihaoyi/mill/issues/3660

API

Path mappings are managed via the new MappedRoots object. Mappings use a thread local DynamicVariable and various convenience methods to easily adapt the mapping.

An explicit API to change the mapping dynamically is necessary, as the underlying JSON-serialization mechanism can be used for other purposes than Mill caching, like RPC protocols, where a path mapping is not desirable.

Default Mapping

These are the default location which are mapped by Mill:

  • $WORKSPACE - the project directory (where the top-level build.mill can be found. (BuildCtx.workspaceRoot)
  • $MILL_OUT - the Mill output directory used for caches and build-results
  • $HOME - the user home directory (os.home)

Mappings are automatically used in the upickle-based JSON serializers for mill.api.PathRef and os.Path.

More paths come to mind:

  • $CACHE_COURSIER - the coursier cache directory
  • $CACHE_IVY - the ivy cache directory
  • $CACHE_MAVEN - the Maven cache directory

Those are currently not directly accessible in the code base for all the places where we need to be able to read and write JSON cache files. Also, their typical storage locations are already relative to $HOME (e.g. $CACHE_COURSIER vs $HOME/.cache/coursier), hence I left them out for this PR.

We can later add them by adapting Mapping.withMillDefaults.

Design Choices

One important design choice I made, is to go with a dynamic set of known roots. This means, without any registered mapping, PathRef/os.Path JSON-serialization will behave exactly as before. If there are registered path root mappings at serialization time, these mappings will be used. While de-serialization of a mapped path, a mapping for the used key is required, otherwise the de-serialization will fail with an runtime exception.

While it is not ideal to have potential runtime failures, an alternative design with a static mapping, which should behave more deterministically, hasn't worked out well. This is mostly due to the fact that JSON serialization is a universal concept which is, even in Mill, used for various use cases, and not uniquely for Mill cache persistence. While in theory, it should be controllable via the given/implicit scopes, practically we need to define all virtual paths mapping anywhere in the code base, which isn't only impractical, but also sometimes impossible. Effectively it would require us to use dedicated data types for when we need path mappings (e.g. in Mill caches) and when not (e.g. when defining RPC protocols). See PR https://github.com/com-lihaoyi/mill/pull/6021 where I tried to have a fixed mapping, but had a lot of issues.

PathRef hashCode changes

Due to the way the PathRef.hashCode is used in Mill, esp. in the Evaluator when deciding whether a task needs re-computation, it was important to keep the Java hash code stable and consistent. PathRef.hashCode now uses a mapped path (as configured at construction time) instead of the unmapped, absolute path. Otherwise, it causes constant re-evaluations of already cached tasks. It should be possible to "construct" cases, in which this leads to more hash collisions, but those cases are very unlikely, unrealistic and should still be unproblematic, since a path-change without a sig-change still means, we refer to the same path structure and content.

Other changes

Some refactoring was necessary to streamline the configuration flow of the $MILL_OUT path. Before, the out path was constructed in many different locations, sometimes with slightly different logic. This is of course a minefield, since we have many entry points (Mill CLI, Mill Daemon, Mill BSP, RPC worker processes, Test Suites, Integration tests, ...) where those where set up. So this PR relies heavily on our programmatic test coverage.

Other effects

The show command can be used to show the result of a (cached) task. It now no longer shows local absolute paths for locations covered by a mapped root (which should be the most occurrences) but the paths with placeholders instead. This is IMHO less helpful, but not so easy to fix correctly.

lefou avatar Oct 27 '25 10:10 lefou

I'm down to a single failing test case for which I don't have an explanation or fix for:

> mill integration.feature[output-directory].packaged.daemon.testForked
...
[7778-0] + mill.integration.OutputDirectoryTests.Output directory elsewhere in workspace 24540ms  
[7778-0] Copying integration test sources from /home/lefou/work/opensource/mill/integration/feature/output-directory/resources to /home/lefou/work/opensource/mill/out/integration/feature/output-directory/packaged/daemon/testForked.dest/worker-0/sandbox/run-3
[7778-0] X mill.integration.OutputDirectoryTests.Output directory outside workspace 6137ms 
[7778-0]   java.lang.AssertionError: assertion failed: ==> assertion failed: false != true
[7778-0]     scala.runtime.Scala3RunTime$.assertFailed(Scala3RunTime.scala:8)
[7778-0]     utest.asserts.Asserts$ArrowAssert.$eq$eq$greater(Asserts.scala:90)
[7778-0]     mill.integration.OutputDirectoryTests$.tests$$anonfun$1$$anonfun$3$$anonfun$1(OutputDirectoryTests.scala:35)
[7778-0]     scala.runtime.function.JProcedure1.apply(JProcedure1.java:15)
[7778-0]     scala.runtime.function.JProcedure1.apply(JProcedure1.java:10)
[7778-0]     mill.testkit.IntegrationTestSuite.integrationTest$$anonfun$1(IntegrationTestSuite.scala:60)
[7778-0]     mill.util.Retry.apply$$anonfun$1(Retry.scala:38)
[7778-0]     mill.util.Retry.apply$$anonfun$adapted$1(Retry.scala:38)
[7778-0]     mill.util.Retry.$anonfun$1$$anonfun$1(Retry.scala:49)
[7778-0]     scala.util.Try$.apply(Try.scala:217)
[7778-0]     mill.util.Retry.$anonfun$1(Retry.scala:49)
[7778-0]     java.lang.Thread.run(Thread.java:1583)
[7778-0] Tests: 3, Passed: 2, Failed: 1
...
1 tasks failed
[7778] integration.feature[output-directory].packaged.daemon.testForked 1 tests failed: 
  mill.integration.OutputDirectoryTests mill.integration.OutputDirectoryTests.Output directory outside workspace

lefou avatar Oct 28 '25 07:10 lefou

Fixed last test, which was caused by an accidental code paste

lefou avatar Oct 28 '25 09:10 lefou

This is good to merge. The next step is probably to add more external cache dirs like coursier cache. Usage of MappedRoots.withMillDefaults API is a good indicator, where we need to initialize these values. Maybe @alexarchambault does have an opinion about the best strategy to provide the value here? /cc @lihaoyi

lefou avatar Oct 29 '25 07:10 lefou

The approach taken by https://github.com/com-lihaoyi/mill/pull/4642 is to instrument os.Path#{toString,hashCode,etc.} to customize the rendering, along with some strategic symlinks to make those relative paths valid when passed to subprocesses. That approach solve the issue where paths get passed around embedded in other data structures, e.g. in scalacOptions or similar. IMO that would be a bit more robust than what you're doing here which only works for paths directly returned as PathRefs from a task

lihaoyi avatar Oct 29 '25 08:10 lihaoyi

This PR is explicitly just about paths where path is an explicit type. All other cases are paths in strings, which could be solved by some optimistically find-and-replace mechanism, which I personally only see as a last resort approach. Another way could be to make config task use more structured formats than just Seq[String], which could then be based on a well-defined type-based mapping. But I considered it out of scope of this PR. (We should / I will elaborate those ideas in #3660.)

I must admit, I overlooked PR #4642 when I started this one. (Is it even linked?) But I don't like the sym-link approach. Also, one takeaway of this PR`s predecessor (#6021) is, that for communication with other processes and when launching tools we always want to use absolute paths (as we do currently). The whole caching abstraction of Mill happens in Mill, so the conversion between Mill-mapped-paths and absolute paths need to happen in the tasks (and workers). And in Mill, we have the power and the knowledge for translating between both systems.

lefou avatar Oct 29 '25 09:10 lefou

The symlink approach is what is used by Bazel for the same purpose. It's not simple by any means, but the question is if there's anything we can do that's better. An optimistic find-and-replace definitely doesn't seem sufficient, e.g. you cannot find-and-replace paths inside the binary compile.dest/zinc file.

I think the litmus test for this is really what was stated in the original ticket: Can we compile an example project in two separate folders, diff the out/ folders, and find they are identical? This PR solves some of the inconsistencies, but I'd be reluctant to merge it up front since (a) this doesn't give us remote caching for typical projects, e.g. the 5-webapp-scalajs-shared mentioned in the original ticket, where things like scalacOptions and others zinc are required and (b) we don't know if this approach will allow us to eventually solve the other problems, or whether we'll hit a wall and find ourselves needing to rip it out later and re-implement some other approach.

lihaoyi avatar Oct 29 '25 09:10 lihaoyi

As a follow-up to this PR, I propose to use new Args class to encode configuration args that may hold mappable paths.

  • The proposal: https://github.com/com-lihaoyi/mill/discussions/6057
  • A gist of the proposed API: https://gist.github.com/lefou/4ab9e44c119559ff5e31e1e761d552a4

A Mill version with this PR and the Args API applied can properly encode paths, as can be seen in the output of the linked examples. Here is a smaller excerpt:

package build

import mill.*
import mill.api.*

object `package` extends Module {

  def javacOptions: Task.Simple[Args] = Args(
      // single arg
      Arg("-deprecation"),
      // implicit single args
      "-verbose",
      // two args as group
      ArgGroup("--release", "17"),
      // two args as tuple
      ("--release", "17"),
      // an option including a file via ArgParts
      Arg("-Xplugin=", plugin().path),
      // an option including a file via arg string interpolator
      arg"-Xplugin:${plugin().path}",
      // some files
      sources().map(_.path),
      // some files as ArgGroup
      ArgGroup(sources2().map(_.path)*),
      // Mixed ArgGroup
      ArgGroup("--extra", arg"-Xplugin=${plugin().path}", sources().map(_.path))
    )

  def useOptions() = Task.Command {
    val opts = javacOptions()
    println(s"options: ${opts}")
  }

}
> dev-mill useOptions
[5/5] useOptions
[5] options: Args((-deprecation), (-verbose), (--release, 17), (--release, 17), (-Xplugin=/home/lefou/work/opensource/mill-experiment-options/lib/plugin.jar), (-Xplugin:/home/lefou/work/opensource/mill-experiment-options/lib/plugin.jar), (/home/lefou/work/opensource/mill-experiment-options/src/File1.java), (/home/lefou/work/opensource/mill-experiment-options/src/File2.java), (/home/lefou/work/opensource/mill-experiment-options/src/File3.java, /home/lefou/work/opensource/mill-experiment-options/src/File4.java), (--extra, -Xplugin=/home/lefou/work/opensource/mill-experiment-options/lib/plugin.jar, /home/lefou/work/opensource/mill-experiment-options/src/File1.java, /home/lefou/work/opensource/mill-experiment-options/src/File2.java))
[5] build.mill:121 upickle.write(opts): "{\"value\":[{\"value\":[[[\"-deprecation\",null]]]},{\"value\":[[[\"-verbose\",null]]]},{\"value\":[[[\"--release\",null]],[[\"17\",null]]]},{\"value\":[[[\"--release\",null]],[[\"17\",null]]]},{\"value\":[[[\"-Xplugin=\",null],[null,\"$WORKSPACE/lib/plugin.jar\"]]]},{\"value\":[[[\"-Xplugin:\",null],[null,\"$WORKSPACE/lib/plugin.jar\"]]]},{\"value\":[[[null,\"$WORKSPACE/src/File1.java\"]]]},{\"value\":[[[null,\"$WORKSPACE/src/File2.java\"]]]},{\"value\":[[[null,\"$WORKSPACE/src/File3.java\"]],[[null,\"$WORKSPACE/src/File4.java\"]]]},{\"value\":[[[\"--extra\",null]],[[\"-Xplugin=\",null],[null,\"$WORKSPACE/lib/plugin.jar\"]],[[null,\"$WORKSPACE/src/File1.java\"]],[[null,\"$WORKSPACE/src/File2.java\"]]]}]}"

lefou avatar Nov 02 '25 15:11 lefou

CI failures for mill.contrib.docker.DockerModuleTest.docker seem unrelated. Maybe, the used image in the test is no longer available from hub?

lefou avatar Nov 04 '25 03:11 lefou