
Fine-grained selective testing at a class-level granularity (1500USD Bounty)

Open lihaoyi opened this issue 1 year ago • 9 comments


From the maintainer Li Haoyi: I'm putting a 1500USD bounty on this issue, payable by bank transfer on a merged PR implementing this.

See https://github.com/orgs/com-lihaoyi/discussions/6 for other bounties and the terms and conditions that bounties operate under


https://github.com/com-lihaoyi/mill/pull/4091 implements selective execution and testing at a granularity of mill build modules, but we should be able to do better.

There are two main implementation strategies we could use:

Method Code Hash Signature Bytecode Analysis

  • We already have the ability to compute code signatures for JVM bytecode and do basic callgraph analysis, which we currently use for selectively invalidating targets in the build.mill file.

  • We should be able to apply the same bytecode analysis to classes and methods in the JavaModule and ScalaModule compile output, automatically determining which test classes within a module could potentially be affected by a code change and would need to be executed to validate it

This would provide much finer-grained selective testing than the initial module-level implementation, and help greatly in messy real-world projects where the module structure is not as fine-grained as it could be. It should also work for Java and Kotlin-JVM modules, since the analysis happens on JVM bytecode, which is universal across JVM languages.

The implementation strategy would be:

  • Add a new task JavaModule#methodCodeHashSignatures, that runs the same mill.core.codesig logic we currently run on MillBuildRootModule, but makes the signatures available for any JavaModule.

  • Add a new command TestModule#testQuick, that compares the previous and current methodCodeHashSignatures (maybe using a persistent task to store the old value), and selects the test classes to run based on those that have a changed method in methodCodeHashSignatures (a rough sketch of this diffing step follows this list)

  • The first time testQuick is run it would run all the tests, but if you change code, subsequent runs of testQuick would only run test classes that either (1) failed previously or (2) were affected by the code changes, via the hash signature computation above
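As a rough illustration of the diffing step (not the actual Mill API), here is a minimal sketch, assuming methodCodeHashSignatures produces a Map from method signature strings such as "pkg.Foo#bar()int" to hash values; the helper names and the naive class-name extraction are illustrative assumptions only:

// Minimal sketch, not the real Mill API: `sigsOld`/`sigsNew` stand in for the
// previous and current `methodCodeHashSignatures` outputs, assumed to be
// Map[String, Int] keyed by signatures like "pkg.Foo#bar()int".
object TestQuickSketch {
  /** Classes owning a method that was added, removed, or whose hash changed. */
  def changedClasses(sigsOld: Map[String, Int], sigsNew: Map[String, Int]): Set[String] = {
    val changedSigs = (sigsOld.keySet ++ sigsNew.keySet)
      .filter(sig => sigsOld.get(sig) != sigsNew.get(sig))
    // Naive extraction: treat everything before the '#' separator as the class name
    changedSigs.map(_.takeWhile(_ != '#'))
  }

  /** Test classes to run: previously failed ones plus those touched by the change.
    * A real implementation would first expand `changed` through the callgraph. */
  def testsToRun(
      changed: Set[String],
      previouslyFailed: Set[String],
      allTestClasses: Set[String]
  ): Set[String] =
    previouslyFailed ++ allTestClasses.intersect(changed)
}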

Zinc Incremental Compilation Dependency Graph

  • We already have the analysis files that the Zinc incremental compiler generates, which contain file-by-file dependency information for the local codebase. We can use this dependency graph similarly to the above, to find the test classes that were affected by any code change.

lihaoyi avatar Dec 11 '24 15:12 lihaoyi

I'm currently trying to implement the first approach. The implementation is basically what you've highlighted: create methodCodeHashSignatures for JavaModule, and move on from there. I made some adjustments to the call graph analysis to let it split out only the changed classes; this data will be used to select which tests should run. One thing I've noticed while testing: I updated TestRunnerTestUtils.testrunnerWorkStealing by changing enableParallelism from true to false. The spanningInvalidationTree.json was as follows:

{
  "def mill.scalalib.TestRunnerTestUtils$testrunnerWorkStealing$#enableParallelism()boolean": {
    "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule#enableParallelism()boolean": {
      "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$utest$#testParallelism$$anonfun$1$$anonfun$1(scala.collection.immutable.Seq,mill.api.Ctx)mill.api.Result": {
        "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$utest$!testParallelism$$anonfun$1$$anonfun$1(scala.collection.immutable.Seq,mill.api.Ctx)mill.api.Result": {
          "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$utest$.$deserializeLambda$(java.lang.invoke.SerializedLambda)java.lang.Object": {},
          "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$utest$#testParallelism$$anonfun$1()mill.define.Target": {
            "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$utest$!testParallelism$$anonfun$1()mill.define.Target": {
              "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$utest$#testParallelism()mill.define.Target": {}
            }
          }
        }
      },
      "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage#testParallelism$$anonfun$3$$anonfun$1(scala.collection.immutable.Seq,mill.api.Ctx)mill.api.Result": {
        "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage!testParallelism$$anonfun$3$$anonfun$1(scala.collection.immutable.Seq,mill.api.Ctx)mill.api.Result": {
          "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage.$deserializeLambda$(java.lang.invoke.SerializedLambda)java.lang.Object": {},
          "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage#testParallelism$$anonfun$3()mill.define.Target": {
            "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage!testParallelism$$anonfun$3()mill.define.Target": {
              "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage#testParallelism()mill.define.Target": {
                "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage!testParallelism()mill.define.Target": {
                  "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage.testParallelism$(mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage)mill.define.Target": {
                    "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage.testParallelism$(mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$DoneMessage)mill.define.Target": {
                      "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$doneMessageFailure$#testParallelism()mill.define.Target": {},
                      "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$doneMessageSuccess$#testParallelism()mill.define.Target": {},
                      "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$doneMessageNull$#testParallelism()mill.define.Target": {}
                    }
                  }
                }
              }
            }
          }
        }
      },
      "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$ziotest$#testParallelism$$anonfun$4$$anonfun$1(scala.collection.immutable.Seq,mill.api.Ctx)mill.api.Result": {
        "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$ziotest$!testParallelism$$anonfun$4$$anonfun$1(scala.collection.immutable.Seq,mill.api.Ctx)mill.api.Result": {
          "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$ziotest$#testParallelism$$anonfun$4()mill.define.Target": {
            "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$ziotest$!testParallelism$$anonfun$4()mill.define.Target": {
              "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$ziotest$#testParallelism()mill.define.Target": {}
            }
          },
          "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$ziotest$.$deserializeLambda$(java.lang.invoke.SerializedLambda)java.lang.Object": {}
        }
      },
      "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$scalatest$#testParallelism$$anonfun$2$$anonfun$1(scala.collection.immutable.Seq,mill.api.Ctx)mill.api.Result": {
        "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$scalatest$!testParallelism$$anonfun$2$$anonfun$1(scala.collection.immutable.Seq,mill.api.Ctx)mill.api.Result": {
          "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$scalatest$#testParallelism$$anonfun$2()mill.define.Target": {
            "call mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$scalatest$!testParallelism$$anonfun$2()mill.define.Target": {
              "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$scalatest$#testParallelism()mill.define.Target": {}
            }
          },
          "def mill.scalalib.TestRunnerTestUtils$TestRunnerTestModule$scalatest$.$deserializeLambda$(java.lang.invoke.SerializedLambda)java.lang.Object": {}
        }
      }
    }
  }
}

As you can see, the call graph only reached the TestRunnerTestModule-related classes. It did not reach TestOnlyTester.testOnly0, which is the place that uses testrunnerWorkStealing, and because of that, all tests that use TestOnlyTester (TestRunnerScalatestTests, TestRunnerUtestTests) were not reached.

Is the call graph analysis working as intended in this case? If so, we'll miss quite a few test classes.

It seems like it worked as intended, because in the end the change is captured by a Task, which we skip during graph traversal.

HollandDM avatar Mar 14 '25 05:03 HollandDM

@HollandDM Just bringing the conversation back here for topics not specific to either of the two PRs.

There's a lot of space to explore in this ticket, but I do not want an open ended investigation of callgraph analysis algorithms to be required for completion, so let's go with the following completion criteria:

  1. Evaluate selective testing using three options:
    1. Bytecode callgraph analysis
    2. Zinc dependency-graph analysis
    3. Both of the above (i.e. only select a test class if it is invalidated in both bytecode callgraph analysis and zinc dependency-graph analysis)
  2. For each of these options, I would like to see:
    1. What cases each option is unable to correctly handle, either selecting too few test classes or selecting too many
    2. What cases each option can handle that other options cannot handle
    3. What kind of performance overhead each option imposes, e.g. how much overhead (in % or milliseconds) performing the analysis on Mill's own scalalib.test suite adds
  3. From this evaluation, we will then pick one of the options to implement, and flesh it out with:
    1. Unit tests
    2. Integration tests under integration/invalidation/
    3. A minimal example test to demonstrate the functionality and show it to users.

So the outcome is we'll have a thorough evaluation and understanding of the three implementation options that are available now, pick one to flesh out, and merge it. And we can leave more experimental discussion and exploration of improved analysis algorithms to future work.

lihaoyi avatar Mar 27 '25 13:03 lihaoyi

@lihaoyi I'll try to come up with a write-up about these options over the next few days in this comment:

Zinc dependency-graph analysis

This design uses Zinc's output to build the list of invalidated test classes.

What we get

From Zinc, we can extract the Analysis object from the file compile.dest/zinc and parse it using the existing technique in ZincWorkerImpl. The Analysis gives us:

  1. Source file <-> classes (products) mappings: the "classes" here are not the compiled output .class files, but the information Zinc gets from running an instrumented Scala compiler. For example, with this source file:
// in file Person.scala
package test

final case class Person(name: Name, age: Age)

object Person {
   final case class Age(age: Int)
   final case class Name(name: String)
}

we'll have these classes: test.Person, test.Person$, test.Person.Age, test.Person.Name. For Java, Zinc will run the default javac first, then try to analyze the output .class files. Because of how Java code is structured, one source file normally only has one class with a similar name. It can have more classes when the source file contains local (nested) classes / anonymous classes / hidden classes (checked using Class.getCanonicalName()).

  2. Relations between classes: Zinc also exposes some generic relations between those classes, e.g. class A depends on class B internally/externally, and vice versa. With these relations, we can span out from an initial set of invalidated classes to decide which classes should be re-tested.

  3. Source file <-> class file mappings: Zinc also has a mapping between source files and the output ".class" files.

  4. Source/class file stamps: these file stamps are normally the file's content hash, used by Zinc internally to decide which classes should be recompiled.

Unfortunately, Zinc does not have a mapping between the classes (test.Person, test.Person$, test.Person.Age, test.Person.Name above) and the output .class files.

Algorithm

Using the information from Zinc, we can use the following algorithm to identify test classes that need to be re-run between compile commands (a sketch follows the list):

  1. Get all the changed source files; we can do this by comparing the file stamps of the previous run to those of the current run.

  2. Get all the classes from the changed source files, then span out from them to get all invalidated classes.

  3. From the invalidated classes, get back all the source files and mark them as invalidated source files.

  4. From the invalidated source files, get all the output .class files. These are the .class files that will be considered for a re-test.

  5. Because the test command takes Java class names as input, we have to derive them from the .class files above. One solution is to follow Zinc by copying its class parser, parsing all the .class files, and reading out the class names. But because for the test module we know we'll only run tests belonging to testClasspath(), we can just take all .class files under testClasspath(), strip the ".class" suffix, and replace "/" with "." to get the test class names.
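Here is a sketch of these five steps, assuming the relevant Zinc relations have already been flattened into plain Scala maps (all parameter names and shapes below are assumptions), using os-lib paths as Mill does:

// Illustrative sketch only: the Zinc Analysis is assumed to have been flattened
// into plain maps beforehand; none of these names come from the real API.
object ZincSelectiveTestsSketch {
  def invalidatedTestClasses(
      prevStamps: Map[os.Path, String],            // source file -> content hash (previous run)
      currStamps: Map[os.Path, String],            // source file -> content hash (current run)
      srcToClasses: Map[os.Path, Set[String]],     // source file -> classes
      dependents: Map[String, Set[String]],        // class -> classes depending on it
      classToSrc: Map[String, os.Path],            // class -> defining source file
      srcToClassFiles: Map[os.Path, Set[os.Path]], // source file -> output .class files
      testClasspathRoot: os.Path                   // root of the testClasspath() output
  ): Set[String] = {
    // Step 1: changed source files, found by comparing stamps
    val changedSources = (prevStamps.keySet ++ currStamps.keySet)
      .filter(src => prevStamps.get(src) != currStamps.get(src))
    // Step 2: classes of the changed sources, spanned out through the dependency relation
    var invalidated = changedSources.flatMap(srcToClasses.getOrElse(_, Set.empty[String]))
    var frontier = invalidated
    while (frontier.nonEmpty) {
      frontier = frontier.flatMap(dependents.getOrElse(_, Set.empty[String])) -- invalidated
      invalidated = invalidated ++ frontier
    }
    // Steps 3 and 4: back to invalidated source files, then to their output .class files
    val invalidatedSources = invalidated.flatMap(classToSrc.get)
    val classFiles = invalidatedSources.flatMap(srcToClassFiles.getOrElse(_, Set.empty[os.Path]))
    // Step 5: keep only .class files under the test classpath, turn paths into class names
    classFiles.collect {
      case f if f.startsWith(testClasspathRoot) && f.last.endsWith(".class") =>
        f.relativeTo(testClasspathRoot).segments.mkString(".").stripSuffix(".class")
    }
  }
}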

The result

From the above algorithm, it's pretty clear that this approach can re-test more test classes than needed. Using the same example above, if a change happens in the test.Person.Age class, then the source file Person.scala will be marked as invalidated, and therefore all of test.Person, test.Person$, test.Person.Age, test.Person.Name will be marked, although test.Person.Name is not affected by test.Person.Age at all. This case can be handled better by bytecode callgraph analysis, because it works on .class files directly, so a change like this will only mark ../test/Person$Age.class.

As for the performance overhead, I'll add some metrics later. But this approach is in theory very lightweight. The only time-consuming operation IMO is parsing the Analysis file from disk (I/O bound). The dependency graph is also smaller than the bytecode callgraph, so the spanning process is much shorter. If I recall correctly, there was almost no overhead with this approach when I tested it locally.

Bytecode callgraph analysis

The bytecode callgraph analysis uses Mill's existing machinery to identify which method calls which method. The existing call graph analysis is pretty complete; we only need to remove the special treatment of Task methods. We take the spanningInvalidationTree, traverse all of its nodes, and extract the class names along the way (a rough sketch follows).
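A rough sketch of that traversal using ujson (which Mill already uses for JSON); the key format handling ('#' / '!' / '.' separating class from method) is inferred from the spanningInvalidationTree above and is an assumption, not the exact codesig format specification:

import scala.collection.mutable

// Rough sketch: walk the nested spanningInvalidationTree JSON and collect the
// class-name part of every "def ..."/"call ..." node key.
object SpanningTreeClassesSketch {
  def classNames(tree: ujson.Value): Set[String] = {
    val acc = mutable.Set.empty[String]
    def walk(node: ujson.Value): Unit = node.obj.foreach { case (key, child) =>
      val sig = key.stripPrefix("def ").stripPrefix("call ")
      val beforeArgs = sig.takeWhile(_ != '(')
      val sep = beforeArgs.indexWhere(c => c == '#' || c == '!')
      val classPart =
        if (sep >= 0) beforeArgs.take(sep)
        else beforeArgs.take(math.max(beforeArgs.lastIndexOf('.'), 0)) // "Class.staticMethod" style keys
      acc += classPart
      walk(child) // leaves are empty objects, so the recursion terminates
    }
    walk(tree)
    acc.toSet
  }
}

// e.g. classNames(ujson.read(os.read(spanningTreeJsonPath))), where spanningTreeJsonPath
// is a hypothetical path to the serialized tree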

The result

Bytecode callgraph analysis is more fine-grained than Zinc dependency-graph analysis; it can never miss a class that Zinc dependency-graph analysis detected. It also works directly on the .class files themselves, so it saves us some lookup time. The downside is that we need to calculate everything ourselves, and the callgraph is way bigger than Zinc's graph (more nodes and more edges), so the overhead will be much bigger.

One potential update for this approach (and maybe also for the Zinc approach) is that instead of trying to start from every method call and spanning out, we can start with all the methods called from testClasspath() instead. Doing it like this makes sure changes that are not related to the module's test classes are not considered when spanning out, reducing the size of the call graph.

HollandDM avatar Mar 28 '25 02:03 HollandDM

Your evaluation isn't really rigorous enough to draw any conclusions. Please write some integration tests for both PRs that clearly demonstrate the cases that each approach is able to handle that the other cannot, and perform ad hoc benchmarks so we can see the time taken to perform the analysis on scalalib.test, in milliseconds

lihaoyi avatar Mar 29 '25 21:03 lihaoyi

I finally have some more free time to work on this; for now I've introduced integration tests for both approaches. Time taken aside, it seems the methodCallGraph approach yields more fine-grained results than the Zinc-based approach. Details are commented in the two PRs. I'll add some timing measurements later this weekend.

HollandDM avatar Apr 11 '25 14:04 HollandDM

Thanks @HollandDM!

For next steps, let's go with the methodCallGraph/codesig approach. I expect there'll be scenarios where the codesig-based approach is inferior to the Zinc-based approach, but given the performance is comparable, I think the benefit of re-using the logic shared with codesig task invalidation wins out here: any improvements we make to the codesig module will benefit both use cases. And we cannot achieve the same with Zinc's analysis file, since its class/file-level granularity means it will never be able to give us the task/method-level granularity we need for the build.mill files.

Next steps would be:

  1. Closing the Zinc-based PR and focusing on the codesig PR

  2. Adding some docs to the various testing.adoc pages explaining what testQuick is and how to use it, with example tests

  3. Running the callgraph analysis logic through a profiler (e.g. VisualVM or Jprofiler) to see what the bottlenecks are, and if they're easily fixable then fix them for some free performance.

  4. Integration tests for testQuick, including the edge cases where it doesn't do well or does worse than the Zinc-based approach (are there any?)

Thanks for your work so far, looking forward to landing this!

lihaoyi avatar Apr 16 '25 16:04 lihaoyi

The codesig approach also seems applicable to code produced by different compilers, including Kotlin-written code, whereas the Zinc approach is limited to whatever is compiled with Zinc. Maybe there is a chance to make it a library that can be used outside of Mill, too.

lefou avatar Apr 19 '25 06:04 lefou

Hello @lihaoyi,

  1. Closed.
  2. Running the profiler with VisualVM shows that callGraphAnalysis only takes 1.5s on average. After a closer look, the only thing that can be fixed is to use some mutable code instead of Scala's collection methods (e.g. CallGraphAnalysis.indexGraphEdges uses a lot of toArray calls and concatenations between them; we could use a builder for this instead, as illustrated after this list), but the gain is minimal (1.5s -> 1.4s on average). I don't think trading readability for that is worth it.
  3. The ticket already has the integration tests; I can add some comments if they need to be adjusted, or do we need more tests?
  4. I think I will add the adoc files once we're good with the integration tests
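For context, here is a generic illustration of the kind of change discussed in point 2 (not the actual CallGraphAnalysis code): replacing repeated array materialization and concatenation with a single builder.

// Generic illustration only, not the real CallGraphAnalysis.indexGraphEdges code.
object BuilderVsConcatSketch {
  // Concatenation style: each step materializes and copies an intermediate array.
  def concatStyle(chunks: Seq[Iterable[Int]]): Array[Int] =
    chunks.map(_.toArray).foldLeft(Array.empty[Int])(_ ++ _)

  // Builder style: a single growable buffer, filled in one pass.
  def builderStyle(chunks: Seq[Iterable[Int]]): Array[Int] = {
    val b = Array.newBuilder[Int]
    chunks.foreach(b ++= _)
    b.result()
  }
}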

HollandDM avatar Apr 19 '25 13:04 HollandDM

@HollandDM got it, I think in that case the performance is probably fine for now, 0.1s probably isn't worth the optimizations

I think those integration tests are probably enough, just need example tests and docs we can use to explain this feature to users

lihaoyi avatar Apr 23 '25 20:04 lihaoyi