test Add support for batch compiling all test entry points

Currently when doing dart test or flutter test we seem to load *_test.dart files one by one.

Would it be possible to change the implementation of test loading to support batch loading, so that all entry points are loaded together?

The idea here is to reduce compilation overhead: i.e. compile all *_test.dart files together into Kernel using a single CFE process and then use this Kernel blob to run individual test entry points.

/cc @jakemac53 @natebosch /cc @jensjoha @derekxu16

Nov 14 '23 09:11 mraleph

We currently handle test suites independently as soon as we read them. Batching up all tests for a platform before taking any action on an individual suite will require some restructuring of the runner. I don't have any estimate for the work involved.

We do use a FrontendServerClient and I think in principle we can reuse state between compiles, but we currently reset the client between compiles. Does that cause each suite to do the full work of the compile?

@jakemac53 do you recall if we looked at keeping state between suite compiles?

Nov 14 '23 16:11 natebosch

Afaik we do provide the previous compilation result when we compile each test, which isn't perfect but generally works pretty well assuming most tests share roughly the same transitive deps.

Grouping the tests into a single compile also means that no tests can run if a single test fails to compile, which could be very annoying during development of new features (it may be intentional that some of your tests can't be ran yet).

Nov 14 '23 16:11 jakemac53

Grouping the tests into a single compile also means that no tests can run if a single test fails to compile, which could be very annoying during development of new features (it may be intentional that some of your tests can't be ran yet).

This might be most useful as a flag you enable for CI, or maybe for some interactive cases where you expect everything to compile. When the code is being edited, it's more likely that running a single test suite will be more useful than running all test suites with a single compile.

Nov 15 '23 00:11 natebosch

Grouping the tests into a single compile also means that no tests can run if a single test fails to compile, which could be very annoying during development of new features (it may be intentional that some of your tests can't be ran yet).

This did come up in our discussions with @jensjoha and I don't think compile time errors preclude batch compiles because you can just prune libraries which contain compilation errors from the dill and get runnable dill. This way you get to have your cake and eat it too: you report all compilation errors and you run all tests without compilation errors.

Nov 15 '23 11:11 mraleph

We don't actually read the kernel file at all and there is no public API for doing so, so I am not sure how we could do that. Possibly the frontend server could have a special entry point and handle this logic (as well as reporting exactly which files failed)?

Either way though, I am not currently convinced that the strategy we have right now is insufficient, when I run VM tests locally in real packages I never see any loading messages after the very first one, test compilation is not getting in my way or dominating the overall time to run tests in real world packages that we maintain.

I could see if a package has a single tiny unit test in each file this could dominate more but I don't know that we need to optimize for that scenario (and it should still perform decently anyways).

Nov 15 '23 17:11 jakemac53

I just did some measurements for @dnfield, so I am going to include them here. If we take test/rendering folder of Flutter package and run all tests from a clean state (flutter clean), on M1:

$ time flutter test -v test/rendering 2>&1 | tee /tmp/testing.log
... 

________________________________________________________
Executed in   20.94 secs    fish           external
   usr time   46.18 secs    0.10 millis   46.18 secs
   sys time    9.84 secs    2.05 millis    9.83 secs
$  cat /tmp/testing.log | grep "Compiling.*took.*" | awk '{t+=substr($6, 1, length($6)-2)} END{print NR, t}'
71 9956

So there are 71 test files and 10s spent in compilation (out of total 20 seconds of runtime).

Now if a manually collapse all tests together by renaming _test.dart files into _t.dart files and creating a single mega_test.dart which imports all of *_t.dart files and invokes all of mains I get the following:

$ flutter test -v test/rendering 2>&1 | tee /tmp/testing2.log
$ cat /tmp/testing2.log | grep "Compiling.*took.*" | awk '{t+=substr($6, 1, length($6)-2)} END{print NR, t}'
1 3736

So this will shave ~6.3 seconds, which is approximately 1/3 of the total running time.

This is on M1 - on GH Actions this translates into something more like 30s of savings of 100s total testing time, because I have observed GH to be 5-6x slower than my M1.

That being said: savings on compilation are just part of the story here. Having mega dill allows to use an isolate group for running tests (spawn test process with mega dill loaded as "root" isolate - then run individual test entry points by spawning isolates from root isolate). This will allow to share all kinds of VM metadata and JITed code. Which would lead to faster tests.

We don't actually read the kernel file at all and there is no public API for doing so, so I am not sure how we could do that. Possibly the frontend server could have a special entry point and handle this logic (as well as reporting exactly which files failed)?

Yep, I would think this is going to be an extension to CFE/frontend server logic.

Nov 15 '23 20:11 mraleph

I can't speak to flutter tests, it is a different platform that likely has very different behavior, and also different constraints.

That platform is developed in the flutter repo so we might want to move this issue there?

Nov 15 '23 21:11 jakemac53

If you think that no changes will be necessary in package:test or package:test_core to fix it on Flutter side then I guess we should just close this one.

I can't easily reproduce any bad behaviour by doing dart test with synthetic inputs (e.g. a bunch of tests which all import analyser package). So I guess it is only flutter test that is somehow suffering from these overheads.

Nov 15 '23 21:11 mraleph

If you think that no changes will be necessary in package:test or package:test_core to fix it on Flutter side then I guess we should just close this one.

Well, now that you mention it we probably would need to do something in order to enable batch compilation at all, because today the way platforms work they are only told to "load" (which today involves compiling) individual tests.

So maybe it would need to be a special kind of platform which instead is given all the tests at once to "load".

But, I would start instead by trying to improve upon the existing model (which has other benefits) first, since it does work well for VM tests, it should also be able to work well for flutter tests?

Nov 15 '23 21:11 jakemac53

Well, now that you mention it we probably would need to do something in order to enable batch compilation at all, because today the way platforms work they are only told to "load" (which today involves compiling) individual tests.

Yes, this is what I was referring to in the first paragraph of https://github.com/dart-lang/test/issues/2138#issuecomment-1810650889

Adding support for batch compilation could involve some significant changes to the test runner.

Nov 15 '23 21:11 natebosch

I brought this up in another context, but will put it here just to surface it again, one reason we don't compile all tests together is that they don't necessarily support the same platforms (namely, dart: libraries).

So it might be intentional that some tests don't for instance compile for the VM, because they are specifically targetting web (and have dart:html etc imports).

In general I think flutter handles this a lot differently than the normal test runner, so I am not sure if this is relevant or not to flutter, but it does matter for general Dart tests.

Nov 15 '23 21:11 jakemac53

one reason we don't compile all tests together is that they don't necessarily support the same platforms (namely, dart: libraries).

Yes, and this will mainly limit where and how we can add batching. We filter which suites get passed to a given platform, so it should be feasible to have a filtered batch to compile.

Edit: Also, as long as the compilation errors are only surfaced when the platform would want to run the test, the filtering shouldn't even need to happen before compile (outside of performance concerns).

Nov 15 '23 21:11 natebosch

I wrote a custom plugin that overrides load so that it uses something already compiled. It did not require any changes to package test.

Nov 15 '23 22:11 dnfield