sdk icon indicating copy to clipboard operation
sdk copied to clipboard

build_runner + dev_compiler tests have become flaky on github actions

Open jakemac53 opened this issue 3 years ago • 26 comments

See related issue https://github.com/dart-lang/build/issues/3332

A number of our dev_compiler based integration tests have started either consistently failing or flakily failing. Unfortunately we don't get any chrome logs for these failures, and I haven't been able to figure out a way to get them. The tests also do not fail locally, only on github actions.

I presume this is some async timing related issue, but I can't confirm.

I bisected the issue based on released dev sdks (all I can test on github), the tests pass on 2.18.0-238.0.dev and fail on 2.18.0-250.0.dev.

jakemac53 avatar Jul 12 '22 19:07 jakemac53

cc @Markzipan @nshahan

jakemac53 avatar Jul 12 '22 19:07 jakemac53

Here is an example run, same exact tests and version solve, one passes on 2.18.0-238.0.dev and the other fails on 2.18.0-250.0.dev (you can expand the setup dart step on each to see the version).

jakemac53 avatar Jul 12 '22 19:07 jakemac53

Not sure how this is related, but another anecdote is that the tests which are opted out of null safety consistently fail, while the opted in tests only fail flakily.

jakemac53 avatar Jul 12 '22 19:07 jakemac53

@nshahan A couple of DDC changes were submitted on July 1st that may be the culprit. Most of the changes seem innocuous or unrelated to the timing issues Jake's seeing, so this change is the most suspicious: https://github.com/dart-lang/sdk/commit/624dd76b6f3af2ea84a0e5acfc5ec02f14b3f666#diff-0934de6a6f3b0f87927d259fb0473ea0e4756a222df7f8834eb96e6ffe822c66R85

Specifically the change to declaredNullability vs nullability, which seems to be semantically different according to the CFE. I can't think of an obvious test that could tickle this case and cause timing flakes though.

Markzipan avatar Jul 12 '22 20:07 Markzipan

While doing some debugging on the change that @Markzipan mentioned it looked like we were missing some of the "Undetermined Nullability" erasure that we were attempting to perform so I changed the test condition to use the FutureOrType.declaredNullability field instead of FutureOrType.nullability. My expectation was that this would cause more occurrences to be caught and avoid a compile time error when some pattern in the user code triggered this in the CFE representation (DDC throws when trying to compile a FutureOr with "Undetermined Nullability"). I suppose there could be a pattern that I missed that would result in more "Undetermined Nullability" getting through this erasure but that should result in a compile time failure, and it should be a consistent failure.

@jakemac53 Do you have any idea if the compile is failing? Have you tried running the tests suite locally many times to account for the flakiness?

nshahan avatar Jul 18 '22 20:07 nshahan

There are no compile time errors, I tried running several times locally but I will try also throttling the network to see if that helps repro the issue.

Update: throttling doesn't seem to reproduce it either

jakemac53 avatar Jul 20 '22 18:07 jakemac53

Is there a way I can debug the test running in chrome locally?

nshahan avatar Jul 20 '22 20:07 nshahan

Is there a way I can debug the test running in chrome locally?

Two ways, from the _test_null_safety directory in the build repo:

  1. dart run build_runner serve test and then load http://localhost:8080 in your browser and click the link to the opted_out_test.
  2. dart run build_runner test -- --pause-after-load --run-skipped -p chrome test/opted_out_test.dart, browser will open automatically, this also more directly mimics how the tests will actually load and run, probably more useful.

jakemac53 avatar Jul 20 '22 22:07 jakemac53

I've been investigating and debugging but nothing jumps out at me as an obvious error. Finally I tried building at 2.18.0-238.0.dev and 2.18.0-250.0.dev and diffing all the files under the .dart_tool/build directory. There are many diffs but none in any .js files. For that reason I don't think the failures are due to DDC.

Between 2.18.0-238.0.dev and 2.18.0-250.0.dev there are only three changes to DDC: https://github.com/dart-lang/sdk/commit/0f2e021da8635fd3a6b4db98db884e0bad4852ef https://github.com/dart-lang/sdk/commit/f3daca2ef679ebaf952e677901bd8ec4bc027ab8 https://github.com/dart-lang/sdk/commit/624dd76b6f3af2ea84a0e5acfc5ec02f14b3f666.

There are many more VM and analyzer changes (not sure if analyzer factors into the workflow of running theses tests).

Can we try running with any more dev builds in between as github actions to see if that can narrow down the culprit CLs?

Browsing this link and moving backwards in time makes it easy to see the various CLs that could be the cause https://github.com/dart-lang/sdk/commits/2.18.0-250.0.dev

nshahan avatar Jul 22 '22 22:07 nshahan

I am not aware of any way to run with a custom SDK on github actions... we can only download the sdks available on google cloud storage.

jakemac53 avatar Jul 25 '22 15:07 jakemac53

I did a little digging and I think that all dev releases are uploaded to cloud storage, but not all are signed so they don't appear as available in the dev channel.

The good new is that there is now a release "flavor" named raw that should allow you to run with any dev version https://github.com/dart-lang/setup-dart.

Let's do some further bisecting of the range to reduce the size of the blamelist.

nshahan avatar Jul 25 '22 22:07 nshahan

Trying that here https://github.com/dart-lang/build/pull/3331

jakemac53 avatar Jul 26 '22 17:07 jakemac53

Me and Nick spent some more time today looking into this but have hit a dead end. We were able to get some more information though:

  • The first failing version is 2.18.0-247.0.dev
  • Were were able to observe an exception, "Cannot read properties of undefined (reading 'async')" with this stack:
    at Object.execCb (http://localhost:41243/eFQET3WQRsE%2BJ3KRouHLGF6iFBeTd7Rw/test/packages/build_web_compilers/src/dev_compiler/require.js:1696:33)
    at Module.check (http://localhost:41243/eFQET3WQRsE%2BJ3KRouHLGF6iFBeTd7Rw/test/packages/build_web_compilers/src/dev_compiler/require.js:878:55)
    at http://localhost:41243/eFQET3WQRsE%2BJ3KRouHLGF6iFBeTd7Rw/test/packages/build_web_compilers/src/dev_compiler/require.js:630:33
    at each (http://localhost:41243/eFQET3WQRsE%2BJ3KRouHLGF6iFBeTd7Rw/test/packages/build_web_compilers/src/dev_compiler/require.js:59:31)
    at breakCycle (http://localhost:41243/eFQET3WQRsE%2BJ3KRouHLGF6iFBeTd7Rw/test/packages/build_web_compilers/src/dev_compiler/require.js:619:17)
    at http://localhost:41243/eFQET3WQRsE%2BJ3KRouHLGF6iFBeTd7Rw/test/packages/build_web_compilers/src/dev_compiler/require.js:632:29
    at each (http://localhost:41243/eFQET3WQRsE%2BJ3KRouHLGF6iFBeTd7Rw/test/packages/build_web_compilers/src/dev_compiler/require.js:59:31)
    at breakCycle (http://localhost:41[243](https://github.com/dart-lang/build/runs/7527732968?check_suite_focus=true#step:6:247)/eFQET3WQRsE%2BJ3KRouHLGF6iFBeTd7Rw/test/packages/build_web_compilers/src/dev_compiler/require.js:619:17)
    at http://localhost:41243/eFQET3WQRsE%2BJ3KRouHLGF6iFBeTd7Rw/test/packages/build_web_compilers/src/dev_compiler/require.js:632:29
  • the stack trace isn't helpful though. The async variable appears to get initialized at the very top of the dart_sdk.js file.

jakemac53 avatar Jul 26 '22 20:07 jakemac53

From what I can tell, the only SDK changes at that version are those DDC Future updates and some analyzer-related command and test framework changes: https://github.com/dart-lang/sdk/commit/6ca7a185e036365b57acab3db73e8a301690a2ed

Is it possible that bumping test or webdev caused some odd interactions with DDC? Is it possible to run these tests with arbitrary files reverted/updated?

Markzipan avatar Jul 26 '22 22:07 Markzipan

@Markzipan! You just pointed out something that is so confusing but does add some explanation to what is going on so thank you!!!

I have been looking at this view which makes me think that the DDC changes were part of 2.18.0-246.0.dev. They appear in the timeline before the commit labeled 2.18.0-246.0.dev so that seemed logical to me.

But... Your link to the contents of the merge commit for 2.18.0-247.0.dev shows the changes! If I look at the merge commit for 2.18.0-246.0.dev they are not there! That caused me to look at the change directly and inspect the tags it is present in and what do you know?!? The change is not present in 2.18.0-246.0.dev!

I guess the ordering of merge commits on the timeline doesn't represent what commits are part of the merge. That means timeline view of commits in order is not useful to understand which commits are part of which release (my mistaken assumption). Now that we know the DDC commits are in-fact part of the exact change that is causing the problem I will double check again to see if I can find a problem.

nshahan avatar Jul 26 '22 23:07 nshahan

I just double checked my diffing test and I still see no difference in the output compiled by DDC at the two versions. To document my process:

  1. Install 2.18.0-246.0.dev
  2. rm -rf build/_test_null_safety/.dart_tool
  3. $ cd build/_test_null_safety
  4. $ dart pub get
  5. $ dart run ../build_runner/bin/build_runner.dart test -- --pause-after-load --run-skipped -p chrome test/opted_out_test.dart
  6. $ cp -r .dart_tool ~/dart_tool-2.18.0-246.0.dev
  7. Install 2.18.0-247.0.dev
  8. rm -rf build/_test_null_safety/.dart_tool
  9. $ cd build/_test_null_safety
  10. $ dart pub get
  11. $ dart run ../build_runner/bin/build_runner.dart test -- --pause-after-load --run-skipped -p chrome test/opted_out_test.dart
  12. $ cp -r .dart_tool ~/dart_tool-2.18.0-247.0.dev
  13. $ diff -qr -x '*.dill' ~/dart_tool-2.18.0-246.0.dev ~/dart_tool-2.18.0-247.0.dev

For the diff, I'm excluding .dill files because they all contain diffs. I believe this is because there is a dart sdk version encoded into them. The remaining results are:

Files dart_tool-2.18.0-246.0.dev/build/fcd1995bc647fb959e82ea360c6c2c9a/asset_graph.json and dart_tool-2.18.0-247.0.dev/build/fcd1995bc647fb959e82ea360c6c2c9a/asset_graph.json differ
Files dart_tool-2.18.0-246.0.dev/build_resolvers/sdk.sum.deps and dart_tool-2.18.0-247.0.dev/build_resolvers/sdk.sum.deps differ
Files dart_tool-2.18.0-246.0.dev/package_config.json and dart_tool-2.18.0-247.0.dev/package_config.json differ
Only in dart_tool-2.18.0-246.0.dev/pub/bin/test: test.dart-2.18.0-246.0.dev.snapshot
Only in dart_tool-2.18.0-247.0.dev/pub/bin/test: test.dart-2.18.0-247.0.dev.snapshot

The diffs in sdk.sum.deps and package_config.json are only the SDK version numbers and timestamps. That just leaves the asset_graph.json and test.dart snapshots as potential differences.

nshahan avatar Jul 29 '22 18:07 nshahan

Fwiw I have narrowed this down to the commit with this sha 624dd76b6f3af2ea84a0e5acfc5ec02f14b3f666. A custom sdk built from the tag for 2.18.0-246.0.dev passes, and cherry picking in that sha it fails. Trying to see if I can figure out any more meaningful diffs.

Update: I also only see a diff in the .dill files but not the .js files.

jakemac53 avatar Jul 29 '22 21:07 jakemac53

Couple of things that might help with debugging:

  • I noticed that chrome's stdout and stderr are not printed out in package:test, somewhere here: https://github.com/dart-lang/test/blob/b6aba5544628730b7d6a38eae1aef9117a1bb235/pkgs/test/lib/src/runner/browser/chrome.dart#L80 that is what we do in our tests in webdev: https://github.com/dart-lang/webdev/blob/6b93e2756e3fd410d4618274e2ee62c75416c8ec/dwds/test/fixtures/context.dart#L190

  • I also could not find if package:test listens to chrome debug console. Maybe worth adding? We do it here in webdev (https://github.com/dart-lang/webdev/blob/37bf4af1b0961f053e7cf4990a34ca5fd0ad0fd4/dwds/lib/src/services/chrome_proxy_service.dart#L918) that calls WipConnection.runtime.onConsoleAPICalled down the stream. I could not find calls to it anywhere in package:test....

  • I tried to repro a mix-mode app run with weak null safety in webdev tests, but it runs fine in the CI so I am not sure if DDC is the culprit... Is it possible that some library update in build repo have caused it? Here is my try: https://github.com/dart-lang/webdev/pull/1707. Specifically, the e2e tests that run evaluateInFrame or evaluate would be failing it we had a repro: https://github.com/dart-lang/webdev/runs/7618977685?check_suite_focus=true:

image

annagrin avatar Aug 01 '22 22:08 annagrin

The diffs in sdk.sum.deps and package_config.json are only the SDK version numbers and timestamps. That just leaves the asset_graph.json and test.dart snapshots as potential differences.

Just had another thought - given the trace you mentioned above seems to indicate that async library import is used but is not defined in one of the .js modules - maybe the asset.graph.json is worth looking into...

annagrin avatar Aug 01 '22 23:08 annagrin

Here is the start of a diff between the dart_sdk.js files. The + are the changes we sometimes see where it breaks.

@@ -1,5 +1,6 @@
-define([], (function load__dart_sdk() {
+define(['dart_sdk'], (function load__dart_sdk(dart_sdk) {
   'use strict';
+  const async = dart_sdk.async;
   const _library = Object.create(null);
   const dart = Object.create(_library);
   dart.library = _library;
@@ -13,7 +14,7 @@
   var _js_primitives = Object.create(dart.library);
   var _metadata = Object.create(dart.library);
   var _native_typed_data = Object.create(dart.library);
-  var async = Object.create(dart.library);
+  var async$ = Object.create(dart.library);
   var collection = Object.create(dart.library);
   var convert = Object.create(dart.library);
   var developer = Object.create(dart.library);
@@ -293,28 +294,29 @@
     ObjectAndStackTraceTovoid: () => (T$.ObjectAndStackTraceTovoid = dart.constFn(dart.fnType(dart.void, [core.Object, core.StackTrace])))(),
     VoidTovoid: () => (T$.VoidTovoid = dart.constFn(dart.fnType(dart.void, [])))(),
     dynamicTovoid: () => (T$.dynamicTovoid = dart.constFn(dart.fnType(dart.void, [dart.dynamic])))(),
-    _FutureOfNull: () => (T$._FutureOfNull = dart.constFn(async._Future$(core.Null)))(),
+    _FutureOfNull: () => (T$._FutureOfNull = dart.constFn(async$._Future$(core.Null)))(),
     VoidTo_FutureOfNull: () => (T$.VoidTo_FutureOfNull = dart.constFn(dart.fnType(T$._FutureOfNull(), [])))(),
     FutureOfNull: () => (T$.FutureOfNull = dart.constFn(async.Future$(core.Null)))(),
     FutureNOfNull: () => (T$.FutureNOfNull = dart.constFn(dart.nullable(T$.FutureOfNull())))(),
     ObjectTovoid: () => (T$.ObjectTovoid = dart.constFn(dart.fnType(dart.void, [core.Object])))(),
-    dynamicToFuture: () => (T$.dynamicToFuture = dart.constFn(dart.fnType(async.Future, [dart.dynamic])))(),
-    _FutureOfString: () => (T$._FutureOfString = dart.constFn(async._Future$(core.String)))(),
-    _FutureOfbool: () => (T$._FutureOfbool = dart.constFn(async._Future$(core.bool)))(),
+    dynamicToFuture: () => (T$.dynamicToFuture = dart.constFn(dart.fnType(async$.Future, [dart.dynamic])))(),
+    _FutureOfString: () => (T$._FutureOfString = dart.constFn(async$._Future$(core.String)))(),
+    _FutureOfbool: () => (T$._FutureOfbool = dart.constFn(async$._Future$(core.bool)))(),
     VoidTobool: () => (T$.VoidTobool = dart.constFn(dart.fnType(core.bool, [])))(),
     boolToNull: () => (T$.boolToNull = dart.constFn(dart.fnType(core.Null, [core.bool])))(),
     voidToNull: () => (T$.voidToNull = dart.constFn(dart.fnType(core.Null, [dart.void])))(),
-    _FutureOfint: () => (T$._FutureOfint = dart.constFn(async._Future$(core.int)))(),
+    _FutureOfint: () => (T$._FutureOfint = dart.constFn(async$._Future$(core.int)))(),
     ObjectAndStackTraceToNull: () => (T$.ObjectAndStackTraceToNull = dart.constFn(dart.fnType(core.Null, [core.Object, core.StackTrace])))(),
-    FutureOfvoid: () => (T$.FutureOfvoid = dart.constFn(async.Future$(dart.void)))(),
+    FutureOfvoid: () => (T$.FutureOfvoid = dart.constFn(async$.Future$(dart.void)))(),
     VoidToFutureOfvoid: () => (T$.VoidToFutureOfvoid = dart.constFn(dart.fnType(T$.FutureOfvoid(), [])))(),
-    EventSinkTo_ConverterStreamEventSink: () => (T$.EventSinkTo_ConverterStreamEventSink = dart.constFn(dart.fnType(convert._ConverterStreamEventSink, [async.EventSink])))(),
+    EventSinkTo_ConverterStreamEventSink: () => (T$.EventSinkTo_ConverterStreamEventSink = dart.constFn(dart.fnType(convert._ConverterStreamEventSink, [async$.EventSink])))(),
     JSArrayOfUint8List: () => (T$.JSArrayOfUint8List = dart.constFn(_interceptors.JSArray$(typed_data.Uint8List)))(),
     ObjectNAndObjectNTovoid: () => (T$.ObjectNAndObjectNTovoid = dart.constFn(dart.fnType(dart.void, [T$.ObjectN(), T$.ObjectN()])))(),
     ObjectNToObjectN: () => (T$.ObjectNToObjectN = dart.constFn(dart.fnType(T$.ObjectN(), [T$.ObjectN()])))(),
     EmptyIteratorOfNeverL: () => (T$.EmptyIteratorOfNeverL = dart.constFn(_internal.EmptyIterator$(dart.legacy(dart.Never))))(),

natebosch avatar Aug 02 '22 18:08 natebosch

We have now been able to reproduce this issue locally if you limit the number of workers using an environment variable first. $ export BUILD_MAX_WORKERS_PER_TASK=1

We believe the issue is related to the persistent worker mode and what it uses as the dependencies of a given compile. In the example diff @natebosch posted above, it looks like during this particular compile of the weak mode SDK module, the dart_sdk module is shown as a dependency. There should be no dependencies in the SDK module. During other runs we saw the same problem manifest in the sound mode SDK module. This leads us to believe that the change I made in DDC has slightly changed the timing of the builds and causes an interleaving of compiles that doesn't correctly set the right dependencies for the SDK modules.

nshahan avatar Aug 02 '22 19:08 nshahan

  • I noticed that chrome's stdout and stderr are not printed out in package:test, somewhere here: https://github.com/dart-lang/test/blob/b6aba5544628730b7d6a38eae1aef9117a1bb235/pkgs/test/lib/src/runner/browser/chrome.dart#L80 that is what we do in our tests in webdev: https://github.com/dart-lang/webdev/blob/6b93e2756e3fd410d4618274e2ee62c75416c8ec/dwds/test/fixtures/context.dart#L190

We print browser output in cases where we time out waiting for the bootstrapping code to connect a channel back to the runner (though that hasn't been as useful as we'd like). We don't currently do anything with the browser output when a test fails or times out, which is how this is surfacing.

Should we look for some other heuristic for when browser output might be useful? Maybe any failure?

Should we consider a flag to emit browser output?

cc @jakemac53 @grouma

natebosch avatar Aug 04 '22 00:08 natebosch

Should we look for some other heuristic for when browser output might be useful? Maybe any failure?

I believe Chrome always adds some logs to the STDERR. If I'm mistaken, it probably makes sense to log the STDERR whenever there is a test failure.

Should we consider a flag to emit browser output?

Does it make sense to capture the output and conditionally log it in https://pub.dev/packages/browser_launcher?

grouma avatar Aug 04 '22 04:08 grouma

Fwiw I did add some extra logging in a branch of package:test I was using to debug this issue, but it didn't output anything helpful.

jakemac53 avatar Aug 04 '22 14:08 jakemac53

Fwiw I did add some extra logging in a branch of package:test I was using to debug this issue, but it didn't output anything helpful.

How did you get the stack trace from https://github.com/dart-lang/sdk/issues/49441#issuecomment-1195942231 ?

natebosch avatar Aug 04 '22 15:08 natebosch

That was by listening for runtime exceptions through the debug service https://github.com/dart-lang/test/commit/2a8455e873b16dad64c9d583db7adb782232a371

jakemac53 avatar Aug 04 '22 15:08 jakemac53

Setting as P3 because workaround was landed to avoid the issue entirely. It still would be nice to fix to avoid any possible new issues in the future though.

nshahan avatar Nov 18 '22 23:11 nshahan