sdk icon indicating copy to clipboard operation
sdk copied to clipboard

Dart SDK breaking changes policy is missing expectations for the Flutter SDK and google3

Open matanlurey opened this issue 1 year ago • 4 comments

AFAICT, @brianquinlan correctly followed the breaking changes approval process here:

  • https://github.com/dart-lang/sdk/issues/53863
  • https://groups.google.com/a/dartlang.org/g/announce/c/ScSc1iQgyXY

... however that ended up with a build breakage for Flutter due to at least one use of a "fake" of the API (https://github.com/flutter/flutter/blob/1b8742b9dc49c66fadd40a446d3d078046b787f5/packages/flutter_tools/test/src/fakes.dart#L4).

I was looking at https://github.com/dart-lang/sdk/blob/master/docs/process/breaking-changes.md and could not find any mention of how to land breaking changes that will impact either (a) the Flutter SDK, or (b) google3. Both of these repositories are auto-rolled into, meaning they are semantically like a mono-repo, and we do not have full-time staffing capable of pausing rollers and doing migrations.

My suggestion is to add the following addendum to the breaking changes policy:

If a change would break the Flutter tree[^1] or google3[^2], the offending code must be fixed ahead of time, or explicit approval must be granted to land a breaking change without fixing either repository (i.e. a critical security vulnerability).

These expectations are important because the Dart and Flutter teams work together very closely, and auto-rollers mean that multiple teams' output is blocked if a single team/individual introduces a breaking change without landing fixes in advance.

[^1]: I am not sure how the Dart team can/does determine this, and defer to @athomas and @whesse. [^2]: Ditto, I am guessing this is CBuild?

matanlurey avatar Feb 16 '24 22:02 matanlurey

/cc @mit-mit @vsmenon @itsjustkevin - not sure who should own this.

matanlurey avatar Feb 16 '24 22:02 matanlurey

Item Details
Summary The Dart SDK breaking changes policy does not address how to land breaking changes that will impact the Flutter SDK or google3.
Triage to area-meta (high confidence)

(what's this?)

dart-github-bot avatar Feb 16 '24 22:02 dart-github-bot

If I had noticed that Flutter was affected, I would have added that fact the the Impact section of https://github.com/dart-lang/sdk/issues/53863 and forward-fixed it before landing my change.

The problem is that it is hard to know, with certainty, what will break with this sort of change. And testing a change with Flutter before landing it is pretty painful i.e. AFAIK, you'd have to do an engine build and then run the Flutter tests locally.

brianquinlan avatar Feb 17 '24 02:02 brianquinlan

Yeah to be clear I'm not sure how you were expected to do better - I'm hoping whomever updates the policy guide can also explain how to be compliant without tanking productivity.

matanlurey avatar Feb 17 '24 02:02 matanlurey

Let's say we had that in our policy, the problem would still have happened as the user was unaware that they were breaking Flutter. The change was also approved by Flutter: https://github.com/dart-lang/sdk/issues/53863#issuecomment-1804506206. So if the additional paragraph doesn't prevent the issues we want to prevent, why should we change the policy? I'm not totally opposed, given that the text just codifies what we're already doing but it seems like a no-op.

For google3, we're strengthening the presubmit soon and we're already catching pretty much all regressions before rolling anything.

For Flutter, it's trickier. How can the Dart team prevent these regressions? The Dart team has been forced to maintain tooling ourselves (3xH, flutter-frontend-try and flutter-analyze-try, now the Dart & Flutter monorepo). If this build breakage reached flutter/flutter, that means that flutter/engine's testing of SDK rolls is insufficient. Dart's flutter-analyze bot has also gotten out of sync over time (because it doesn't get updated when Flutter's CI changes), which likely was another factor leading to the breakage.

So I think a better policy change would go beyond "you can't break our privileged project" to propose a mechanism by which potentially breaking changes can be vetted. And, that mechanism is maintained by the team that doesn't want to be broken. Similar to the internal “If you liked it, you shoulda put a test on it.” rule and the Flutter team's customer test repo. To guarantee that we're not breaking Flutter, these tests MUST run on the Dart SDK roll PRs. Alternatively, they could also run on the Dart & Flutter monorepo to be detected before roll PRs are created. To enable to the Dart team to effectively test and fix the breakage, there needs to be a Dart presubmit covering this (typically the Dart & Flutter monorepo covers this).

athomas avatar Feb 19 '24 11:02 athomas

Thanks for the response @athomas. My bottom line here is:

  1. The breaking changes approval process should note that we have 2 places that breaking changes must be resolved first
  2. The person who does the breaking change should be responsible for fixing it first

Whether or not we have the tooling for (2) is a different question unfortunately.


Others answers inline:

the problem would still have happened as the user was unaware that they were breaking Flutter

Not entirely true, the user just didn't have the right regular expression when they were looking for possible breakages.

why should we change the policy?

I'm a fan of explicit policies (this is what we DO and DO NOT do) versus having to infer what the real rules are.

For google3, we're strengthening the presubmit soon

👏🏼

For Flutter, it's trickier. How can the Dart team prevent these regressions? The Dart team has been forced to maintain tooling ourselves ... flutter/engine's testing of SDK rolls is insufficient

I understand the pain here, I think. @yjbanov has a great post here complaining from a Flutter POV: https://github.com/flutter/flutter/issues/141734.

Regardless, we need to fix this problem. It's 2024. As long as we're going to use an auto-roller from dart->engine->flutter, we need to have either (a) certain levels of assurances that we're not accidentally breaking the roll, or blocking dozens of other commits because a single commit introduced an accidental breaking change or (b) a new reality that doesn't require rolling. It's frankly unacceptable that we are spending so many iteration cycles of our colleagues effectively running manual CI/CD offline.

... a better policy change would go beyond "you can't break our privileged project"

I don't look at it this way, fwiw. Flutter is not a "privileged project", we just have an auto-roller that is expected to work nearly 100% of the time. Perhaps thats the wrong model and we should be rolling the Dart SDK less often (doubt) or the Dart team gardeners should "own" the SDK -> engine roll, not the engine gardeners.

Put another way: it can't be the responsibility of whomever happens to be the engine CI gardener to notify the Dart team they removed an API that the Flutter SDK uses. That's an extremely inefficient process, causes hours to days of manual reverts/pain, and blocks our entire org from making progress for something that should be preventable.

To enable to the Dart team to effectively test and fix the breakage, there needs to be a Dart presubmit covering this (typically the Dart & Flutter monorepo covers this).

I'm confused what the ask is. What could I do or add that would give you this functionality?

matanlurey avatar Feb 20 '24 22:02 matanlurey

/cc @godofredoc as well ^

matanlurey avatar Feb 20 '24 22:02 matanlurey

@matanlurey I am reading this assuming your asks here are:

  1. Update the policy to be more explicit about the breaking change approval process.
  2. Brainstorm how we may prevent the overall impact, with the caveat that this is a process that will take time and resources.

I'm I misunderstanding your intent?

itsjustkevin avatar Feb 20 '24 22:02 itsjustkevin

Update the policy to be more explicit about the breaking change approval process.

Yes, at this point I'd settle for:

You're expected not to break Flutter, but we have no tools to help you do that, so ¯\_(ツ)_/¯ good luck!

matanlurey avatar Feb 20 '24 23:02 matanlurey

@athomas has explained well what the Dart EngProd team is doing on its CI to catch these breakages. I just wanted to expand on this, with more details and links. But I'm not really making any new points that he didn't, just repeating them.

The Dart CI is working to detect breakage in Flutter engine and framework due to SDK changes, and prevent or rollback those changes, in the following ways:

  1. The monorepo roller and CI detects breakages in engine builds and some framework tests, pinning them down to a small range of commits (merged commits from Dart, engine, and framework). It is available as a tryjob for SDK changes, but not added automatically to any changes.
    1. We would like to add it as a tryjob especially to two types of changes: analyzer or linter changes, and DEPS changes. Significant blockers include the latency and resource use of the tryjob, and the tryjob not testing DEPS changes correctly (the monorepo post-submit CI does test them correctly).
    2. Additional framework tests can be added to the monorepo if they are important sources of failures. For example, the customer_tests shard was added recently.
  2. Many of the hardest rollbacks recently have been analyzer, linter, and deprecation changes. These are also not found until they hit framework because Flutter engine CI does not run these framework analysis tests, causing longer rollbacks and latency. I filed issue https://github.com/dart-lang/sdk/issues/54869 about making the Dart CI's flutter-analyze tryjobs and CI match the Flutter framework's analysis tests, and that will be a big start on catching these before they land in the SDK. That issue mentions both fixing the fast flutter-analyze bot, and adding the tests to the slower monorepo bots. These could also be caught when testing the roll of SDK into engine, which would put the least additional resource demands on the system, but would stop them before they land in engine, not before they land in framework. It makes sense that the engine CI has never run framework analysis tests, since no change in engine could break them, but if they were added just to run on SDK rolls, it would be running only exactly when they are needed.

So I think there is a proposed path to catch more breakages in the Dart SDK before they roll into engine and framework, sometimes even before submitting. We know where the biggest sources of breakage are coming from, and should focus on those areas.

I wouldn't agree on characterizing the sdk->engine and engine->flutter rolls as autorolled, without manual intervention. The rolls are created automatically, and tested, but they require manual rolls or intervention when the tests fail. Flutter has been using resources to handle these manual rolls when needed, through the engine sheriff, so saying that their isn't the manpower to handle rollbacks and manual fixes would be a change in the workflow, moving the responsibility from one place to another. This is not a new responsibility that needs to be assigned - it currently has owners.

Long-term planning to create a logical monorepo, or at least to keep tip-of-tree green and keep breaking changes from landing in any repo (also including plugins and other packages used by framework) are important goals, that do need work devoted to them.

Anything that involves running engine and framework testing on Dart SDK changes will need to run on tip-of-tree Dart, engine, and framework, including tryjobs and local testing, and our current monorepo setup is working well as a starting point for that: CI: https://ci.chromium.org/p/dart/g/monorepo/console and README: https://dart.googlesource.com/monorepo/+/refs/heads/main

(this paragraph is an aside, only related slightly to the issue) An issue that often is mentioned, related to this area, is how to atomically land changes in engine, framework, and Dart SDK when something cannot be landed in a backwards-compatible way. Adding metainformation to CLs and PRs linking them together seems to be the right thing, so they could be handled automatically. Currently, breaking atomic changes are handled manually by the manual rolls. In practice, this issue which seems like a big blocker does not happen that much in practice, and is not happening as much as unintended breakage by rolls that could be backwards compatible or fixed forward in advance.

So this discussion, which focuses on what Dart CI could be doing to detect and stop breakages from rolling in to Flutter engine and framework, should be informed by what the current systems are for detecting and stopping these breakages, and what the next steps should be for these systems. I'm not sure if saying "we have no tools to help you do that, shrug" is because people haven't been made aware of these systems, or if they don't see them as a workable starting point for tackling the problem.

In prior months, the Flutter Engine Sheriff chat has been full of early notifications of breakages in upcoming rolls, posted based on monorepo CI failures. So the impact of breaking SDK changes on Flutter has already been reduced significantly from what it would be otherwise. These have allowed rollbacks to happen early, or fixes to be landed before the roll. Recent failures have been missed because the relevant tests weren't on monorepo, or there has been no awareness of this dashboard as an important thing to check and keep green. So a missing step is perhaps to add more use of monorepo CI into the workflows of the teams, and improve it in the areas where it is lacking (mainly more framework test shards being run on it, and it being used for more tryjobs, both of which raise resource use considerations).

whesse avatar Feb 21 '24 17:02 whesse

I chatted with @athomas offline a bit, but the tl;dr is I don't think the Dart team or Dart infra is doing anything wrong - I'm just looking for more explicit guidance on what folks are supposed to do, and what blocking issue(s) or policy changes are needed to make this easier for our teams.

matanlurey avatar Feb 21 '24 23:02 matanlurey

After back channeling with @matanlurey, I drafted a PR: https://github.com/dart-lang/sdk/pull/54982

Key points:

  • Downstream breakages must be fixed.
  • Breakage = obvious breakages found in code search and breakages detected by downstream presubmits.
  • Downstream presubmit for google3 is ~cbuild~ the global presubmit (go/dart-breaking-change).
  • Downstream presubmits for flutter are the flutter- tryjobs.
  • Downstream presubmits are provided by the downstream team (that last one is mostly implied and requires further discussion).

athomas avatar Feb 22 '24 08:02 athomas