fix: Investigate remaining reports of iOS slowness after patching.

Open eseidel opened this issue 10 months ago • 18 comments

Shorebird on iOS works differently from all other platforms. On iOS, due to store restrictions, any new code sent to devices must be interpreted. Interpreters are slow, and Shorebird's is no exception. However, what makes Shorebird fast after patching (even on iOS) is that we do some extra work to determine (at a per-function level) where the changes in your program occurred, and we run any unchanged code on the CPU (at full speed) rather than in our interpreter. However, the part of our system that makes this determination (we call it the "linker") has historically sometimes gotten confused and decided that more of your program changed than actually did. The reasons why are complicated, but stem from the fact that Dart's compiler is a "chaotic system" at times: small changes to input can produce large changes to output. I gave a talk that covers this a bit here: https://docs.google.com/presentation/d/1mnh8XrB1JKlhhCMWufS4vbRJobepN08KoeXTcziRkXc/edit#slide=id.g31edb94da79_0_148
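
As a rough illustration of the per-function idea described above, here is a toy sketch in Python. It is not Shorebird's linker (which works on compiled Dart snapshots); the function names, byte strings, and the `split_functions` and `link_percentage` helpers are all made up for the example. It only shows the principle: a function whose compiled bytes are identical in the base release and the patch can keep running natively, anything else falls back to the interpreter.

```python
from hashlib import sha1


def split_functions(base: dict[str, bytes], patch: dict[str, bytes]):
    """Split patch functions into (linked, interpreted) name lists.

    Toy rule (an assumption for illustration): a function is "linked" when
    its compiled bytes hash the same in the base release and in the patch.
    """
    linked, interpreted = [], []
    for name, code in patch.items():
        unchanged = name in base and sha1(base[name]).digest() == sha1(code).digest()
        (linked if unchanged else interpreted).append(name)
    return linked, interpreted


def link_percentage(patch: dict[str, bytes], linked: list[str]) -> float:
    """Share of patch code (by size) that can run natively."""
    total = sum(len(code) for code in patch.values())
    linked_bytes = sum(len(patch[name]) for name in linked)
    return 100.0 * linked_bytes / total if total else 100.0


# Hypothetical "compiled" functions: only `build` differs between releases.
base = {"main": b"\x01\x02\x03\x04", "build": b"\x10\x11\x12\x13\x14\x15", "helper": b"\x20\x21"}
patch = {"main": b"\x01\x02\x03\x04", "build": b"\x10\x11\x99\x13\x14\x15", "helper": b"\x20\x21"}

linked, interpreted = split_functions(base, patch)
print(linked, interpreted, link_percentage(patch, linked))
```

In this toy model one edited function costs exactly its own size. The problem this issue tracks is the case where a small source change makes the compiler emit different bytes for many functions that were never edited.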

We spent months in 2024 removing almost(!) all cases of this confusion, but there are still a couple more. That work is tracked in https://github.com/shorebirdtech/shorebird/issues/1892. We've had 3 reports of slowdowns on iOS in 2025. This bug tracks resolving those.

If you see a slowdown on iOS, the best way to help is to share the patch-debug.zip file created during shorebird patch ios. If you're using CodeMagic, this file is saved in the build archives for that patch run. If you're using another setup, you'll need to collect the file manually.

The two patch-debug.zip files I've seen in 2025 have the following reasons:

One:

📊 Linker Stats
--------------------------------------------------------------------
🔎 Diffing: before -> after.optimized
📦 Patch Code: 27.87 MB  
🔗 Link Percentage: 66.81%
❌ Failed to Link 24391 (9.25 MB) / 111321 (27.87 MB) objects (33.19%)

Class Table: 0 classes added
Dispatch Table: 0 selectors added
Object Pool: 0 objects added
  Total: 167284
  Backfilled: 7653 (6.87%)
  Missing Mappings: 617

--------------------------------------------------------------------
🔗 Unlinked Sources by Reason (12587)
--------------------------------------------------------------------
-7.39% Directly Modified (353)
-9.59% Object Pool Load (7184)
-2.20% Unknown (260)
-10.04% Load Static Field Address (915)
-3.75% Dispatch Table Call (3875)

Two:

📊 Linker Stats
--------------------------------------------------------------------
🔎 Diffing: before -> after.optimized
📦 Patch Code: 11.25 MB  
🔗 Link Percentage: 36.98%
❌ Failed to Link 17510 (7.09 MB) / 45692 (11.25 MB) objects (63.02%)

Class Table: 5 classes added
Dispatch Table: 0 selectors added
Object Pool: 62 objects added
  Total: 62462
  Backfilled: 1426 (3.12%)
  Missing Mappings: 444
--------------------------------------------------------------------
🔗 Unlinked Sources by Reason (9039)
--------------------------------------------------------------------
-58.13% Dispatch Table Call (7912)
-1.22% Directly Modified (188)
-2.49% Load Static Field Address (182)
-1.11% Object Pool Load (684)
-0.02% Unknown (73)
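
For readers comparing their own reports: in the two reports above, the "Link Percentage" line matches what you get from the reported code sizes rather than from the object counts. The short Python check below reproduces it. The helper name `link_percentage_from_sizes` is made up for this example, and the formula is inferred from the numbers in the reports, not taken from Shorebird's source.

```python
def link_percentage_from_sizes(failed_mb: float, total_mb: float) -> float:
    """Percentage of patch code (by size) that linked successfully."""
    return 100.0 * (1.0 - failed_mb / total_mb)


# (failed-to-link MB, total patch code MB) copied from reports One and Two.
reports = {"One": (9.25, 27.87), "Two": (7.09, 11.25)}

for name, (failed_mb, total_mb) in reports.items():
    print(f"{name}: {link_percentage_from_sizes(failed_mb, total_mb):.2f}% linked")
# One: 66.81% linked
# Two: 36.98% linked
```

Because the sizes in the log are rounded to two decimals, a value recomputed this way can differ from the logged percentage in the last digit.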

eseidel avatar Mar 13 '25 16:03 eseidel

I'm working on reason one (the "Load Static Field Address" causes) right now. I have a half-completed patch I expect we'll ship sometime next week. (I'm a bit busy the next two days preparing a conference talk and handling other non-coding responsibilities).

eseidel avatar Mar 13 '25 16:03 eseidel

Got one more report this week via email:

📊 Linker Stats
--------------------------------------------------------------------
🔎 Diffing: before -> after.optimized
📦 Patch Code: 14.15 MB  
🔗 Link Percentage: 83.66%
❌ Failed to Link 7002 (2.31 MB) / 49907 (14.15 MB) objects (16.34%)

Class Table: 0 classes added
Dispatch Table: 0 selectors added
Object Pool: 0 objects added
  Total: 151433
  Backfilled: 9 (0.02%)
  Missing Mappings: 3
--------------------------------------------------------------------
🔗 Unlinked Sources by Reason (2163)
--------------------------------------------------------------------
-14.57% Object Pool Load (1044)
-1.51% Dispatch Table Call (1082)
-0.20% Unknown (34)
-0.00% Load Static Field Address (3)

It looks like all the trouble came from a single function?

--------------------------------------------------------------------
🔗 2163 Unlinked Leaf Objects caused -16.28%
--------------------------------------------------------------------
-9.63%: null (ff0ff62b), reason: Object Pool Load
-0.73%: null (578765d6), reason: Object Pool Load
...

The single function looks like a late final static initializer:

<details>
<summary>null (ff0ff62b)</summary>
Comparing closest of 40511 matches in the base snapshot.

**hash**
```diff
- 7386077bb6e51c23bbb97c91a7c7f9576fff9e01
+ ff0ff62b097d462c9646260cbfe520c4adbdec8f

```

**instructions**
```diff

 stp fp, lr, [sp, #-16]!
 mov fp, sp
 sub sp, sp, #0x20
 ldr tmp, [thr, #56]
 cmp sp, tmp
 bls +100 // 0xf2e4 null (62060) + #30
 ldr r0, [thr, #96]
 ldr r0, [r0, #808]
 ldr tmp, [pp, #64]
 nop
 nop
 cmp r0, tmp
 bne +20 // 0xf2b0 null (62060) + #17
- ldr r2, [pp, #30168]
+ ldr r2, [pp, #29904]
 nop
 nop
- bl +14711588 // 0xe16dd0 [Stub] _iso_stub_InitLateFinalStaticFieldStub (14773712) + #0
+ bl +14711180 // 0xe16c38 [Stub] _iso_stub_InitLateFinalStaticFieldStub (14773304) + #0
 ldr tmp, [fp, #32]
 stp tmp, r0, [sp, #16]
 ldr tmp, [fp, #24]
 ldr lr, [fp, #16]
 stp lr, tmp, [sp, #0]
 ldr r4, [pp, #2328]
 nop
 nop
 ldr r2, [r0, #55]
 blr r2
 mov sp, fp
 ldp fp, lr, [sp], #16 !
 ret
- bl +14719872 // 0xe18e64 [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub (14782052) + #0
+ bl +14719464 // 0xe18ccc [Stub] _iso_stub_StackOverflowSharedWithoutFPURegsStub (14781644) + #0
 b -100 // 0xf284 null (62060) + #6
```
</details>
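
The changed lines in the diff above fall into two groups: loads from the object pool (`[pp, #...]`) whose offset moved, and `bl` calls whose relative target moved. A small Python sketch of how one might sort changed instruction pairs into those groups is below. This is only an illustration, not part of Shorebird's diagnostic tooling; the `classify_change` helper and its category names are assumptions.

```python
import re

# Object pool loads look like `ldr r2, [pp, #30168]`.
POOL_LOAD = re.compile(r"\[pp, #\d+\]")
# Relative branches look like `bl +14711588` or `b -100`.
BRANCH = re.compile(r"^(b[a-z.]*)\s+[+-]\d+")


def strip_comment(instr: str) -> str:
    """Drop the trailing `// ...` annotation the disassembler adds."""
    return instr.split("//")[0].strip()


def classify_change(before: str, after: str) -> str:
    """Classify how two disassembled instructions differ."""
    b, a = strip_comment(before), strip_comment(after)
    if b == a:
        return "same"
    if POOL_LOAD.sub("[pp, #_]", b) == POOL_LOAD.sub("[pp, #_]", a):
        return "object pool offset"
    if BRANCH.sub(r"\1 _", b) == BRANCH.sub(r"\1 _", a):
        return "branch target"
    return "other"


print(classify_change("ldr r2, [pp, #30168]", "ldr r2, [pp, #29904]"))
print(classify_change(
    "bl +14711588 // 0xe16dd0 [Stub] _iso_stub_InitLateFinalStaticFieldStub (14773712) + #0",
    "bl +14711180 // 0xe16c38 [Stub] _iso_stub_InitLateFinalStaticFieldStub (14773304) + #0",
))
```

Applied to the function above, every changed line lands in one of the first two categories, so the instructions themselves are the same and only the offsets moved.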

eseidel avatar Mar 17 '25 20:03 eseidel

I spent the week working on case one (field table). I'm very close, but not yet ready. I have to work on other things tomorrow morning so I'll probably get back to this tomorrow afternoon, which means we won't ship until early next week at the earliest.

After case one is solved I can reevaluate cases two and three.

eseidel avatar Mar 21 '25 02:03 eseidel

I have a working fix for the field_table issue. However it doesn't work quite well enough to ship yet. Debugging the last 🤞 issue and hopefully shipping tomorrow.

eseidel avatar Mar 24 '25 22:03 eseidel

Last piece for part one (field table) is staged and should go out in a release tomorrow morning: https://github.com/shorebirdtech/shorebird/pull/3008

eseidel avatar Mar 26 '25 03:03 eseidel

The "field table" fix is out. It resolves all known previous problems whereby changes to class-level static variables or file-level variables could cause unexpectedly "low link percentages" (slowness) on iOS. To try the fix, you'll need the latest shorebird with Flutter 3.29.2 or later.

I plan to take a look at the other two cases discussed above shortly.

eseidel avatar Mar 26 '25 18:03 eseidel

I've paused this for the moment. We have one more cause of occasional slowness (dispatch table changes), for which we already ship a partial fix. Clearly I will eventually need to make a larger fix there, but first I'm going to check again what our reported rates of hitting this are in the wild.

eseidel avatar Apr 04 '25 17:04 eseidel

Breaking out active work (of a new cause) into https://github.com/shorebirdtech/shorebird/issues/3060.

eseidel avatar Apr 16 '25 22:04 eseidel

I just got a 3rd report of low-link relating to object pool causes. I have an in-progress patch, work is being tracked on #3060.

eseidel avatar Apr 18 '25 21:04 eseidel

As of Shorebird 1.6.35 (Flutter 3.29.3) we now collect better logging in cases where there is low link percentage. If any of you have experienced a slow patch on iOS, I would encourage you to try again with the latest Shorebird and share your patch-debug.zip if you're able. Thank you!

eseidel avatar Apr 23 '25 17:04 eseidel

We also improved our internal stats reporting. 3.29.2 shows the following:

{
  "flutter_revision": "3f9cefb45389b72ff073ddf305fe0939f822143b",
  "avg_link_percentage": "86.883420689811615",
  "p75": "98.322136929328011",
  "p50": "95.825656433950329",
  "p10": "61.074955717803604",
  "p1": "17.850275120587614"
},

P75 means "75% of values in the data are below this, 25% above".

So P50 = 95% is OK, 50% of patches have a good experience, but P10 = 61% means 1 in 10 patches have a bad experience and P1 = 17% means 1 in 100 patches has an abysmal experience on iOS. So we have some work to do, mostly to bring in these outliers.
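
To make the percentile reading concrete, here is a minimal Python sketch using the nearest-rank method. The sample link percentages are invented for illustration (they are not Shorebird data), and the stats pipeline mentioned above may compute percentiles differently, for example with interpolation.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical link percentages for ten patches.
link_percentages = [12.0, 55.0, 61.0, 88.0, 93.0, 95.0, 96.0, 97.0, 98.0, 99.0]

for p in (75, 50, 10):
    print(f"p{p}: {percentile(link_percentages, p)}")
# p75: 97.0
# p50: 93.0
# p10: 12.0
```

With this sample, the median patch links well, while the p10 value shows that the worst tenth of patches links very little. That matches the shape of the real numbers above: a good typical case with a bad tail.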

Once we finish wiring up these stats tomorrow I may start reaching out to customers who have seen a bad experience on iOS (based on this data) on the most recent releases and see if I can learn more from them.

eseidel avatar Apr 23 '25 22:04 eseidel

Felix made a fancy dashboard today for this data. It does look like I may have introduced a regression with my most recent changes to the field table sorting, which is probably what we're seeing in https://github.com/shorebirdtech/shorebird/issues/3060 since I don't recall seeing reports like that before recently.

[Image: dashboard of the link percentage data]

eseidel avatar Apr 25 '25 22:04 eseidel

Regardless, I'm working on a fix for next week which should improve these numbers across the board.

eseidel avatar Apr 25 '25 22:04 eseidel

I have a fix prepared, just working to get it out. Should ship tomorrow.

eseidel avatar May 01 '25 01:05 eseidel

Update:

  • I wrote a tool to find reproductions, so now we have unlimited reproductions of this issue.
  • I've spent a bunch of time looking at failures from the wild (and my tool) and my current guess is inlining is at fault, but I still need to learn more.
  • I've otherwise spent my coding time fixing issues in our diagnostic tooling (which has been neglected previously since customers don't use it).

This is my primary technical focus (when I'm not doing my other non-technical responsibilities). I'll endeavor to post updates more frequently going forward or folks can always get all updates from the company every day on our Discord: https://discord.com/channels/1030243211995791380/1080902010154537010

eseidel avatar May 07 '25 15:05 eseidel

TL;DR we're hard at work on this even though I've not been posting updates. See our #standup channel in https://discord.gg/shorebird for daily updates.

Longer story: We have everyone working on this issue at the moment. Current status is:

  • Bryan is fixing our internal pipeline metrics to make sure we have good data from the field about how often customers are hitting this. Our current understanding is "low link" slowness only affects about 1 in every 10 patches. Still way too high a number, but we're tracking it and should have much better metrics in the next day or so.
  • Felix is working on a specific example we got from a customer where removing a Widget from the Flutter hierarchy causes some large sections of the Flutter framework itself to "unlink". I flew out to help him with this last week, and he's close to having something to land that will hopefully be a large improvement in the wild (both for that one specific customer and for all customers).
  • I've been working with Felix on his fix as well as improving our debugging tools. We explored a bunch of dead ends to get where we are today. We do now have a tool for reproductions (as in my previous update) which works well. It turns out inlining accounts for only a small percentage of the failures, as does TFA (type flow analysis). Right now we're looking at object pool sorting improvements.

eseidel avatar Jun 10 '25 16:06 eseidel

@bryanoltman has updated our data pipeline and we finally have new data from the field: [image]

Good news is that the best-case continues to improve. Unfortunately the worst-case (which is what this bug is about) is still not where we'd like it to be.

Felix and I continue to make progress on our tooling.

eseidel avatar Jun 10 '25 19:06 eseidel

We're releasing a fix today (requires a new release with 3.32.7 or later) which greatly improves startup time for very large applications. It turns out we had some unnecessary slowness in loading patches for very large applications which we had thought were linking problems but were not. Anyway, those are fixed in 3.32.7 and later and should be live shortly.

eseidel avatar Jul 25 '25 18:07 eseidel