RangeError: Maximum callstack size exceeded with many modules in single bundle
Bug Description
We are unable to open our full app in development, which runs a single bundle with over 40,000 modules. In production we are using Re.Pack which enables code-splitting and does not run into this problem.
In this PR the default JS register count was doubled - https://github.com/facebook/hermes/pull/923. Sadly we are hitting the same error again.
Note that the error is not reproducible using JSC in our reproduction repo (see below).
- [x] I have run
gradle cleanand confirmed this bug does not occur with JSC - [x] The issue is reproducible with the latest version of React Native.
Environment
Hermes git revision (if applicable): unknown, tied to RN versions below React Native version: 0.72.12 (our app), 0.74.2 (reproduction app) OS: OSX Platform (most likely one of arm64-v8a, armeabi-v7a, x86, x86_64): unknown, this occurs in the iOS simulator and android emulator
Steps To Reproduce
I've been able to mostly* reproduce it in this repo - https://github.com/mikeduminy/rn-reproducer.
*The error we're experiencing in our app is"RangeError: Maximum callstack size exceeded - no stack"whereas the error in this reproduction repo is simply"RangeError: Maximum callstack size exceeded"so I'm not 100% certain it has the same cause.
Steps to reproduce:
- Clone above repo and install dependencies
- Generate 50,000 modules to simulate a large app (note: no lazy imports)
- Run the app on iOS simulator and Android emulator
- Observe the red screen (note, you may need to close and re-open the app to see this)
The Expected Behavior
A unhandled JS Exception should be thrown but it would not be a call stack size exceeded error.
In order to test if the callstack resulted from executing the entrypoint module of the bundle (the last few lines of the bundle) I added a forced error just before that.
serializer: {
getRunModuleStatement: moduleId =>
// If we see this error, it means the module was fully parsed
// If we see another error, it means the problem occurred during the bundle parsing
`throw new Error("before module execution: ${JSON.stringify(
moduleId,
)}"); __r(${JSON.stringify(moduleId)});`,
},
If you see this error then hermes has successfully executed the bundle. If you see the max call stack error then this error was never even reached and that is the problem we are facing in our app - the bundle is not fully executed / loaded.
Hey @mikeduminy, could you share more about why you suspect a bug in Hermes? It is likely that loading the bundle just requires a lot of deep recursion, which exhausts Hermes' register stack. Release mode builds also enable optimisations, which means that you may not observe the bug in release mode because individual stack frames are smaller.
Some other things you could try:
- Increasing the register stack size further by configuring it in the Hermes
RuntimeConfig. - Creating and running a bundle locally with the
hermesCLI tool, which should give you better visibility into what is happening.
@neildhar I thought about recursion being the problem which is why I investigated the resulting bundle and found that metro is basically just declaring a bundle of modules (not executing them), then executing the entrypoint. For example:
(function(){ /* load metro runtime */ })()
(function(){ /* load react-native stuff */ })()
__d(function() { /* module code */ }, 0, [], "./index.js")
__d(function() { /* module code */ }, 1, [0], "./anotherModule.js")
/* ... more declaration calls */
throw new Error("before module execution: 0")
__r(0) // begin evaluating the modules (entrypoint)
The
__dfunction is fairly simple, just adding the unevaluated modules into a map. The code has not changed in over a year.
Since the execution of the bundle fails before the entrypoint is run it leads me to believe that the number of available registers is the problem - i.e. not a recursion problem, also the stack trace is empty. Since it works on JSC I would advocate for an increase in default allocated registers. We cannot easily increase this number as consumers of React Native.
In our app we can actually fix this problem by randomly commenting out imports until it works. It doesn't seem to matter what we comment out, just that the resulting modules falls below some unknown count.
Interesting, if there isn't significant recursion, that would suggest that the global function itself has such a large stack frame that it exhausts the stack. Have you been able to construct such a bundle that overflows in the Hermes CLI? That will make it much easier for us to investigate what might be going on.
That's interesting, maybe modules[moduleId] = mod; is overflowing the number of properties on a single object 🤔
I've been able to replicate the overflow in my repro app so using that to generate a bundle that fails is totally possible. Let me know if you want me to make it super easy to generate the bundle and I'll prep a script.
The other thing to note is the loaded bundle is not run through the hermes transform as it is failing in dev mode. I have not tested it in production but can do so and see if the problem persists.
Can you help me with the correct commands to run to execute the bundle in the hermes CLI?
We have instructions for building and running Hermes here.
Alternatively, you could try running against a precompiled copy of Hermes downloaded from the releases page (although they haven't been updated for some time).
Make sure that you run with -O0 (to disable optimisations) and potentially specify -max-num-registers to ensure that you're getting a representative run.
@mikeduminy it might also be very useful if you could upload somewhere the final JS bundle of the repro app (before it is compiled by Hermes). Then we can compile it and examine the generated bytecode.
(As a rule, it is very difficult for us to reproduce React Native builds, but having the final JS bundle goes a long way)
I have added both working and broken bundles for Android and iOS here. The difference between the working and broken bundles is a single additional module.
I also conducted some analysis of each bundle only to discover that the breaking point for both platforms is when the number of modules exceeds 50357.
- Android
- ✅ working bundle | (meta data) - 50357 modules
- ❌ broken bundle | (meta data) - 50358 modules
- iOS
- ✅ working bundle | (meta data) - 50357 modules
- ❌ broken bundle | (meta data) - 50358 modules
We have instructions for building and running Hermes here.
Alternatively, you could try running against a precompiled copy of Hermes downloaded from the releases page (although they haven't been updated for some time).
Make sure that you run with
-O0(to disable optimisations) and potentially specify-max-num-registersto ensure that you're getting a representative run.
Assuming you don't need me to do this anymore? Please let me know :)
@mikeduminy Not yet at least, hopefully the bytecode output of the bundle is enough to pinpoint what is going on.
I found the problem, the size of the global function is causing the register allocator to hit a memory limit and fall back to fast but very inefficient allocation. As a result the global function ends up singlehandedly overflowing the stack. We'll work on a fix.
Hey @neildhar, awesome news that you found the problem!
I've explored the world of register allocators a little bit which has been interesting. It looks like Hermes is assuming that the "fast pass" allocator will be able to handle anything under the fass pass threshold and anything above a memory limit
https://github.com/facebook/hermes/blob/cf111e67f958e5734fc2546e428e9c6238b73364/lib/BCGen/RegAlloc.cpp#L611-L612
The first condition looks sound and I assume that our bundle is above the threshold, but what is the purpose of the second condition? It seems like that might be what we are hitting in this case.
Since this particular case is only happening when evaluating JS (not hermes bytecode) and only happens during development, startup speed does not matter as much to us. However, if the goal is for Hermes to support loading JS bundles in production (like using CodePush without Hermes) then I can see it being difficult to find a balance.
The idea is that the threshold is configurable, so it can be adjusted to different levels in different cases. Unfortunately, it looks like we never actually did the adjustment.
what is the purpose of the second condition? It seems like that might be what we are hitting in this case.
@mikeduminy Yep, exactly, your bundle is hitting this case. In dev-mode, the compiler is usually running on a phone, where memory is relatively limited. The full register allocator can be quite memory intensive (more than linear in the size of the function), so we have this limit to avoid OOMing when we allocate (we have seen OOMs because of this in practice). The current limit is fairly conservative though, and as @tmikov said, it should be configurable depending on the use case.
That said, we have a fix in 6b69a06fc4a498ac467b590985515100c63313c3 that improves the code quality of the fast pass significantly. In the case of your bundle, it reduces the register usage for the global function from 402916 to 39. That fix has currently landed in the static_h branch, and we'll port it to main shortly.
@neildhar fantastic news! Once that lands in main can we get it backported to RN 72 or 73?
It has already landed in main in 20b13b96f49c6819f6090f17a5ec29d3faec45c6 and you can request a pick in the RN release discussions. The changes are relatively isolated so they should be safe to pick.
It is a bug fix, so RN should be willing to pick it.
Hey @neildhar @tmikov how far behind can we pick this? @mikeduminy is requesting a pick down to 0.72 which is unlikely to happen as it's end-of-life.
However the pick for 0.73 would mean applying those changes on top of the version of Hermes we picked in September 2023. I also attempted to pick https://github.com/facebook/hermes/commit/20b13b96f49c6819f6090f17a5ec29d3faec45c6 on top of
rn/0.73-stable and it merge conflicts on:
Unmerged paths:
(use "git add/rm <file>..." as appropriate to mark resolution)
deleted by us: test/Parser/flow/declare-module-import-allowed.js
Just confirming that those changes are safe and how should we resolve this conflict
@cortinico interesting. It appears that test/Parser/flow/declare-module-import-allowed.js was incorrectly included in the diff. @neildhar any idea why?
Seems like ignoring changes to that file should work?
The only change to that file is this added line: // Auto-generated content below. Please do not modify manually.
https://github.com/facebook/hermes/commit/20b13b96f49c6819f6090f17a5ec29d3faec45c6#diff-97c6e0145c7c283475c236bc7a3f9987026beff83bed44f8b4be9e5a7cf6c6b6R14
@tmikov it looks like that file was initially added with FileCheckOrRegen, but this is the first time we actually ran update-lit to regenerate tests, so it ended up getting swept up in this diff.