java.lang.OutOfMemoryError when upgrading from Bazel 7.0.2 to 7.1.1
Description of the bug:
Upgrading from Bazel 7.0.2 to 7.1.1 consistently resulted in an OOM exception being thrown during bazel query:
04:45:00 [Bazel] Loading: 15 packages loaded
04:45:02 [Bazel] Loading: 268 packages loaded
04:45:02 [Bazel] currently loading: bzl ... (1719 packages)
04:45:04 [Bazel] Loading: 1990 packages loaded
04:45:04 [Bazel] currently loading: @@bazel_tools//tools/jdk ... (12 packages)
04:45:05 [Bazel] Loading: 2067 packages loaded
04:45:05 [Bazel] currently loading: @@bazel_tools//tools/jdk ... (2 packages)
04:45:06 [Bazel] Loading: 2078 packages loaded
04:45:06 [Bazel] currently loading: @@remote_java_tools//java_tools/zlib
04:45:07 [Bazel] Loading: 2079 packages loaded
04:45:07 [Bazel] currently loading: @@remote_java_tools//java_tools/zlib
04:45:08 [Bazel] Loading: 2171 packages loaded
04:45:09 [Bazel] Loading: 2364 packages loaded
04:45:10 [Bazel] Loading: 2543 packages loaded
04:45:11 [Bazel] Loading: 2650 packages loaded
04:45:13 [Bazel] Loading: 2889 packages loaded
04:45:14 [Bazel] Loading: 2889 packages loaded
04:45:15 [Bazel] Loading: 2931 packages loaded
04:45:16 [Bazel] Loading: 3036 packages loaded
04:45:17 [Bazel] Loading: 3457 packages loaded
04:45:18 [Bazel] Loading: 3862 packages loaded
04:45:19 [Bazel] Loading: 4655 packages loaded
04:45:20 [Bazel] Loading: 5464 packages loaded
04:45:21 [Bazel] Loading: 6297 packages loaded
04:45:22 [Bazel] Loading: 7193 packages loaded
04:45:23 [Bazel] Loading: 7996 packages loaded
04:45:23 [Bazel] FATAL: bazel ran out of memory and crashed. Printing stack trace:
04:45:23 [Bazel] java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Unknown Source)
at java.base/java.lang.System$2.start(Unknown Source)
at java.base/jdk.internal.vm.SharedThreadContainer.start(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.createWorker(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.tryCompensate(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.compensatedBlock(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
at java.base/java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
at java.base/java.util.concurrent.SynchronousQueue.take(Unknown Source)
at com.google.devtools.build.lib.bazel.repository.starlark.StarlarkRepositoryFunction.fetch(StarlarkRepositoryFunction.java:170)
at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.fetchRepository(RepositoryDelegatorFunction.java:418)
at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.compute(RepositoryDelegatorFunction.java:205)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(Unknown Source)
at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Which category does this issue belong to?
Core
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
No response
Which operating system are you running Bazel on?
linux
What is the output of bazel info release?
No response
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD ?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
A suggestion from the Bazel public Slack to disable the repo-fetching worker with --experimental_worker_for_repo_fetching=off resolved the issue.
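For reference, the workaround can be persisted so it applies to every invocation. A sketch of the .bazelrc entry (the `common` directive assumes Bazel 7.0+; on older versions, repeat it per command):

```
# .bazelrc -- workaround until the underlying bug is fixed
common --experimental_worker_for_repo_fetching=off
```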
Any other information, logs, or outputs that you want to share?
No response
@ivan-golub How many CPUs do you have on your machine? And how many repos are there that could potentially be fetched in parallel?
My suspicion: Loom doesn't use non-blocking file I/O yet and instead creates additional native threads when a virtual thread is blocked on file operations. If too many repos are blocked on them in parallel, this could run into the same thread limits as with a native thread pool. Hopefully we don't reach the OS limit and just need to tweak some native memory settings.
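The compensation behavior visible in the stack trace (tryCompensate → createWorker) can be demonstrated in isolation: when tasks block inside a ForkJoinPool via ForkJoinPool.managedBlock, the pool spawns extra native threads to preserve its parallelism, so its thread count can grow far beyond the configured value. A minimal standalone sketch (not Bazel code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;

public class CompensationDemo {
    // Blocks `tasks` ForkJoinPool tasks via managedBlock and returns the
    // pool's thread count while they are all blocked.
    public static int peakPoolSize(int parallelism, int tasks) throws InterruptedException {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch allBlocked = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                try {
                    // managedBlock tells the pool this worker is about to
                    // block, prompting it to create a compensation thread.
                    ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
                        private boolean done;
                        @Override
                        public boolean block() throws InterruptedException {
                            allBlocked.countDown();
                            release.await();
                            done = true;
                            return true;
                        }
                        @Override
                        public boolean isReleasable() {
                            return done;
                        }
                    });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        allBlocked.await();
        int poolSize = pool.getPoolSize();  // far exceeds `parallelism`
        release.countDown();
        pool.shutdown();
        return poolSize;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("poolSize with parallelism 4: " + peakPoolSize(4, 64));
    }
}
```

With 64 blocked tasks and parallelism 4, the pool ends up with dozens of native threads; scale that to thousands of blocked repo fetches and the process can hit the OS thread limit, matching the "unable to create native thread" error above.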
Thanks for the report. Some questions:
- Do you have Bzlmod enabled? (check your .bazelrc file for --noenable_bzlmod)
- Does the OOM happen with any query? Or what are the queries that cause this?
- Do you have a sense of how many external repos you have defined? (you could run something like bazel query //external:all-targets | wc -l)
How many CPUs do you have on your machine
32 cores
How many repos are there that could potentially be fetched in parallel?
bazel query //external:all-targets | wc -l
7131
Do you have Bzlmod enabled? (check your .bazelrc file for --noenable_bzlmod)
Bzlmod disabled
Does the OOM happen with any query?
It definitely happens with the wildcard query //... that we use for diff-aware builds.
@bazel-io fork 7.2.0
We (Figma) experienced this too.
- Bzlmod is not enabled (--noenable_bzlmod)
- OOM happens for us on any Bazel action that involves fetching a significant number of repositories (mostly bazel build rather than bazel query)
- bazel query //external:all-targets | wc -l: 20720
@ivan-golub @jfirebaugh Could you share at least a rough breakdown of which rulesets/repo rules contribute to this number of external repos?
Could you share at least a rough breakdown of which rulesets/repo rules contribute to this number of external repos?
It's an Android repo in our case, so androidx, android_tools, dagger, kotlin, jdk, Robolectric, some JetBrains libs, internal protobuf repos, and 1st/3rd-party libs.
Is this still on track for 7.2? We're aiming to create the first RC on 5/13.
This one is hard to pin down. We'd still like to fix it if we can get a hold of it, but it's possible that the fix will only be in time for a 7.2.1. Marked #21815 as a soft blocker.
I've also been experiencing this for a while now. In my case, building a Python zip after changing more than a couple of requirement versions in requirements.in and running a bazel build would spawn many, many python.pip_install.tools.wheel_installer.wheel_installer processes, eventually using 100% of the machine's memory and CPU.
Analyzing: target <redacted>_bin (125 packages loaded, 3295 targets configured)
[1 / 1] checking cached actions
Fetching repository @@rules_python~~pip~pip_312_starlette; starting 22s
Fetching repository @@rules_python~~pip~pip_312_requests; starting 22s
Fetching repository @@rules_python~~pip~pip_312_tenacity; starting 22s
Fetching repository @@rules_python~~pip~pip_312_pydantic_yaml; starting 22s
Fetching repository @@rules_python~~pip~pip_312_nest_asyncio; starting 22s
Fetching repository @@rules_python~~pip~pip_312_playwright_stealth; starting 22s
Fetching repository @@rules_python~~pip~pip_312_typer; starting 22s
Fetching repository @@rules_python~~pip~pip_312_uuid_utils; starting 22s ... (30 fetches)
Server terminated abruptly (error code: 14, error message: 'Connection reset by peer', log file: '/root/.cache/bazel/_bazel_root/b8ccf54d4a62f705275a4051f267d262/server/jvm.out')
I tried many different flags like --jobs=1 --local_resources=cpu=1.0 --local_resources=memory=256, but those have no effect on the repository-fetching code.
I was again trying to solve this issue today when I saw experimental_worker_for_repo_fetching in the Bazel changelog, and found the PR where it recently started defaulting to auto: https://github.com/bazelbuild/bazel/pull/21082
I tried running a bazel build with --experimental_worker_for_repo_fetching=off and that immediately fixed the issue! Memory consumption during repository fetching was minimal.
I hope other folks running into bazel build OOMs due to --experimental_worker_for_repo_fetching=auto also find this issue.
We also experienced this issue after updating Bazel from 6.4.0 to 7.1.2. We use rules_nixpkgs for external repositories. When I build a target with a lot of external dependencies, I see that Bazel spawns too many nix build processes (over 300), and our dev container crashes with an OOM. The problem is that the number of processes is now limited only by host resources (there are 256 cores on the host), whereas we previously used --loading_phase_threads to limit the number of spawned processes to match the resources allocated to a particular dev container. Now the value of --loading_phase_threads seems to be ignored for some reason.
I can confirm that I can't reproduce the issue with Bazel 7.0.2.
Is there an estimate for when it will be resolved?
UPD:
I tried running a bazel build with --experimental_worker_for_repo_fetching=off and that immediately fixed the issue! Memory consumption during repository fetching was minimal.
This fixed the issue for me as well. Thanks @mpereira! Now the maximum number of processes is 2x the value of --loading_phase_threads.
I tried to reproduce this for a while, both on synthetic and real-world projects, but I haven't observed any meaningful difference between 7.0.2 and 7.1.2. --loading_phase_threads is being honored in my experiments and I can have > 200,000 repos in a dependency chain on my laptop.
The flip of --experimental_worker_for_repo_fetching in 7.1.2 may result in more repo rules doing actual work in parallel than before simply because less time is spent in restarts, but the number of concurrent repository_ctx.execute calls should still be limited by --loading_phase_threads, even with Skyframe enabled.
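The limiting mechanism described here, a fixed number of permits for concurrent execute calls no matter how many repo fetches are in flight, can be sketched with a counting semaphore (an illustration of the idea only, not Bazel's actual implementation):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedExecutes {
    // Cap on concurrent "execute" calls, analogous to --loading_phase_threads.
    static final int LIMIT = 4;
    static final Semaphore permits = new Semaphore(LIMIT);
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger observedMax = new AtomicInteger();

    // Runs `work` only after acquiring a permit, so at most LIMIT
    // invocations ever run at the same time.
    static void execute(Runnable work) throws InterruptedException {
        permits.acquire();
        try {
            int now = inFlight.incrementAndGet();
            observedMax.accumulateAndGet(now, Math::max);
            work.run();
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    // Launches `tasks` threads (stand-ins for parallel repo fetches) and
    // returns the highest concurrency actually observed inside execute().
    public static int run(int tasks) throws InterruptedException {
        Thread[] threads = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            threads[i] = new Thread(() -> {
                try {
                    execute(() -> {
                        try { Thread.sleep(5); } catch (InterruptedException ignored) {}
                    });
                } catch (InterruptedException ignored) {}
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return observedMax.get();
    }
}
```

Even with 100 fetch threads in flight, the observed concurrency inside execute() never exceeds the 4 permits, which is the behavior the flag flip should not have changed.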
If you reported an issue in this thread, could you try running with 7.2.0rc2 and an explicit --loading_phase_threads value and then share a Starlark profile (can be emitted into the workspace directory with --profile)? A standalone reproducer would be ideal, but the profile would already be very helpful.
I found a clean reproducer on Slack (thanks @hugocarr): https://github.com/hugocarr/cloud_repro/tree/hugo/requirements_oom
I get an OOM with --loading_phase_threads=4 that I don't get with --loading_phase_threads=4 --experimental_worker_for_repo_fetching=off.
The profiles show that with off there are never more than 4 concurrent executes, while with auto there are hundreds.
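The concurrency figures above can be recovered from a profile by computing the peak overlap of the execute events' start/duration intervals. A minimal sweep-line sketch (standalone; assumes intervals already extracted from the JSON trace's ts/dur fields, not tied to Bazel's profile reader):

```java
import java.util.ArrayList;
import java.util.List;

public class PeakConcurrency {
    // events[i] = {start, duration}; returns the maximum number of
    // intervals that are ever active simultaneously.
    public static int peak(long[][] events) {
        List<long[]> points = new ArrayList<>();
        for (long[] e : events) {
            points.add(new long[] {e[0], +1});        // interval opens
            points.add(new long[] {e[0] + e[1], -1}); // interval closes
        }
        // Sort by time; at equal timestamps, process closings (-1) first
        // so back-to-back intervals don't count as overlapping.
        points.sort((a, b) -> a[0] != b[0]
                ? Long.compare(a[0], b[0])
                : Long.compare(a[1], b[1]));
        int cur = 0, max = 0;
        for (long[] p : points) {
            cur += (int) p[1];
            max = Math.max(max, cur);
        }
        return max;
    }
}
```

For example, intervals {0,10}, {5,10}, {20,5} overlap at most two at a time, while hundreds of overlapping fetch intervals in the failing profile would yield a correspondingly large peak.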
profile.fails.json
profile.works.json
@meteorcloudy @Wyverald
Good news: I tested the repro with 7.2.0rc3 and it seems that the issue is fixed there. I don't know why though. profile.rc3.json
Maybe somehow fixed by https://github.com/bazelbuild/bazel/pull/22573?
@ivan-golub Can you please also verify this issue no longer exists with 7.2.0rc3?
Just to reiterate: I was chatting with @fmeum in the Bazel Slack about an OOM we were experiencing during the fetch stage when running bazel build @pypi//... for a project with many 3rd-party Python dependencies.
It appears that upgrading from 7.1.2 to 7.2.0rc3 solves this issue for us. Memory stays stable. Not sure if it's directly causal, but I wanted to add a 👍 for #thissolvedmyproblem.
This is fixed in 7.2.0 for us too.
Thanks for the reports! I'll go ahead and close this for now. If new reports surface, we can revisit.