java.lang.OutOfMemoryError when upgrading from Bazel 7.0.2 to 7.1.1
Description of the bug:
Upgrading from Bazel 7.0.2 to 7.1.1 consistently resulted in an OOM exception being thrown during bazel query:
04:45:00 [Bazel] Loading: 15 packages loaded
04:45:02 [Bazel] Loading: 268 packages loaded
04:45:02 [Bazel] currently loading: bzl ... (1719 packages)
04:45:04 [Bazel] Loading: 1990 packages loaded
04:45:04 [Bazel] currently loading: @@bazel_tools//tools/jdk ... (12 packages)
04:45:05 [Bazel] Loading: 2067 packages loaded
04:45:05 [Bazel] currently loading: @@bazel_tools//tools/jdk ... (2 packages)
04:45:06 [Bazel] Loading: 2078 packages loaded
04:45:06 [Bazel] currently loading: @@remote_java_tools//java_tools/zlib
04:45:07 [Bazel] Loading: 2079 packages loaded
04:45:07 [Bazel] currently loading: @@remote_java_tools//java_tools/zlib
04:45:08 [Bazel] Loading: 2171 packages loaded
04:45:09 [Bazel] Loading: 2364 packages loaded
04:45:10 [Bazel] Loading: 2543 packages loaded
04:45:11 [Bazel] Loading: 2650 packages loaded
04:45:13 [Bazel] Loading: 2889 packages loaded
04:45:14 [Bazel] Loading: 2889 packages loaded
04:45:15 [Bazel] Loading: 2931 packages loaded
04:45:16 [Bazel] Loading: 3036 packages loaded
04:45:17 [Bazel] Loading: 3457 packages loaded
04:45:18 [Bazel] Loading: 3862 packages loaded
04:45:19 [Bazel] Loading: 4655 packages loaded
04:45:20 [Bazel] Loading: 5464 packages loaded
04:45:21 [Bazel] Loading: 6297 packages loaded
04:45:22 [Bazel] Loading: 7193 packages loaded
04:45:23 [Bazel] Loading: 7996 packages loaded
04:45:23 [Bazel] FATAL: bazel ran out of memory and crashed. Printing stack trace:
04:45:23 [Bazel] java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Unknown Source)
at java.base/java.lang.System$2.start(Unknown Source)
at java.base/jdk.internal.vm.SharedThreadContainer.start(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.createWorker(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.tryCompensate(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.compensatedBlock(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
at java.base/java.util.concurrent.SynchronousQueue$TransferStack.transfer(Unknown Source)
at java.base/java.util.concurrent.SynchronousQueue.take(Unknown Source)
at com.google.devtools.build.lib.bazel.repository.starlark.StarlarkRepositoryFunction.fetch(StarlarkRepositoryFunction.java:170)
at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.fetchRepository(RepositoryDelegatorFunction.java:418)
at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.compute(RepositoryDelegatorFunction.java:205)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
at java.base/java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(Unknown Source)
at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Which category does this issue belong to?
Core
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
No response
Which operating system are you running Bazel on?
linux
What is the output of bazel info release?
No response
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD ?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
A suggestion from the Bazel public Slack to disable the repo-fetching worker with --experimental_worker_for_repo_fetching=off resolved the issue.
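For reference, the workaround can be persisted so it applies to every invocation. A sketch of the .bazelrc entry (the `common` directive assumes Bazel 7.0+; on older versions, repeat it per command):

```
# .bazelrc -- workaround until the underlying bug is fixed
common --experimental_worker_for_repo_fetching=off
```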
Any other information, logs, or outputs that you want to share?
No response
@ivan-golub How many CPUs do you have on your machine? And how many repos are there that could potentially be fetched in parallel?
My suspicion: Loom doesn't use non-blocking file I/O yet and instead creates additional native threads when a virtual thread is blocked on file operations. If too many repos are blocked on them in parallel, this could run into the same thread limits as with a native thread pool. Hopefully we don't reach the OS limit and just need to tweak some native memory settings.
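The compensation behavior visible in the stack trace (tryCompensate → createWorker) can be demonstrated in isolation: when tasks block inside a ForkJoinPool via ForkJoinPool.managedBlock, the pool spawns extra native threads to preserve its parallelism, so its thread count can grow far beyond the configured value. A minimal standalone sketch (not Bazel code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;

public class CompensationDemo {
    // Blocks `tasks` ForkJoinPool tasks via managedBlock and returns the
    // pool's thread count while they are all blocked.
    public static int peakPoolSize(int parallelism, int tasks) throws InterruptedException {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch allBlocked = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                try {
                    // managedBlock tells the pool this worker is about to
                    // block, prompting it to create a compensation thread.
                    ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
                        private boolean done;
                        @Override
                        public boolean block() throws InterruptedException {
                            allBlocked.countDown();
                            release.await();
                            done = true;
                            return true;
                        }
                        @Override
                        public boolean isReleasable() {
                            return done;
                        }
                    });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        allBlocked.await();
        int poolSize = pool.getPoolSize();  // far exceeds `parallelism`
        release.countDown();
        pool.shutdown();
        return poolSize;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("poolSize with parallelism 4: " + peakPoolSize(4, 64));
    }
}
```

With 64 blocked tasks and parallelism 4, the pool ends up with dozens of native threads; scale that to thousands of blocked repo fetches and the process can hit the OS thread limit, matching the "unable to create native thread" error above.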
Thanks for the report. Some questions:
- Do you have Bzlmod enabled? (check your .bazelrc file for --noenable_bzlmod)
- Does the OOM happen with any query? Or what are the queries that cause this?
- Do you have a sense of how many external repos you have defined? (you could run something like bazel query //external:all-targets | wc -l)
How many CPUs do you have on your machine
32 cores
How many repos are there that could potentially be fetched in parallel?
bazel query //external:all-targets | wc -l
7131
Do you have Bzlmod enabled? (check your .bazelrc file for --noenable_bzlmod)
Bzlmod disabled
Does the OOM happen with any query?
It definitely happens with the wildcard query //... that we use for diff-aware builds.
@bazel-io fork 7.2.0
We (Figma) experienced this too.
- Bzlmod is not enabled (--noenable_bzlmod)
- OOM happens for us on any Bazel action that involves fetching a significant number of repositories (mostly bazel build rather than bazel query)
- bazel query //external:all-targets | wc -l: 20720
@ivan-golub @jfirebaugh Could you share at least a rough breakdown of which rulesets/repo rules contribute to this number of external repos?
Could you share at least a rough breakdown of which rulesets/repo rules contribute to this number of external repos?
It's an Android repo in our case, so androidx, android_tools, dagger, kotlin, jdk, Robolectric, some JetBrains libs, internal protobuf repos, and 1st/3rd-party libs.
Is this still on track for 7.2? We're aiming to create the first RC on 5/13.
This one is hard to pin down. We'd still like to fix it if we can get a hold of it, but it's possible that the fix will only be in time for a 7.2.1. Marked #21815 as a soft blocker.
I've also been experiencing this for a while now. In my case, building a Python zip after changing more than a couple of requirement versions in requirements.in and running a bazel build would spawn many, many python.pip_install.tools.wheel_installer.wheel_installer processes, eventually using 100% of the machine's memory and CPU.
Analyzing: target <redacted>_bin (125 packages loaded, 3295 targets configured)
[1 / 1] checking cached actions
Fetching repository @@rules_python~~pip~pip_312_starlette; starting 22s
Fetching repository @@rules_python~~pip~pip_312_requests; starting 22s
Fetching repository @@rules_python~~pip~pip_312_tenacity; starting 22s
Fetching repository @@rules_python~~pip~pip_312_pydantic_yaml; starting 22s
Fetching repository @@rules_python~~pip~pip_312_nest_asyncio; starting 22s
Fetching repository @@rules_python~~pip~pip_312_playwright_stealth; starting 22s
Fetching repository @@rules_python~~pip~pip_312_typer; starting 22s
Fetching repository @@rules_python~~pip~pip_312_uuid_utils; starting 22s ... (30 fetches)
Server terminated abruptly (error code: 14, error message: 'Connection reset by peer', log file: '/root/.cache/bazel/_bazel_root/b8ccf54d4a62f705275a4051f267d262/server/jvm.out')
I tried many different flags like --jobs=1 --local_resources=cpu=1.0 --local_resources=memory=256, but those have no effect on the repository-fetching code.
I was again trying to solve this issue today when I saw experimental_worker_for_repo_fetching in the Bazel changelog, and found the PR where it recently started defaulting to auto: https://github.com/bazelbuild/bazel/pull/21082
I tried running a bazel build with --experimental_worker_for_repo_fetching=off and that immediately fixed the issue! Memory consumption during repository fetching was minimal.
I hope other folks running into bazel build OOMs due to --experimental_worker_for_repo_fetching=auto also find this issue.
We also experienced this issue after updating Bazel from 6.4.0 to 7.1.2. We use rules_nixpkgs for external repositories. When I build a target with a lot of external dependencies, I see that Bazel spawns too many nix build processes (over 300), and our dev container crashes with an OOM. The problem is that the number of processes is now limited only by host resources (there are 256 cores on the host), whereas we previously used --loading_phase_threads to limit the number of spawned processes to match the resources allocated to a particular dev container. Now the value of --loading_phase_threads seems to be ignored for some reason.
I can confirm that I can't reproduce the issue with Bazel 7.0.2.
Is there an estimate for when it will be resolved?
UPD:
I tried running a bazel build with --experimental_worker_for_repo_fetching=off and that immediately fixed the issue! Memory consumption during repository fetching was minimal.
This fixed the issue for me as well. Thanks @mpereira! Now the maximum number of processes is 2x the value of --loading_phase_threads.
I tried to reproduce this for a while, both on synthetic and real-world projects, but I haven't observed any meaningful difference between 7.0.2 and 7.1.2. --loading_phase_threads is being honored in my experiments and I can have > 200,000 repos in a dependency chain on my laptop.
The flip of --experimental_worker_for_repo_fetching in 7.1.2 may result in more repo rules doing actual work in parallel than before simply because less time is spent in restarts, but the number of concurrent repository_ctx.execute calls should still be limited by --loading_phase_threads, even with Skyframe enabled.
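The limiting mechanism described here, a fixed number of permits for concurrent execute calls no matter how many repo fetches are in flight, can be sketched with a counting semaphore (an illustration of the idea only, not Bazel's actual implementation):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedExecutes {
    // Cap on concurrent "execute" calls, analogous to --loading_phase_threads.
    static final int LIMIT = 4;
    static final Semaphore permits = new Semaphore(LIMIT);
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger observedMax = new AtomicInteger();

    // Runs `work` only after acquiring a permit, so at most LIMIT
    // invocations ever run at the same time.
    static void execute(Runnable work) throws InterruptedException {
        permits.acquire();
        try {
            int now = inFlight.incrementAndGet();
            observedMax.accumulateAndGet(now, Math::max);
            work.run();
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    // Launches `tasks` threads (stand-ins for parallel repo fetches) and
    // returns the highest concurrency actually observed inside execute().
    public static int run(int tasks) throws InterruptedException {
        Thread[] threads = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            threads[i] = new Thread(() -> {
                try {
                    execute(() -> {
                        try { Thread.sleep(5); } catch (InterruptedException ignored) {}
                    });
                } catch (InterruptedException ignored) {}
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return observedMax.get();
    }
}
```

Even with 100 fetch threads in flight, the observed concurrency inside execute() never exceeds the 4 permits, which is the behavior the flag flip should not have changed.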
If you reported an issue in this thread, could you try running with 7.2.0rc2 and an explicit --loading_phase_threads value and then share a Starlark profile (can be emitted into the workspace directory with --profile)? A standalone reproducer would be ideal, but the profile would already be very helpful.
I found a clean reproducer on Slack (thanks @hugocarr): https://github.com/hugocarr/cloud_repro/tree/hugo/requirements_oom
I get an OOM with --loading_phase_threads=4 that I don't get with --loading_phase_threads=4 --experimental_worker_for_repo_fetching=off.
The profiles show that with off there are never more than 4 concurrent executes, while with auto there are hundreds.
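The concurrency figures above can be recovered from a profile by computing the peak overlap of the execute events' start/duration intervals. A minimal sweep-line sketch (standalone; assumes intervals already extracted from the JSON trace's ts/dur fields, not tied to Bazel's profile reader):

```java
import java.util.ArrayList;
import java.util.List;

public class PeakConcurrency {
    // events[i] = {start, duration}; returns the maximum number of
    // intervals that are ever active simultaneously.
    public static int peak(long[][] events) {
        List<long[]> points = new ArrayList<>();
        for (long[] e : events) {
            points.add(new long[] {e[0], +1});        // interval opens
            points.add(new long[] {e[0] + e[1], -1}); // interval closes
        }
        // Sort by time; at equal timestamps, process closings (-1) first
        // so back-to-back intervals don't count as overlapping.
        points.sort((a, b) -> a[0] != b[0]
                ? Long.compare(a[0], b[0])
                : Long.compare(a[1], b[1]));
        int cur = 0, max = 0;
        for (long[] p : points) {
            cur += (int) p[1];
            max = Math.max(max, cur);
        }
        return max;
    }
}
```

For example, intervals {0,10}, {5,10}, {20,5} overlap at most two at a time, while hundreds of overlapping fetch intervals in the failing profile would yield a correspondingly large peak.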
profile.fails.json
profile.works.json
@meteorcloudy @Wyverald
Good news: I tested the repro with 7.2.0rc3 and it seems that the issue is fixed there. I don't know why though. profile.rc3.json
Maybe somehow fixed by https://github.com/bazelbuild/bazel/pull/22573?
@ivan-golub Can you please also verify this issue no longer exists with 7.2.0rc3?
Just to reiterate: I was chatting with @fmeum in the Bazel Slack about an OOM we were experiencing during the fetch stage when running bazel build @pypi//... for a project with many 3rd-party Python dependencies.
It appears that upgrading from 7.1.2 to 7.2.0rc3 solves this issue for us. Memory stays stable. Not sure if it's directly causal, but I wanted to add a 👍 for #thissolvedmyproblem.
This is fixed in 7.2.0 for us too.
Thanks for the reports! I'll go ahead and close this for now. If new reports surface, we can revisit.