bazel icon indicating copy to clipboard operation
bazel copied to clipboard

Bazel Hangs in the Analysis Phase

Open Louai-Abdelsalam opened this issue 1 year ago • 9 comments

Description of the bug:

Sometimes Bazel just hangs during the Analysis phase (I say the analysis phase because while digging around I noticed bazel starts this function but never completes it). This doesn't happen all the time, but seems to inevitably happen eventually on a non-empty cache. Once it happens, it doesn't go away until I do a bazel clean; until the problem eventually resurfaces again anyway and I have to do another clean.

While bazel clean is in effect a workaround, it's incredibly inconvenient having to rebuild on an empty cache as it takes a long time.

Trying to debug this, I noticed that the heap fills up a lot faster than during a non-problematic build. In fact so much so that garbage collection seems to kick in a lot more frequently that during a non-problematic build, to the point where the analysis profile's bazel memory usage essentially looks like a stretched out zig zag (I'll attach the analysis profile). Though eventually it just plateaus while garbage collection frequency just goes through the roof, in what I can only interpret as an ugly higher-frequency tug-of-war between GC and whatever's flooding the heap. Even the java.log file seems in sync with this as it keeps on logging the following on repeat (ish):

231215 21:54:05.863:I 11 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=85 threshold=85 [CONTEXT ratelimit_period="10 SECONDS" ]
231215 21:54:12.183:I 11 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 739811504
231215 21:54:43.193:I 11 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=85 threshold=85 [CONTEXT ratelimit_period="10 SECONDS [skipped: 13]" ]
231215 21:54:49.720:I 11 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 720854168
231215 21:55:27.194:I 11 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=85 threshold=85 [CONTEXT ratelimit_period="10 SECONDS [skipped: 14]" ]
231215 21:55:32.777:I 11 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 716715296
231215 21:56:07.493:I 11 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=85 threshold=85 [CONTEXT ratelimit_period="10 SECONDS [skipped: 11]" ]
231215 21:56:13.471:I 11 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 710394040
231215 21:56:56.136:I 11 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=85 threshold=85 [CONTEXT ratelimit_period="10 SECONDS [skipped: 14]" ]
231215 21:57:05.708:I 11 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 3163893016
231215 21:57:11.948:I 11 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=85 threshold=85 [CONTEXT ratelimit_period="10 SECONDS [skipped: 17]" ]
231215 21:57:22.789:I 11 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=99 threshold=85 [CONTEXT ratelimit_period="10 SECONDS [skipped: 17]" ]
231215 21:57:22.790:I 11 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 3905076312
231215 21:57:28.868:I 11 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 4000388808
231215 21:57:35.689:I 11 [com.google.devtools.build.lib.skyframe.HighWaterMarkLimiter.handle] Dropping unnecessary temporary state in response to memory pressure. actual=99 threshold=85 [CONTEXT ratelimit_period="10 SECONDS [skipped: 8]" ]
231215 21:57:35.692:I 11 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 4020709664
231215 21:57:44.162:I 11 [com.google.devtools.build.lib.metrics.PostGCMemoryUseRecorder.handleNotification] Memory use after full GC: 4050623752

Now it's also worth pointing out that usually I just CTRL+C bazel when it hangs, but one time I left it on for a long time, and eventually it crashed over an OOM error:

18:36:52.106:WT 280 [com.google.devtools.build.lib.concurrent.AbstractQueueVisitor.maybeSaveUnhandledThrowable] Found critical error in queue visitor
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.util.Arrays.copyOf(Unknown Source)
	at java.base/java.util.ArrayList.grow(Unknown Source)
	at java.base/java.util.ArrayList.grow(Unknown Source)
	at java.base/java.util.ArrayList.add(Unknown Source)
	at java.base/java.util.ArrayList.add(Unknown Source)
	at com.google.devtools.build.lib.skyframe.FileFunction.checkAndNotePathSeenDuringPartialResolution(FileFunction.java:278)
	at com.google.devtools.build.lib.skyframe.FileFunction.resolveFromAncestorsWithParent(FileFunction.java:198)
	at com.google.devtools.build.lib.skyframe.FileFunction.resolveFromAncestors(FileFunction.java:171)
	at com.google.devtools.build.lib.skyframe.FileFunction.compute(FileFunction.java:107)
	at com.google.devtools.build.lib.skyframe.FileFunction.compute(FileFunction.java:52)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:571)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:382)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)

Anyway, so while all these symptoms were relatively easy to pick up on, the cause hasn't been.

I tried looking through a couple jvm thread dumps, but I couldn't really conclude much there. I'll attach one on this issue though in case I missed anything.

I also tried using the latest 6.x version of bazel and tried using java 11, 15, and 17, but none of that made any difference.

I even tried building bazel from source while throwing in a bunch of printouts to kind of track where the issue was, but I got no where that mattered. I did reach some minor conclusions though, and can elaborate on those if needed, but it just felt like I was reverse engineering from the ground up, which in turn felt pretty ineffective.

Anyway, I know that so far I've been alluding to a memory leakage of some kind as the cause, but I'm happy to provide additional information if you guys think it's something else. Just let me know what you need.

Which category does this issue belong to?

Performance

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

There isn't. At least, not any that I've figured out yet. The problem just always seem to resurface eventually on a non-empty cache, until I bazel clean, after which it's gone, only to eventually resurface again. I couldn't figure out a pattern to when it shows up either, as we (me and my coworkers) seem to experience it while doing completely different things, and it even takes different durations of time to pop up for each of us.

Which operating system are you running Bazel on?

Ubuntu

What is the output of bazel info release?

release 6.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

N/A

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

I assume this also depends on using a development version, in which case: N/A

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No. The closest I've found were heap related issues but those were from bazel versions older than 1.0.0. I also found a more recently reported similar issue, but it was in the context of BWOB, but we're not using BWOB (not locally anyway).

Any other information, logs, or outputs that you want to share?

not-working.profile.gz jvm.txt

image

Louai-Abdelsalam avatar Dec 15 '23 23:12 Louai-Abdelsalam

What's your heap size? Have you tried increasing it?

meisterT avatar Dec 20 '23 07:12 meisterT

What's your heap size? Have you tried increasing it?

@meisterT I have tried increasing it yes. I think the most I've tried was 8 gigs? it just ends up filling it all up regardless of whatever I set the heap to. It's not the heap ceiling that's the problem, I don't think. Memory usage is usually somewhere between 4-5 gigs when things are working fine, so the fact that an 8 gig heap filled up when bazel hangs makes me think it's probably not the heap size that's the cause.

Louai-Abdelsalam avatar Dec 22 '23 20:12 Louai-Abdelsalam

Are you able to provide a repro?

tjgq avatar Jan 02 '24 10:01 tjgq

@tjgq I am not, unfortunately. I haven't figured out what's triggering the issue. It's why I provided as much info as I could otherwise.

I did keep the problematic copy of the repo however (i.e I never bazel cleaned it), in case you want me to try something specific to provide more information.

Louai-Abdelsalam avatar Jan 03 '24 01:01 Louai-Abdelsalam

@Louai-Abdelsalam did this build ever succeed (with a previous version of Bazel or a previous version of the repository)?

meisterT avatar Jan 05 '24 10:01 meisterT

@Louai-Abdelsalam did this build ever succeed (with a previous version of Bazel or a previous version of the repository)?

@meisterT Yep. During all instances of me and my coworkers running into this issue, all the targets that were being built had succeeded in the past (on both our current bazel version and on previous ones), and resume succeeding after bazel clean is run. Not sure what you mean by the repository version though? if you simply mean "throughout the history of the repository" then yes too, though the fact that this is implied makes me think this isn't what you mean..

Louai-Abdelsalam avatar Jan 05 '24 12:01 Louai-Abdelsalam

Thanks, one idea after looking at the stack trace again in combination with the ~170s glob in the JSON trace: is it possible that you have a recursive glob somewhere combined with a symlink that points from the source tree to a strange location, such as the output base?

meisterT avatar Jan 09 '24 09:01 meisterT

Thanks, one idea after looking at the stack trace again in combination with the ~170s glob in the JSON trace: is it possible that you have a recursive glob somewhere combined with a symlink that points from the source tree to a strange location, such as the output base?

@meisterT Interesting idea. I'll take a look and get back to you. I'm currently in the middle of a crunch however, so it may be a couple days before I do.

Just to confirm I understand though, you are saying that there might be a recursive glob somewhere that picks up a source symlink file that may in turn be pointing at a strange location. Correct?

Louai-Abdelsalam avatar Jan 09 '24 17:01 Louai-Abdelsalam

Sorry, I missed your reply. Yes, that is exactly what I am saying and have been seeing before with a similar behavior.

meisterT avatar Feb 19 '24 13:02 meisterT