bazel-skylib icon indicating copy to clipboard operation
bazel-skylib copied to clipboard

NullPointerException with bazel@HEAD in CI

Open sgowroji opened this issue 1 year ago • 9 comments

CI: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3681#018de327-837e-4bb4-8986-ec35f627c4d9

Platform: Ubuntu

Logs:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'UnshareableActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//tests:subpackages_all_w_exclude, config=BuildConfigurationKey[86b97794d575f038822a36ec1392fdddcdcaebfcd711a3ac8088056fc5bb38f5]}, actionIndex=5}' (requested by nodes 'TestCompletionKey{configuredTargetKey=ConfiguredTargetKey{label=//tests:subpackages_all_w_exclude, config=BuildConfigurationKey[86b97794d575f038822a36ec1392fdddcdcaebfcd711a3ac8088056fc5bb38f5]}, topLevelArtifactContext=com.google.devtools.build.lib.analysis.TopLevelArtifactContext@90904c3b, exclusiveTesting=false}')
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:551)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
	at com.google.devtools.build.lib.vfs.Path.getRelative(Path.java:121)
	at com.google.devtools.build.lib.sandbox.SandboxStash.takeStashedSandboxInternal(SandboxStash.java:99)
	at com.google.devtools.build.lib.sandbox.SandboxStash.takeStashedSandbox(SandboxStash.java:73)
	at com.google.devtools.build.lib.sandbox.SymlinkedSandboxedSpawn.filterInputsAndDirsToCreate(SymlinkedSandboxedSpawn.java:78)
	at com.google.devtools.build.lib.sandbox.AbstractContainerizingSandboxedSpawn.createFileSystem(AbstractContainerizingSandboxedSpawn.java:135)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:146)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:113)
	at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:456)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:160)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.runTestAttempt(StandaloneTestStrategy.java:652)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy.beginTestAttempt(StandaloneTestStrategy.java:315)
	at com.google.devtools.build.lib.exec.StandaloneTestStrategy$StandaloneTestRunnerSpawn.execute(StandaloneTestStrategy.java:577)
	at com.google.devtools.build.lib.analysis.test.TestRunnerAction.executeAllAttempts(TestRunnerAction.java:1162)
	at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:974)
	at com.google.devtools.build.lib.analysis.test.TestRunnerAction.execute(TestRunnerAction.java:951)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1140)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1053)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:166)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:95)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:547)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:903)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:349)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:191)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:462)
	... 7 more

Steps:

git clone https://github.com/bazelbuild/bazel-skylib
git reset e60cf0001fdc1500cf5ba2a6bf4ca0b46a1f725f  --hard
export USE_BAZEL_VERSION=9d34f8ab0f1ffb18900feaeb23cb16c93f4e0139
bazel test  //... @bazel_skylib_gazelle_plugin//... 

CC Greenteam @meteorcloudy

sgowroji avatar Feb 26 '24 05:02 sgowroji

Looks like this was caused by https://cs.opensource.google/bazel/bazel/+/e9022f6731b4f62d3a08bdc4eacce70ad28e3c78.

Cc @oquenchil

fmeum avatar Feb 26 '24 07:02 fmeum

Taking a look.

oquenchil avatar Feb 26 '24 08:02 oquenchil

I'm a bit puzzled about why this happens. I can't reproduce locally with 100000 runs per test, I don't see a race condition and even if the variables that I take from the environment TEST_WORKSPACE and TEST_SRCDIR were null, the string placed inside the map wouldn't be null.

The only thing I can think of right now is adding more logging to detect if we put/get null from the concurrent hash map, then log all the environment variables in the environment and all the entries in the map.

We'd check that in and if we see the exception again use the new info to find out what's going on.

@fmeum wdyt?

oquenchil avatar Feb 26 '24 10:02 oquenchil

The renameTo call isn't guaranteed to be atomic in all filesystem setups, so couldn't (at least in theory) two test actions both enter the isTestAction branch in takeStashedSandboxInternal for the same stashed sandbox? That would explain the issue.

But yes, more logging would certainly help.

fmeum avatar Feb 26 '24 12:02 fmeum

I thought it was always atomic within the same filesystem. In which filesystem types isn't it atomic?

oquenchil avatar Feb 26 '24 13:02 oquenchil

I thought it was always atomic within the same filesystem. In which filesystem types isn't it atomic?

That does seem to be true for the UnixFileSystem, but the JavaIoFileSystem may not be as File#renameTo isn't guaranteed to be atomic according to its javadocs and is used here: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/vfs/JavaIoFileSystem.java;l=318;drc=9d34f8ab0f1ffb18900feaeb23cb16c93f4e0139

It shouldn't be used on Ubuntu unless there is some other bug, so you are right that this isn't likely to be the cause here.

fmeum avatar Feb 26 '24 13:02 fmeum

I think that there is a race, I'm just not sure whether it can arise in practice: If spawn #1 blocks on the ConcurrentHashMap#put call in https://cs.opensource.google/bazel/bazel/+/f05c9d0b8d32d29847d5b16af1e5f8c20d11f66d:src/main/java/com/google/devtools/build/lib/sandbox/SandboxStash.java;l=160, then it will have already moved its sandbox dir into the stash. Spawn #2 could discover and move this directory before the put returns and would then run into an NPE when querying the map.

fmeum avatar Feb 26 '24 13:02 fmeum

Ah yes, but would we have seen this error? From the error I'd say there was an entry in the map but it was actually null somehow.

Apart from logging the environment variables and the contents of the map, can you think of anything else that would be useful to log?

oquenchil avatar Feb 26 '24 20:02 oquenchil

Ah yes, but would we have seen this error? From the error I'd say there was an entry in the map but it was actually null somehow.

Map#get also returns null if the key isn't in the map. I think that's more likely than the value actually being null.

Apart from logging the environment variables and the contents of the map, can you think of anything else that would be useful to log?

The result of the readdir of the stashes directory would also be interesting.

fmeum avatar Feb 26 '24 20:02 fmeum

Wonderful, thanks!

oquenchil avatar Feb 27 '24 11:02 oquenchil

Downstream CI of bazel-skylib is Green now https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3689#018df801-66e8-42eb-8fcf-b047fad7a8cd. We can close this issue.

sgowroji avatar Mar 01 '24 05:03 sgowroji