Build fails on Bazel 7.0 when remote_download_toplevel flag is enabled

Open sanju-naik opened this issue 1 year ago • 8 comments

After upgrading to Bazel 7.0.0 and enabling the remote_download_toplevel flag, our builds fail intermittently while downloading cached artifacts from the remote cache.

The two errors we get are:

Exec failed due to IOException: Connection reset
Exec failed due to IOException: null

There are no other details in the log. Other things we noticed:

  • This happens when artifacts are 100% cached, i.e. everything is downloaded from the cache.
  • When the job fails, the module shown as downloading at the end of the logs is always the same. We are not sure whether that module has anything to do with the failure.

sanju-naik avatar Jan 31 '24 15:01 sanju-naik

Are there any relevant errors or warnings in the bazel-remote log when this occurs?

mostynb avatar Feb 01 '24 21:02 mostynb

Today when one of our jobs failed, I got this error log in the job. Does this help in any way to debug this issue?

---8<---8<--- Exception details ---8<---8<---
java.io.IOException: Failed to read @-argument 'bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params' from file '/private/var/tmp/_bazel_runner/55c1db80066b6bd30a81b2a1c9b5244e/execroot/__main__/bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params'.
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.expandArgument(WorkerSpawnRunner.java:315)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.createWorkRequest(WorkerSpawnRunner.java:246)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.execInWorker(WorkerSpawnRunner.java:416)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.exec(WorkerSpawnRunner.java:206)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
	at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:261)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1148)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1065)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:165)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:94)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:562)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:859)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:333)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:171)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.io.FileNotFoundException: /private/var/tmp/_bazel_runner/55c1db80066b6bd30a81b2a1c9b5244e/execroot/__main__/bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params (No such file or directory)
	at java.base/java.io.FileInputStream.open0(Native Method)
	at java.base/java.io.FileInputStream.open(Unknown Source)
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at com.google.devtools.build.lib.unix.UnixFileSystem.createFileInputStream(UnixFileSystem.java:497)
	at com.google.devtools.build.lib.vfs.AbstractFileSystem.createMaybeProfiledInputStream(AbstractFileSystem.java:90)
	at com.google.devtools.build.lib.vfs.AbstractFileSystem.getInputStream(AbstractFileSystem.java:59)
	at com.google.devtools.build.lib.vfs.Path.getInputStream(Path.java:765)
	at com.google.devtools.build.lib.vfs.FileSystemUtils$1.openStream(FileSystemUtils.java:354)
	at com.google.common.io.ByteSource$AsCharSource.openStream(ByteSource.java:474)
	at com.google.common.io.CharSource.openBufferedStream(CharSource.java:126)
	at com.google.common.io.CharSource.readLines(CharSource.java:336)
	at com.google.devtools.build.lib.vfs.FileSystemUtils.readLines(FileSystemUtils.java:834)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.expandArgument(WorkerSpawnRunner.java:310)
	... 23 more
---8<---8<--- End of exception details ---8<---8<---

sanju-naik avatar Feb 07 '24 06:02 sanju-naik

I don't know bazel internals, but this stack trace looks like a failure while trying to execute the action on the client side. Have you tried reporting this error to the bazel project?

mostynb avatar Feb 08 '24 19:02 mostynb

Also, I think the bazel-remote logs would be important to check here- are there any warnings or errors there?

mostynb avatar Feb 08 '24 19:02 mostynb

> Also, I think the bazel-remote logs would be important to check here- are there any warnings or errors there?

We are seeing these failures on our scheduled pipelines, and most of the time the jobs fail at night. The next day I have a hard time collecting logs from bazel-remote: it logs every event to the log file, so by the time I check there are a lot of logs and I can't figure out which ones belong to these jobs.

Is there a quick way to get logs associated with a particular job?

sanju-naik avatar Feb 09 '24 11:02 sanju-naik

Also, we are still on version 2.3.9. Have any fixes related to Bazel 7 been added in the latest releases?

sanju-naik avatar Feb 09 '24 11:02 sanju-naik

> Also, I think the bazel-remote logs would be important to check here- are there any warnings or errors there?
>
> We are seeing these failures on our scheduled pipelines and most of the time these jobs fail during night, and the next day I have a hard time collecting logs from bazel-remote because it keeps logging every event to the log file so by the time I check there are a lot of logs & couldn't figure out the ones specific to these jobs.
>
> Is there a quick way to get logs associated with a particular job?

I think it depends a bit on the logging options that you are using. If you have timestamps enabled you can jump to a time just before the error and scan from there. Alternatively if you have access logs enabled you might be able to search for a blob or ActionResult hash from the error (if you have something like that in the bazel logs). Or maybe you could just grep the bazel-remote logs for "error" or "warning" (ignoring case) and see if there's anything interesting.
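For anyone wanting to script the timestamp-and-keyword approach described above, here is a minimal sketch. It assumes each log line starts with a timestamp of the form `YYYY/MM/DD HH:MM:SS`, which may not match your logging options, so adjust `TS_FORMAT` and `TS_LEN` accordingly. The names `parse_ts` and `filter_log` are made up for this example and are not part of bazel-remote.

```python
import re
from datetime import datetime

# ASSUMPTION: lines begin with a timestamp like "2024/02/07 06:01:15".
# Change these two constants if your bazel-remote log format differs.
TS_FORMAT = "%Y/%m/%d %H:%M:%S"
TS_LEN = 19

# Case-insensitive match for the keywords suggested in the thread.
KEYWORDS = re.compile(r"error|warning", re.IGNORECASE)


def parse_ts(line):
    """Return the datetime at the start of the line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LEN], TS_FORMAT)
    except ValueError:
        return None


def filter_log(lines, start=None, end=None):
    """Yield lines that mention error/warning, optionally within [start, end].

    When a time bound is given, lines without a parseable timestamp are skipped.
    """
    for line in lines:
        ts = parse_ts(line)
        if start is not None and (ts is None or ts < start):
            continue
        if end is not None and (ts is None or ts > end):
            continue
        if KEYWORDS.search(line):
            yield line.rstrip("\n")
```

To use it, open the log file and pass the file object as `lines`, with `start` and `end` set to `datetime` values a few minutes either side of the time the job failed.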

> Also we are still on version 2.3.9. Have we added any fixes related to Bazel 7 in the latest releases?

The releases page has a high-level changelog: https://github.com/buchgr/bazel-remote/releases - but I don't think there are any changes specifically related to bazel 7.

mostynb avatar Feb 11 '24 12:02 mostynb

We currently run many Bazel 7.0.0 builds with remote_download_toplevel each day against a bazel-remote cache without problems. The "IOException: Connection reset" error suggests the connection was dropped. Do you use HTTP(S) or GRPC(S) for the cache URL in Bazel? Is there a proxy between your Bazel clients and the bazel-remote server (even on the same machine)?

liam-baker-sm avatar Mar 19 '24 02:03 liam-baker-sm