bazel-buildfarm
Question: How can I set up the remote executor timeout?
Hi guys,
This project is really good. I really love it. However, I do have a question.
I started one scheduler and one worker remotely. Normally everything works fine and I can run bazel build/test remotely. But if the worker goes down, the scheduler seems to enter an infinite loop where it throws the "no available workers" exception, and the Bazel client gets stuck too. So my question is: in this situation, how can the Bazel client stop the remote execution and return a result? I tried --test_timeout but it did not work.
Thanks
We're glad you like it.
Regarding "no available workers" - this is a response that gets sent as a part of a request to upload cas content to a shard cluster. Our choice to answer like this is based on a desire to avoid making a client wait forever for a situation to resolve itself that will likely require manual intervention.
We can only indirectly affect what the client will do in this circumstance, and bazel has chosen to loop on these kinds of responses. The following bazel options may still produce the 'stop and return a result' behavior you're interested in:
```
--[no]remote_local_fallback (a boolean; default: "false")
    Whether to fall back to standalone local execution strategy if remote
    execution fails.
--remote_local_fallback_strategy (a string; default: "local")
    No-op, deprecated. See https://github.com/bazelbuild/bazel/issues/7480
    for details.
```
Based on the copy in the second one, I'm not sure how effective it will be.
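For illustration, a minimal .bazelrc sketch that enables the fallback, assuming you are otherwise configured for remote execution (the scheduler endpoint here is a placeholder, not a real address):

```
# Hypothetical remote-execution setup; replace the endpoint with your scheduler.
build --remote_executor=grpc://scheduler.example.com:8980
# If the remote side fails, retry the action locally instead of
# waiting on the remote system.
build --remote_local_fallback
```

Whether this actually breaks the retry loop depends on how bazel classifies the UNAVAILABLE response, as discussed below.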
These edge cases are very much the bazel client's domain to resolve, not buildfarm's - clients should, imho, always be written to avoid getting stuck, handle failure, and expose flaws in the remote system, rather than the other way around.
@werkt we should actually look into this: if I remember correctly, when no workers are available temporarily but then come back online, the server does not recover. I haven't tested it recently, so maybe it's no longer the case, but it was in the past.
I would also like an option for buildfarm to request more workers from the cloud if none are available (for example, scale up an ASG).
We need a better consensus model if we're going to trigger events like that based on this - for instance, we will hit this particular scenario on every write, for every scheduler, for as long as the worker pool is empty, and we will not hit it at all if no writes occur. The condition itself (an empty worker pool) should be monitored, reported, and reacted to within a state processing system.
It's a big lift
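As a rough illustration of the kind of state processing loop described above (all names here are hypothetical, not existing buildfarm APIs): a dedicated monitor observes the worker set on its own schedule, so an empty pool is noticed exactly once per interval regardless of write traffic, and the reaction (metrics, alerting, scaling) lives in one place.

```java
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical monitor: polls the worker set instead of reacting inside
// individual write requests.
final class WorkerPoolMonitor {
  private final Supplier<Set<String>> workers; // e.g. backed by the backplane
  private final Runnable onEmptyPool;          // e.g. report a metric, scale an ASG
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  WorkerPoolMonitor(Supplier<Set<String>> workers, Runnable onEmptyPool) {
    this.workers = workers;
    this.onEmptyPool = onEmptyPool;
  }

  void start(Duration interval) {
    scheduler.scheduleAtFixedRate(
        () -> {
          if (workers.get().isEmpty()) {
            onEmptyPool.run();
          }
        },
        0, interval.toMillis(), TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```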
Do you think, at the very minimum, when this exception is thrown we should at least fail with an exception that allows local fallback (if enabled)? I don't remember whether it would currently trigger local execution, or whether bazel would just keep retrying and eventually fail.
> Do you think, at the very minimum, when this exception is thrown we should at least fail with an exception that allows local fallback (if enabled)?
That will already be the result of seeing the UNAVAILABLE (retryable) status response in this case, unless bazel has changed significantly.
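For reference, the status in question is produced server-side roughly like this - a simplified sketch of the pattern visible in the stack trace below (ShardInstance.getRandomWorker via Status.asRuntimeException), not the exact buildfarm code. gRPC clients conventionally treat UNAVAILABLE as retryable, which is why a client retries rather than failing outright:

```java
import io.grpc.Status;
import java.util.List;
import java.util.Random;

final class WorkerSelection {
  private static final Random RANDOM = new Random();

  // Sketch: when no worker can be selected, surface a retryable gRPC
  // status instead of blocking. UNAVAILABLE means "try again later".
  static String getRandomWorker(List<String> workers) {
    if (workers.isEmpty()) {
      throw Status.UNAVAILABLE
          .withDescription("no available workers")
          .asRuntimeException();
    }
    return workers.get(RANDOM.nextInt(workers.size()));
  }
}
```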
I was able to verify that when there are no workers (io.grpc.StatusRuntimeException: UNAVAILABLE: no available workers) the build is stuck. When a new worker is then introduced and is able to take on work, the build does not resume; it stays stuck and needs to be restarted. I would expect the build to resume once a new worker is introduced.
```
[SEVERE ] build.buildfarm.server.ByteStreamService queryWriteStatus - queryWriteStatus(uploads/9a042eca-7faa-4756-8c5f-ca4fc22f082a/blobs/15ee8509b163aaaadbfa8dc6235debbda9f18c8e/1395)
com.google.common.util.concurrent.UncheckedExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: no available workers
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
at com.google.common.cache.LocalCache.get(LocalCache.java:3962)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3985)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4946)
at build.buildfarm.instance.shard.Writes.get(Writes.java:145)
at build.buildfarm.instance.shard.ShardInstance.getBlobWrite(ShardInstance.java:1079)
at build.buildfarm.server.ByteStreamService.getUploadBlobWrite(ByteStreamService.java:407)
at build.buildfarm.server.ByteStreamService.getWrite(ByteStreamService.java:419)
at build.buildfarm.server.ByteStreamService.queryWriteStatus(ByteStreamService.java:339)
at com.google.bytestream.ByteStreamGrpc$MethodHandlers.invoke(ByteStreamGrpc.java:325)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at io.grpc.util.TransmitStatusRuntimeExceptionInterceptor$1.onHalfClose(TransmitStatusRuntimeExceptionInterceptor.java:74)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:797)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: no available workers
at io.grpc.Status.asRuntimeException(Status.java:526)
at build.buildfarm.instance.shard.ShardInstance.getRandomWorker(ShardInstance.java:1027)
at build.buildfarm.instance.shard.ShardInstance.writeInstanceSupplier(ShardInstance.java:1014)
at build.buildfarm.instance.shard.Writes$1.load(Writes.java:131)
at build.buildfarm.instance.shard.Writes$1.load(Writes.java:127)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
... 27 more
```
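The UncheckedExecutionException at the top of the trace is standard Guava LoadingCache behavior: a RuntimeException thrown by the loader is wrapped and rethrown to the caller, and the failed load is not cached, so a later access retries the loader. A minimal demonstration of that wrapping, independent of buildfarm:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.UncheckedExecutionException;

public final class LoaderFailureDemo {
  public static void main(String[] args) {
    LoadingCache<String, String> cache =
        CacheBuilder.newBuilder()
            .build(
                new CacheLoader<String, String>() {
                  @Override
                  public String load(String key) {
                    // Stand-in for the write supplier failing to pick a worker.
                    throw new IllegalStateException("no available workers");
                  }
                });
    try {
      cache.getUnchecked("upload-key");
    } catch (UncheckedExecutionException e) {
      // Matches the shape of the trace above: the loader's RuntimeException
      // is the cause; the entry is not retained, so the next call retries.
      System.out.println(e.getCause());
    }
  }
}
```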
Can we introduce an option that controls this behavior? In my case I would like bazel to fail instead of waiting forever.
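One shape such an option could take, purely as a sketch (the knob and its wiring are hypothetical, not an existing buildfarm option): make the status code for an empty worker pool configurable, since a non-retryable code such as FAILED_PRECONDITION should cause clients to give up quickly rather than retry, which may in turn let strategies like --remote_local_fallback kick in.

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

final class NoWorkersPolicy {
  // Hypothetical knob: choose whether "no available workers" is retryable.
  //   UNAVAILABLE         -> clients retry (the current behavior);
  //   FAILED_PRECONDITION -> clients fail fast instead of looping.
  static StatusRuntimeException noAvailableWorkers(boolean failFast) {
    Status base = failFast ? Status.FAILED_PRECONDITION : Status.UNAVAILABLE;
    return base.withDescription("no available workers").asRuntimeException();
  }
}
```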