bazel-buildfarm
Question: How can I set up the remote executor timeout?
Hi guys,
This project is really good. I really love it. However, I do have a question.
I started one scheduler and one worker remotely. Normally everything works fine and I can run bazel build/test remotely. But if the worker goes down, the scheduler seems to enter an infinite loop where it throws the "no available workers" exception, and the Bazel client gets stuck too. So my question is: in this situation, how can the Bazel client stop the remote execution and return a result? I tried --test_timeout but it did not work.
Thanks
We're glad you like it.
Regarding "no available workers" - this is a response that gets sent as a part of a request to upload cas content to a shard cluster. Our choice to answer like this is based on a desire to avoid making a client wait forever for a situation to resolve itself that will likely require manual intervention.
We can only indirectly affect what the client will do in this circumstance, and bazel has chosen to loop on these kinds of responses. The following bazel options may still produce the 'stop and return a result' behavior you're interested in:
```
--[no]remote_local_fallback (a boolean; default: "false")
    Whether to fall back to standalone local execution strategy if remote
    execution fails.
--remote_local_fallback_strategy (a string; default: "local")
    No-op, deprecated. See https://github.com/bazelbuild/bazel/issues/7480
    for details.
```
Based on the copy in the second one, I'm not sure how effective it will be.
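For illustration, a minimal .bazelrc sketch that enables the fallback, assuming you are otherwise configured for remote execution (the scheduler endpoint here is a placeholder, not a real address):

```
# Hypothetical remote-execution setup; replace the endpoint with your scheduler.
build --remote_executor=grpc://scheduler.example.com:8980
# If the remote side fails, retry the action locally instead of
# waiting on the remote system.
build --remote_local_fallback
```

Whether this actually breaks the retry loop depends on how bazel classifies the UNAVAILABLE response, as discussed below.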
These edge cases are very much the bazel client's domain to resolve, not buildfarm's - clients should, imho, always be written to avoid getting stuck, handle failure, and expose flaws in the remote system, rather than the other way around.
@werkt we should actually look into this: if I remember correctly, when no workers are available temporarily but then come back online, the server does not recover. I haven't tested it recently, so maybe it's no longer the case, but it was in the past.
I would also like an option for buildfarm to request more workers from the cloud if none are available (for example, scale up an ASG).
We need a better consensus model if we're going to trigger events like that based on this - for instance, we will hit this particular scenario on every write, for every scheduler, for as long as the worker pool is empty, and we will not hit it at all if no writes occur. The condition itself (an empty worker pool) should be monitored, reported, and reacted to within a state processing system.
It's a big lift
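As a rough illustration of the kind of state processing loop described above (all names here are hypothetical, not existing buildfarm APIs): a dedicated monitor observes the worker set on its own schedule, so an empty pool is noticed exactly once per interval regardless of write traffic, and the reaction (metrics, alerting, scaling) lives in one place.

```java
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical monitor: polls the worker set instead of reacting inside
// individual write requests.
final class WorkerPoolMonitor {
  private final Supplier<Set<String>> workers; // e.g. backed by the backplane
  private final Runnable onEmptyPool;          // e.g. report a metric, scale an ASG
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  WorkerPoolMonitor(Supplier<Set<String>> workers, Runnable onEmptyPool) {
    this.workers = workers;
    this.onEmptyPool = onEmptyPool;
  }

  void start(Duration interval) {
    scheduler.scheduleAtFixedRate(
        () -> {
          if (workers.get().isEmpty()) {
            onEmptyPool.run();
          }
        },
        0, interval.toMillis(), TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```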
Do you think, at the very minimum, when this exception is thrown we should at least fail with an exception that allows local fallback (if enabled)? I don't remember whether it would currently trigger local execution, or whether bazel would just keep retrying and eventually fail.
> Do you think, at the very minimum, when this exception is thrown we should at least fail with an exception that allows local fallback (if enabled)?
That will already be the result of seeing the UNAVAILABLE (retryable) status response in this case, unless bazel has changed significantly.
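For reference, the status in question is produced server-side roughly like this - a simplified sketch of the pattern visible in the stack trace below (ShardInstance.getRandomWorker via Status.asRuntimeException), not the exact buildfarm code. gRPC clients conventionally treat UNAVAILABLE as retryable, which is why a client retries rather than failing outright:

```java
import io.grpc.Status;
import java.util.List;
import java.util.Random;

final class WorkerSelection {
  private static final Random RANDOM = new Random();

  // Sketch: when no worker can be selected, surface a retryable gRPC
  // status instead of blocking. UNAVAILABLE means "try again later".
  static String getRandomWorker(List<String> workers) {
    if (workers.isEmpty()) {
      throw Status.UNAVAILABLE
          .withDescription("no available workers")
          .asRuntimeException();
    }
    return workers.get(RANDOM.nextInt(workers.size()));
  }
}
```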
I was able to verify that when there are no workers (io.grpc.StatusRuntimeException: UNAVAILABLE: no available workers) the build is stuck. When a new worker is then introduced and is able to take on work, the build does not resume; it stays stuck and needs to be restarted. I would expect the build to resume once a new worker is introduced.
```
[SEVERE ] build.buildfarm.server.ByteStreamService queryWriteStatus - queryWriteStatus(uploads/9a042eca-7faa-4756-8c5f-ca4fc22f082a/blobs/15ee8509b163aaaadbfa8dc6235debbda9f18c8e/1395)
com.google.common.util.concurrent.UncheckedExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: no available workers
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
at com.google.common.cache.LocalCache.get(LocalCache.java:3962)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3985)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4946)
at build.buildfarm.instance.shard.Writes.get(Writes.java:145)
at build.buildfarm.instance.shard.ShardInstance.getBlobWrite(ShardInstance.java:1079)
at build.buildfarm.server.ByteStreamService.getUploadBlobWrite(ByteStreamService.java:407)
at build.buildfarm.server.ByteStreamService.getWrite(ByteStreamService.java:419)
at build.buildfarm.server.ByteStreamService.queryWriteStatus(ByteStreamService.java:339)
at com.google.bytestream.ByteStreamGrpc$MethodHandlers.invoke(ByteStreamGrpc.java:325)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at io.grpc.util.TransmitStatusRuntimeExceptionInterceptor$1.onHalfClose(TransmitStatusRuntimeExceptionInterceptor.java:74)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:797)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: no available workers
at io.grpc.Status.asRuntimeException(Status.java:526)
at build.buildfarm.instance.shard.ShardInstance.getRandomWorker(ShardInstance.java:1027)
at build.buildfarm.instance.shard.ShardInstance.writeInstanceSupplier(ShardInstance.java:1014)
at build.buildfarm.instance.shard.Writes$1.load(Writes.java:131)
at build.buildfarm.instance.shard.Writes$1.load(Writes.java:127)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
... 27 more
```
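The UncheckedExecutionException at the top of the trace is standard Guava LoadingCache behavior: a RuntimeException thrown by the loader is wrapped and rethrown to the caller, and the failed load is not cached, so a later access retries the loader. A minimal demonstration of that wrapping, independent of buildfarm:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.UncheckedExecutionException;

public final class LoaderFailureDemo {
  public static void main(String[] args) {
    LoadingCache<String, String> cache =
        CacheBuilder.newBuilder()
            .build(
                new CacheLoader<String, String>() {
                  @Override
                  public String load(String key) {
                    // Stand-in for the write supplier failing to pick a worker.
                    throw new IllegalStateException("no available workers");
                  }
                });
    try {
      cache.getUnchecked("upload-key");
    } catch (UncheckedExecutionException e) {
      // Matches the shape of the trace above: the loader's RuntimeException
      // is the cause; the entry is not retained, so the next call retries.
      System.out.println(e.getCause());
    }
  }
}
```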
Can we introduce an option that controls this behavior? In my case I would like bazel to fail instead of waiting forever.
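One shape such an option could take, purely as a sketch (the knob and its wiring are hypothetical, not an existing buildfarm option): make the status code for an empty worker pool configurable, since a non-retryable code such as FAILED_PRECONDITION should cause clients to give up quickly rather than retry, which may in turn let strategies like --remote_local_fallback kick in.

```java
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

final class NoWorkersPolicy {
  // Hypothetical knob: choose whether "no available workers" is retryable.
  //   UNAVAILABLE         -> clients retry (the current behavior);
  //   FAILED_PRECONDITION -> clients fail fast instead of looping.
  static StatusRuntimeException noAvailableWorkers(boolean failFast) {
    Status base = failFast ? Status.FAILED_PRECONDITION : Status.UNAVAILABLE;
    return base.withDescription("no available workers").asRuntimeException();
  }
}
```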