bazel-buildfarm
log messages `WARNING: write: uploads/ ... no available workers`.
I tried to build Android AOSP with bazel-buildfarm. My buildfarm-server sporadically logs warnings with exceptions.
I want to know whether these logs can be ignored. If not, I would appreciate some hints on how to resolve them.
I installed via Helm, using the image bazelbuild/buildfarm-server:2.8.0.
logs
Apr 26, 2024 1:55:53 AM build.buildfarm.common.services.WriteStreamObserver logWriteRequest
WARNING: write: uploads/590d39fa-1ccc-4552-8791-99ebefde29e3/blobs/4580bba951dc0ea2672338f623f83a9d665fb72caf97cb3f41530719a24f1bd8/322, 322 bytes, finish_write
com.google.common.util.concurrent.UncheckedExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: no available workers
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2086)
    at com.google.common.cache.LocalCache.get(LocalCache.java:4012)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4035)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5011)
    at build.buildfarm.instance.shard.Writes.get(Writes.java:151)
    at build.buildfarm.instance.shard.ServerInstance.getBlobWrite(ServerInstance.java:1295)
    at build.buildfarm.common.services.ByteStreamService.getUploadBlobWrite(ByteStreamService.java:434)
    at build.buildfarm.common.services.WriteStreamObserver.getWrite(WriteStreamObserver.java:137)
    at build.buildfarm.common.services.WriteStreamObserver.initialize(WriteStreamObserver.java:224)
    at build.buildfarm.common.services.WriteStreamObserver.onUncommittedNext(WriteStreamObserver.java:127)
    at build.buildfarm.common.services.WriteStreamObserver.onNext(WriteStreamObserver.java:106)
    at build.buildfarm.common.services.WriteStreamObserver.onNext(WriteStreamObserver.java:56)
    at io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
    at io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
    at io.grpc.util.TransmitStatusRuntimeExceptionInterceptor$1.onMessage(TransmitStatusRuntimeExceptionInterceptor.java:65)
    at io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
    at io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:324)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:309)
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: no available workers
    at io.grpc.Status.asRuntimeException(Status.java:530)
    at build.buildfarm.instance.shard.ServerInstance.getRandomWorker(ServerInstance.java:1246)
    at build.buildfarm.instance.shard.ServerInstance.writeInstanceSupplier(ServerInstance.java:1234)
    at build.buildfarm.instance.shard.Writes$1.load(Writes.java:132)
    at build.buildfarm.instance.shard.Writes$1.load(Writes.java:128)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3571)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2313)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2190)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2080)
Your worker must be dying at some point and deregistering/being expired from the backplane.
Present some logs from the worker, or verify why it is dropping off the map.
Thank you for your response. I couldn't find any logs related to this issue on either of my 2 workers. I also checked the k8s pods, and neither the workers nor Redis appear to restart.
How can I change the log-level for workers?
The error originates from this line: https://github.com/bazelbuild/bazel-buildfarm/blob/6f3fbe48517f34cacaf3f7fb79686b86e786c6ba/src/main/java/build/buildfarm/instance/shard/ServerInstance.java#L1300
At the time it is thrown, no workers are present in the server's view of the Redis hash key "Workers_storage".
If you poll the values of that key in Redis, you should see the registered workers update their entries frequently (every 10s), each time pushing the expiration to 30s in the future. If a worker is substantially oversubscribed, these updates can be delayed, and the worker is then evicted due to expiration.
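One way to watch for the expiration described above is to poll the hash and compare each worker's expiry against the current time. The following is only a rough sketch under stated assumptions: it assumes each hash value is a JSON object with an `expireAt` field in epoch milliseconds (the field name and unit are assumptions about the stored format, not confirmed in this thread), and it uses the third-party `redis` Python client. The host, port, and polling interval are placeholders.

```python
import json
import time


def classify_workers(entries, now_ms):
    """Split worker entries into (live, expired) lists of worker names.

    `entries` maps worker name -> JSON string, as returned by HGETALL.
    ASSUMPTION: each value carries an `expireAt` field in epoch milliseconds.
    """
    live, expired = [], []
    for name, raw in entries.items():
        # int() accepts both a JSON number and a numeric string.
        expire_at = int(json.loads(raw)["expireAt"])
        (live if expire_at > now_ms else expired).append(name)
    return sorted(live), sorted(expired)


def poll_redis(host="localhost", port=6379, key="Workers_storage", interval_s=5):
    """Print live and expired workers every `interval_s` seconds (runs forever)."""
    import redis  # third-party client: pip install redis

    client = redis.Redis(host=host, port=port, decode_responses=True)
    while True:
        now_ms = int(time.time() * 1000)
        live, expired = classify_workers(client.hgetall(key), now_ms)
        print(now_ms, "live:", live, "expired:", expired)
        time.sleep(interval_s)
```

Running `poll_redis()` against the Redis instance used by the backplane during a build should show whether a worker's entry stops being refreshed shortly before the server logs the warning. With a 10s refresh and a 30s expiry, a worker would have to miss roughly three consecutive updates before it appears as expired.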
The question is: Are you transitioning to this state from a working one where the worker is available, or was it never accessible in the first place?
> The question is: Are you transitioning to this state from a working one where the worker is available, or was it never accessible in the first place?
I believe the workers are available and accessible, and are just sometimes lost, because files in the cache directory are updated continuously during the build stage.
It also seems this issue may be resolved once #1724 is fixed. This is just my guess, but it may occur when buildfarm-shard and/or Redis are under heavy load.
Anyway, this issue can be avoided, so I'll close it. Thank you for your help.