
expire Operation in backplane

Open coder1363691 opened this issue 2 years ago • 5 comments

I recently started using Buildfarm, and I noticed that when the Buildfarm server puts an Operation into the OperationQueue in the backplane, if there is no suitable worker to match this Operation, it will remain in the backplane indefinitely. Do we have any timeout or expiration parameters to handle this situation? Thanks.

coder1363691 avatar Oct 24 '23 07:10 coder1363691

There are no timeouts for this per se - these entries essentially represent operations in flight that are waiting for appropriate workers to pull them off the mentioned queue. I'll turn this around and ask: what behavior would you expect when there are effectively no workers available to pull from a queue right now? There are implications here that suggest we need some tolerance for transiency - should it be size-based or age-based? Should there be a TTL on exiting a queue? Would a TTL (essentially introducing a maximum latency for a queue) be appropriate time-wise, since it would slow down every misconfigured build's actions by this amount each time? We could also use active configuration awareness from workers, on servers, to decide whether any currently available worker could pull from this queue, assuming it matches its requirements.

All of this requires additional monitoring and consensus. For what is essentially misconfiguration, is this required? Or could an external system monitor fill the role of detecting this state and alerting an administrator?

werkt avatar Oct 24 '23 11:10 werkt

@werkt Thank you for the discussion. Indeed, misconfigurations like this should not occur, or should occur very rarely. However, once one happens, it results in a strange situation that requires manual intervention in the backplane to resolve. Ideally, I would like to set a timeout for operations in the OperationQueue, such as OperationQueueTimeout. When scanning the queue, if we find that a task has been stalled for too long, we can terminate that operation and notify the client that the task has timed out. This differs from the actual execution timeout configuration, known as defaultActionTimeout: OperationQueueTimeout would be the maximum time an operation may wait without being consumed, while defaultActionTimeout applies to tasks that are consumed but may time out during execution. However, this is just a possible suggestion. If necessary, I may try to fix this issue locally first, and if there are further ideas we can continue the discussion.
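To make the distinction concrete, here is a minimal standalone sketch (not buildfarm code; the names `isQueueExpired` and `maxQueueTimeout` are hypothetical) of the proposed queue-timeout check, which compares how long an entry has sat in the queue against a configured ceiling, independent of any execution timeout:

```java
import java.time.Duration;
import java.time.Instant;

public class QueueTimeoutSketch {
  // Hypothetical queue-timeout check: an entry queued before
  // (now - maxQueueTimeout) is considered expired. This is distinct from
  // defaultActionTimeout, which only applies once execution has started.
  static boolean isQueueExpired(Instant queuedAt, Instant now, Duration maxQueueTimeout) {
    return Duration.between(queuedAt, now).compareTo(maxQueueTimeout) > 0;
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    // Queued two hours ago with a one-hour ceiling: expired.
    System.out.println(isQueueExpired(now.minusSeconds(7200), now, Duration.ofHours(1)));
    // Queued one minute ago: still within its window.
    System.out.println(isQueueExpired(now.minusSeconds(60), now, Duration.ofHours(1)));
  }
}
```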

coder1363691 avatar Oct 25 '23 03:10 coder1363691

@werkt Hello, following up on the previous discussion, I made some modifications: when an Operation stays in the OperationQueue too long, I attempt to clean it up. It seems to work fine in my tests:

           protected void visit(QueueEntry queueEntry, String queueEntryJson) {
             onOperationName.accept(queueEntry.getExecuteEntry().getOperationName());
+            // check whether the task has exceeded its queue timeout
+            long queuedAt = queueEntry.getExecuteEntry().getQueuedTimestamp().getSeconds();
+            long now = System.currentTimeMillis() / 1000;
+            long durationTime = now - queuedAt;
+            if (durationTime > configs.getBackplane().getMaxQueueTimeout()) {
+              Status status =
+                  Status.newBuilder()
+                      .setCode(Code.CANCELLED.getNumber())
+                      .setMessage("Operation Queued Timeout")
+                      .build();
+              ExecuteEntry executeEntry = queueEntry.getExecuteEntry();
+              ExecuteOperationMetadata metadata =
+                  ExecuteOperationMetadata.newBuilder()
+                      .setActionDigest(executeEntry.getActionDigest())
+                      .setStdoutStreamName(executeEntry.getStdoutStreamName())
+                      .setStderrStreamName(executeEntry.getStderrStreamName())
+                      .setStage(ExecutionStage.Value.COMPLETED)
+                      .build();
+              Operation queueTimeoutOperation =
+                  Operation.newBuilder()
+                      .setName(executeEntry.getOperationName())
+                      .setDone(true)
+                      .setMetadata(Any.pack(metadata))
+                      .setResponse(Any.pack(ExecuteResponse.newBuilder().setStatus(status).build()))
+                      .build();
+              // publish the completed (cancelled) operation status to the client
+              try {
+                putOperation(queueTimeoutOperation, ExecutionStage.Value.COMPLETED);
+              } catch (IOException e) {
+                log.log(Level.SEVERE, format("Error putting expired %s", executeEntry.getOperationName()), e);
+              }
+              // remove the expired operation from the queue
+              if (!state.operationQueue.removeFromQueue(jedis, queueEntryJson)) {
+                log.log(Level.WARNING, format("removeFromQueue %s failed", executeEntry.getOperationName()));
+              } else {
+                queueTimeoutCounter.inc();
+                log.log(Level.WARNING, format("Operation queue entry expired: %s", executeEntry.getOperationName()));
+              }
+            }
           }

In addition, I found an interesting issue while using bazel-remote (https://github.com/buchgr/bazel-remote/) as a CAS (Content Addressable Storage) service. When a previously uploaded (hash, blob) pair already exists in the CAS, this line https://github.com/bazelbuild/bazel-buildfarm/blob/8d6e93fe0798978bff997c78458ac00fc35d0eeb/src/main/java/build/buildfarm/common/grpc/StubWriteOutputStream.java#L258C8-L258C8 throws an exception without properly handling the writeObserver. This leads to a goroutine leak on the bazel-remote side. I made some modifications to the logic locally, and it seems to work as expected now. Do you have any ideas about this?

   public void write(byte[] b, int off, int len) throws IOException {
     if (isComplete()) {
+      synchronized (this) {
+        if (writeObserver != null) {
+          writeObserver.onCompleted();
+          writeObserver = null;
+        }
+      }
       throw new WriteCompleteException();
     }
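The synchronized null-check above is one way to make completion idempotent when multiple callers may race. As a standalone illustration (not buildfarm code; the class and method names are hypothetical), the same "complete at most once" pattern can also be expressed with an AtomicBoolean:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Standalone sketch of a "complete at most once" guard: only the first
// caller to tryComplete() wins; every later call returns false. This has
// the same effect as the synchronized writeObserver null-check above.
public class CompleteOnceGuard {
  private final AtomicBoolean completed = new AtomicBoolean(false);

  /** Returns true only for the first caller; false thereafter. */
  public boolean tryComplete() {
    return completed.compareAndSet(false, true);
  }

  public static void main(String[] args) {
    CompleteOnceGuard guard = new CompleteOnceGuard();
    System.out.println(guard.tryComplete()); // prints: true
    System.out.println(guard.tryComplete()); // prints: false
  }
}
```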

coder1363691 avatar Dec 13 '23 07:12 coder1363691

That change to the visitor for queue lifetime looks pretty good - can comment further if you put up a PR for it.

Wasn't aware of that leak. I've encountered some recent edge cases around StubWriteOutputStream in general that make me concerned about its resource implications overall. I don't think your change covers quite enough of the cases where we should complete the write observer, though - I would rather we do this immediately, in response to the onNext reception, so that we're properly triggering off of events (which aren't dependent upon the client actually calling write, or any other method).

werkt avatar Dec 13 '23 14:12 werkt

@werkt OK, I will try a PR later. BTW, I saw your commit here: https://github.com/bazelbuild/bazel-buildfarm/commit/dd5c87bdc8b5b18bd039e21cd87b8a40aa289d49 - I think always handling writeObserver.onCompleted() in close() is a better idea. I tested it and it works fine, and I will patch my local copy to match that commit.

coder1363691 avatar Dec 15 '23 02:12 coder1363691