bazel-buildfarm
bazel-buildfarm copied to clipboard
CANCELLED: io.grpc.Context was cancelled without error
Good afternoon 👋 . I'm using buildfarm to build / run some large iOS tests and was hitting some opauqe GRPC errors:
Buildfarm-server ( commit: 1c99c7a60e399bedf5fc8700bd3f3be482144207 ) - redis shard backplane
I end up getting a ton of CANCELLED: io.grpc.Context was cancelled without error in the server logs and consequently on the Bazel client. Do you have any suggestions about how to most effectively diagnose this or have a pointer about where the cause is rooted? I imagined it to stem from a worker issue at first
2022-09-14T19:36:23.747-07:00 [FINE ] io.grpc.internal.ClientCallImpl logIfContextNarrowedTimeout - Call timeout set to '59999689958' ns, due to context deadline. Explicit call timeout was not set.
2022-09-14T19:36:23.752-07:00 [SEVERE ] build.buildfarm.instance.shard.ShardInstance$11 onFailure - failed to requeue: shard/operations/36a4313b-3f79-40a9-9f38-37825a02dc13
2022-09-14T19:36:23.752-07:00 io.grpc.StatusException: CANCELLED: io.grpc.Context was cancelled without error
2022-09-14T19:36:23.752-07:00 at io.grpc.Status.asException(Status.java:551)
2022-09-14T19:36:23.752-07:00 at io.grpc.protobuf.StatusProto.toStatusException(StatusProto.java:85)
2022-09-14T19:36:23.752-07:00 at build.buildfarm.instance.server.AbstractServerInstance.validateInputs(AbstractServerInstance.java:964)
2022-09-14T19:36:23.752-07:00 at build.buildfarm.instance.server.AbstractServerInstance.validateQueuedOperationAndInputs(AbstractServerInstance.java:996)
2022-09-14T19:36:23.752-07:00 at build.buildfarm.instance.shard.ShardInstance.lambda$validateAndRequeueOperation$14(ShardInstance.java:1563)
2022-09-14T19:36:23.752-07:00 at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:213)
2022-09-14T19:36:23.752-07:00 at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:202)
2022-09-14T19:36:23.752-07:00 at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:118)
2022-09-14T19:36:23.752-07:00 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
2022-09-14T19:36:23.752-07:00 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
2022-09-14T19:36:23.752-07:00 at java.base/java.lang.Thread.run(Thread.java:833)
2022-09-14T19:36:23.752-07:00
2022-09-14T19:36:23.752-07:00 [INFO ] build.buildfarm.instance.server.AbstractServerInstance logFailedStatus - e36020e1256535324cc8b71bcc98ba927b863fabcda76fc9699cc91200a7630b/164: CANCELLED: io.grpc.Context was cancelled without error
2022-09-14T19:36:23.752-07:00
2022-09-14T19:36:23.753-07:00 [SEVERE ] build.buildfarm.instance.shard.ShardInstance$11 onFailure - failed to requeue: shard/operations/f1a86992-fafb-4801-ad7d-06208013c5b8
2022-09-14T19:36:23.753-07:00 io.grpc.StatusException: CANCELLED: io.grpc.Context was cancelled without error
2022-09-14T19:36:23.753-07:00 at io.grpc.Status.asException(Status.java:551)
2022-09-14T19:36:23.753-07:00 at io.grpc.protobuf.StatusProto.toStatusException(StatusProto.java:85)
2022-09-14T19:36:23.753-07:00 at build.buildfarm.instance.server.AbstractServerInstance.validateInputs(AbstractServerInstance.java:964)
2022-09-14T19:36:23.753-07:00 at build.buildfarm.instance.server.AbstractServerInstance.validateQueuedOperationAndInputs(AbstractServerInstance.java:996)
2022-09-14T19:36:23.753-07:00 at build.buildfarm.instance.shard.ShardInstance.lambda$validateAndRequeueOperation$14(ShardInstance.java:1563)
2022-09-14T19:36:23.753-07:00 at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:213)
2022-09-14T19:36:23.753-07:00 at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:202)
2022-09-14T19:36:23.753-07:00 at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:118)
2022-09-14T19:36:23.753-07:00 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
2022-09-14T19:36:23.753-07:00 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
2022-09-14T19:36:23.753-07:00 at java.base/java.lang.Thread.run(Thread.java:833)
2022-09-14T19:36:23.753-07:00
2022-09-14T19:36:23.753-07:00 [INFO ] build.buildfarm.instance.server.AbstractServerInstance logFailedStatus - 9b70cccb195f62db135941290c9c0681ae0e128b277dddc1881937fa3e8aa81c/164: CANCELLED: io.grpc.Context was cancelled without error
2022-09-14T19:36:23.753-07:00
2022-09-14T19:36:23.753-07:00 [INFO ] build.buildfarm.instance.shard.DispatchedMonitor lambda$requeueDispatchedOperation$0 - DispatchedMonitor::run: requeue(shard/operations/36a4313b-3f79-40a9-9f38-37825a02dc13) 10.9057ms
2022-09-14T19:36:23.754-07:00 [SEVERE ] build.buildfarm.instance.shard.ShardInstance$11 onFailure - failed to requeue: shard/operations/e2ec6be9-3469-4a51-b00c-4ecb484d92a8
2022-09-14T19:36:23.754-07:00 io.grpc.StatusException: CANCELLED: io.grpc.Context was cancelled without error
2022-09-14T19:36:23.754-07:00 at io.grpc.Status.asException(Status.java:551)
2022-09-14T19:36:23.754-07:00 at io.grpc.protobuf.StatusProto.toStatusException(StatusProto.java:85)
2022-09-14T19:36:23.754-07:00 at build.buildfarm.instance.server.AbstractServerInstance.validateInputs(AbstractServerInstance.java:964)
2022-09-14T19:36:23.754-07:00 at build.buildfarm.instance.server.AbstractServerInstance.validateQueuedOperationAndInputs(AbstractServerInstance.java:996)
2022-09-14T19:36:23.754-07:00 at build.buildfarm.instance.shard.ShardInstance.lambda$validateAndRequeueOperation$14(ShardInstance.java:1563)
2022-09-14T19:36:23.754-07:00 at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:213)
2022-09-14T19:36:23.754-07:00 at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:202)
2022-09-14T19:36:23.754-07:00 at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:118)
2022-09-14T19:36:23.754-07:00 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
2022-09-14T19:36:23.754-07:00 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
2022-09-14T19:36:23.754-07:00 at java.base/java.lang.Thread.run(Thread.java:833)
2022-09-14T19:36:23.754-07:00
2022-09-14T19:36:23.754-07:00 [INFO ] build.buildfarm.instance.server.AbstractServerInstance logFailedStatus - d44134242465c7333fae53898fc876e1e4d6828b8fb3d67edc99a406e192ae01/164: CANCELLED: io.grpc.Context was cancelled without error
2022-09-14T19:36:23.754-07:00
2022-09-14T19:36:23.755-07:00 [INFO ] build.buildfarm.instance.shard.DispatchedMonitor lambda$requeueDispatchedOperation$0 - DispatchedMonitor::run: requeue(shard/operations/f1a86992-fafb-4801-ad7d-06208013c5b8) 10.4294ms
2022-09-14T19:36:23.757-07:00 [INFO ] build.buildfarm.instance.shard.DispatchedMonitor lambda$requeueDispatchedOperation$0 - DispatchedMonitor::run: requeue(shard/operations/e2ec6be9-3469-4a51-b00c-4ecb484d92a8) 9.97738ms
2022-09-14T19:36:23.790-07:00 [SEVERE ] build.buildfarm.instance.shard.ShardInstance$11 onFailure - failed to requeue: shard/operations/3398ad2d-fd71-48c3-9c7d-bb82c7ea3b3b
2022-09-14T19:36:23.790-07:00 io.grpc.StatusException: CANCELLED: io.grpc.Context was cancelled without error
2022-09-14T19:36:23.790-07:00 at io.grpc.Status.asException(Status.java:551)
2022-09-14T19:36:23.790-07:00 at io.grpc.protobuf.StatusProto.toStatusException(StatusProto.java:85)
2022-09-14T19:36:23.790-07:00 at build.buildfarm.instance.server.AbstractServerInstance.validateInputs(AbstractServerInstance.java:964)
2022-09-14T19:36:23.790-07:00 at build.buildfarm.instance.server.AbstractServerInstance.validateQueuedOperationAndInputs(AbstractServerInstance.java:996)
2022-09-14T19:36:23.790-07:00 at build.buildfarm.instance.shard.ShardInstance.lambda$validateAndRequeueOperation$14(ShardInstance.java:1563)
2022-09-14T19:36:23.790-07:00 at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:213)
2022-09-14T19:36:23.790-07:00 at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:202)
2022-09-14T19:36:23.790-07:00 at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:118)
2022-09-14T19:36:23.790-07:00 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
2022-09-14T19:36:23.790-07:00 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
2022-09-14T19:36:23.790-07:00 at java.base/java.lang.Thread.run(Thread.java:833)
2022-09-14T19:36:23.790-07:00
2022-09-14T19:36:23.790-07:00 [INFO ] build.buildfarm.instance.server.AbstractServerInstance logFailedStatus - 06101e4c33b66c4ff6792764ea92c1081afc5d8c3f3c34bfa453790bcf0d32c1/164: CANCELLED: io.grpc.Context was cancelled without error
2022-09-14T19:36:23.790-07:00
2022-09-14T19:36:23.792-07:00 [INFO ] build.buildfarm.instance.shard.DispatchedMonitor lambda$requeueDispatchedOperation$0 - DispatchedMonitor::run: requeue(shard/operations/339
Ok it looks like I'm hitting some errors in the workers - running theorory is flake somewhere is causing these to cancel/requeue - need to dig a bit more: 🤔
For more context - I'm running with execute_stage_width > 1
Sep 15, 2022 3:08:44 PM build.buildfarm.cas.cfc.CASFileCache copyExternalInput
WARNING: error downloading e04a7107f252b654f9c12821af1c0b285c993866ac660fa27f1bf36fd59cdcbe/2758
java.nio.file.NoSuchFileException: e04a7107f252b654f9c12821af1c0b285c993866ac660fa27f1bf36fd59cdcbe/2758
at build.buildfarm.instance.shard.RemoteInputStreamFactory$1.onFailure(RemoteInputStreamFactory.java:260)
at build.buildfarm.instance.shard.ShardInstance$WorkersCallback.onSuccess(ShardInstance.java:982)
at build.buildfarm.instance.shard.ShardInstance$WorkersCallback.onSuccess(ShardInstance.java:972)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1080)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at com.google.common.util.concurrent.AbstractTransformFuture$TransformFuture.setResult(AbstractTransformFuture.java:247)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:163)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at com.google.common.util.concurrent.AbstractTransformFuture$TransformFuture.setResult(AbstractTransformFuture.java:247)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:163)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:321)
at com.google.common.util.concurrent.MoreExecutors$5.execute(MoreExecutors.java:1108)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:47)
at build.buildfarm.instance.shard.Util$1.complete(Util.java:121)
at build.buildfarm.instance.shard.Util$1.onSuccess(Util.java:134)
at build.buildfarm.instance.shard.Util$1.onSuccess(Util.java:118)
at build.buildfarm.instance.shard.Util$2.onSuccess(Util.java:157)
at build.buildfarm.instance.shard.Util$2.onSuccess(Util.java:154)
at build.buildfarm.instance.shard.Util$3.onSuccess(Util.java:191)
at build.buildfarm.instance.shard.Util$3.onSuccess(Util.java:182)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1080)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at com.google.common.util.concurrent.AbstractTransformFuture$TransformFuture.setResult(AbstractTransformFuture.java:247)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:163)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at io.grpc.stub.ClientCalls$GrpcFuture.set(ClientCalls.java:558)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:531)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:553)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:68)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:718)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Sep 15, 2022 3:08:44 PM build.buildfarm.worker.InputFetcher fetchPolled
SEVERE: error creating exec dir for shard/operations/1a21f12d-ac27-4461-977d-3bb33af7e9ef
build.buildfarm.worker.shard.CFCExecFileSystem$ExecDirException: /tmp/worker/shard/operations/1a21f12d-ac27-4461-977d-3bb33af7e9ef: 1 exceptions
at build.buildfarm.worker.shard.CFCExecFileSystem.checkExecErrors(CFCExecFileSystem.java:316)
at build.buildfarm.worker.shard.CFCExecFileSystem.createExecDir(CFCExecFileSystem.java:409)
at build.buildfarm.worker.shard.ShardWorkerContext.createExecDir(ShardWorkerContext.java:713)
at build.buildfarm.worker.InputFetcher.fetchPolled(InputFetcher.java:198)
at build.buildfarm.worker.InputFetcher.runInterruptibly(InputFetcher.java:101)
at build.buildfarm.worker.InputFetcher.run(InputFetcher.java:288)
at java.base/java.lang.Thread.run(Thread.java:833)
Suppressed: build.buildfarm.cas.cfc.CASFileCache$PutDirectoryException: /tmp/worker/cache/176e8f36659930b18980bbd5110caeac6265a2ee1b549ada6beabd00df938677_dir: 1 exceptions
at build.buildfarm.cas.cfc.CASFileCache.lambda$putDirectorySynchronized$20(CASFileCache.java:2138)
at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:213)
at com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:202)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:118)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1395)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
Suppressed: java.util.concurrent.ExecutionException: java.nio.file.NoSuchFileException: e04a7107f252b654f9c12821af1c0b285c993866ac660fa27f1bf36fd59cdcbe/2758
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:566)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:527)
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:88)
at build.buildfarm.cas.cfc.CASFileCache.lambda$putDirectorySynchronized$20(CASFileCache.java:2129)
... 9 more
Caused by: java.nio.file.NoSuchFileException: e04a7107f252b654f9c12821af1c0b285c993866ac660fa27f1bf36fd59cdcbe/2758
at build.buildfarm.instance.shard.RemoteInputStreamFactory$1.onFailure(RemoteInputStreamFactory.java:260)
at build.buildfarm.instance.shard.ShardInstance$WorkersCallback.onSuccess(ShardInstance.java:982)
at build.buildfarm.instance.shard.ShardInstance$WorkersCallback.onSuccess(ShardInstance.java:972)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1080)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at com.google.common.util.concurrent.AbstractTransformFuture$TransformFuture.setResult(AbstractTransformFuture.java:247)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:163)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at com.google.common.util.concurrent.AbstractTransformFuture$TransformFuture.setResult(AbstractTransformFuture.java:247)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:163)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:321)
at com.google.common.util.concurrent.MoreExecutors$5.execute(MoreExecutors.java:1108)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:47)
at build.buildfarm.instance.shard.Util$1.complete(Util.java:121)
at build.buildfarm.instance.shard.Util$1.onSuccess(Util.java:134)
at build.buildfarm.instance.shard.Util$1.onSuccess(Util.java:118)
at build.buildfarm.instance.shard.Util$2.onSuccess(Util.java:157)
at build.buildfarm.instance.shard.Util$2.onSuccess(Util.java:154)
at build.buildfarm.instance.shard.Util$3.onSuccess(Util.java:191)
at build.buildfarm.instance.shard.Util$3.onSuccess(Util.java:182)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1080)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at com.google.common.util.concurrent.AbstractTransformFuture$TransformFuture.setResult(AbstractTransformFuture.java:247)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:163)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1213)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:746)
at io.grpc.stub.ClientCalls$GrpcFuture.set(ClientCalls.java:558)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:531)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:553)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:68)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:718)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Update: it looks like this happens when I run with the perfect storm of "big/little non-homogenous arm64 CPU" + execution_width:nCpus ( maybe funny here ) concurrent actions, populated cache size, long uptime. I have a running theory it's a mix of load / concurrency I'm hitting these with to evoke it. Suddenly by running execution_width:1 - it seems to go away. I'm also testing this locally - with a pretty crappy network setting.
Mainly, I cannot reproduce this very well locally ( all clusters deployed on 1 machine ). Unfortunately, I'll need to work to have better logging on the macOS VM system do diagnose it outside of the errors I've gotten.
Your original issue belies two things: we're breaking a future chain between validateInputs's findMissingBlobs call and the chain in validateAndRequeueOperation, which is bad and prevents us from seeing the actual grpc request where things started failing.
And also that the findMissingBlobs is running with a cancelled context. Which means that somewhere in the nest of future sequences and transitions between executors, something has cancelled the active grpc context for the requeue. From some rudimentary tracing, I believe this is because of operationTransformService's cached thread pool, where one of its thread-local contexts has experienced a cancellation and not been scoped properly. There may be another error to start all of this, and the rest of the cancelled context exceptions are just innocent bystanders.