bazel-buildfarm
ShardInstance does not implement BatchReadBlobs
When I run an action that creates lots of files and directories, the server fails. I haven't found an exact limit - sometimes it fails with 80 directories with one file each, other times at 30. Here's a simple example to repro. Use this bash script:
$ cat /tmp/f.bash
#! /bin/bash
outdir=path/to/new/directory
mkdir -p $outdir
# Create 300 directories and files with the date in them
for i in {1..300}; do
  mkdir $outdir/$i && \
  date > $outdir/$i/$i.txt;
done
And ask the rexec tool to run this script remotely:
$ bazel-bin/go/cmd/rexec/rexec_/rexec --logtostderr --service_no_security=true \
--service=localhost:8980 --exec_root=/tmp \
--inputs=f.bash --output_directories=path/to/new/directory \
--download_outputs --download_outerr -- /bin/bash f.bash
W0801 11:56:36.162397 64778 client.go:694] Instance name was not specified.
I0801 11:56:36.162437 64778 client.go:699] Connecting to remote execution instance
I0801 11:56:36.162440 64778 client.go:700] Connecting to remote execution service localhost:8980
Remote execution error: rpc error: code = Unknown desc = retry budget exhausted (6 attempts): .
The server then fails with this exception:
$ bazel run //src/main/java/build/buildfarm:buildfarm-server -- \
--jvm_flag=-Djava.util.logging.config.file=$(pwd)/examples/debug.logging.properties \
$(pwd)/examples/shard-server.config.example
[...]
SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@fc5a1c6 [Mon Aug 01 11:56:38 PDT 2022]
java.lang.NullPointerException
at build.buildfarm.instance.server.AbstractServerInstance.getAllBlobsFuture(AbstractServerInstance.java:404)
at build.buildfarm.server.ContentAddressableStorageService.batchReadBlobs(ContentAddressableStorageService.java:248)
at build.buildfarm.server.ContentAddressableStorageService.batchReadBlobs(ContentAddressableStorageService.java:270)
at build.bazel.remote.execution.v2.ContentAddressableStorageGrpc$MethodHandlers.invoke(ContentAddressableStorageGrpc.java:413)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at io.grpc.util.TransmitStatusRuntimeExceptionInterceptor$1.onHalfClose(TransmitStatusRuntimeExceptionInterceptor.java:74)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:797)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Creating this here to see if anyone has pointers on where to start poking.
Hmm, I'm realizing now that this might also be a problem with the rexec tool itself, somehow making the "completed" call and then asking the server for the next file. Let me pull on this thread.
Regardless, we shouldn't be NPEing. I'll look from my side.
The problem seems to be that shard.ShardInstance does not implement the BatchReadBlobs API.
From what I can tell, this has nothing to do with the number of directories etc. It's just the API used to download files. I was able to repro the problem by running a Shard Server and calling this API via grpcurl:
$ grpcurl -plaintext -d @ localhost:8980 build.bazel.remote.execution.v2.ContentAddressableStorage.BatchReadBlobs <<EOM
{
"digests" : {"hash": "3b3f22ecd9f6382fb7e4897516c30b6b39bbbcd665123b3f699a2d5688cbe8f6", "size_bytes": 233}
}
EOM
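For anyone who wants to try the same call with a different blob, here is a minimal sketch of how the two fields of a digest could be computed. It assumes the server uses SHA-256 (the 64-character hash above suggests so, but I have not confirmed it); `DigestSketch`, `sha256Hex`, and `digestJson` are hypothetical names for illustration, not part of buildfarm.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestSketch {
  // Lowercase hex SHA-256 of the blob contents (assumed digest function).
  static String sha256Hex(byte[] blob) throws NoSuchAlgorithmException {
    byte[] hash = MessageDigest.getInstance("SHA-256").digest(blob);
    StringBuilder hex = new StringBuilder(hash.length * 2);
    for (byte b : hash) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  // Renders one digest in the same JSON shape used in the grpcurl request.
  static String digestJson(byte[] blob) throws NoSuchAlgorithmException {
    return String.format(
        "{\"hash\": \"%s\", \"size_bytes\": %d}", sha256Hex(blob), blob.length);
  }

  public static void main(String[] args) throws Exception {
    // Example blob; replace with the bytes of a file that is in the CAS.
    byte[] blob = "hello\n".getBytes(StandardCharsets.UTF_8);
    System.out.println(digestJson(blob));
  }
}
```

The printed object can be pasted in as the value of "digests" in the request body. The blob has to exist in the CAS already for a read to be meaningful.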
The NullPointerException is triggered because AbstractServerInstance.java defines getAllBlobsFuture as follows:
  @Override
  public ListenableFuture<Iterator<Response>> getAllBlobsFuture(Iterable<Digest> digests) {
    return contentAddressableStorage.getAllFuture(digests);
  }
But ShardInstance sets contentAddressableStorage to null in the constructor. I've updated the title/summary to reflect this.
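To make the failure mode concrete, here is a stripped-down sketch of the pattern, not the actual buildfarm code. All class names are hypothetical stand-ins, and I use the JDK's CompletableFuture in place of Guava's ListenableFuture so the example is self-contained. It shows a base class that delegates to a field a subclass leaves null, and one possible stopgap: overriding the method so the caller gets a failed future with a clear message instead of an NPE.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class NullCasSketch {
  /** Hypothetical stand-in for the CAS interface. */
  interface Cas {
    CompletableFuture<List<String>> getAllFuture(List<String> digests);
  }

  /** Mirrors the shape of the base class: delegates blindly to the field. */
  static class BaseInstance {
    protected final Cas contentAddressableStorage;

    BaseInstance(Cas cas) {
      this.contentAddressableStorage = cas;
    }

    CompletableFuture<List<String>> getAllBlobsFuture(List<String> digests) {
      // Throws NullPointerException when the field is null.
      return contentAddressableStorage.getAllFuture(digests);
    }
  }

  /** Passes null to the base class and inherits the unsafe method. */
  static class UnguardedShardInstance extends BaseInstance {
    UnguardedShardInstance() {
      super(null);
    }
  }

  /** Also passes null, but overrides the method to fail explicitly. */
  static class GuardedShardInstance extends BaseInstance {
    GuardedShardInstance() {
      super(null);
    }

    @Override
    CompletableFuture<List<String>> getAllBlobsFuture(List<String> digests) {
      return CompletableFuture.failedFuture(
          new UnsupportedOperationException("BatchReadBlobs is not implemented"));
    }
  }

  public static void main(String[] args) {
    List<String> digests = List.of("3b3f22ec/233");
    try {
      new UnguardedShardInstance().getAllBlobsFuture(digests);
    } catch (NullPointerException e) {
      System.out.println("unguarded: NullPointerException");
    }
    boolean failed =
        new GuardedShardInstance().getAllBlobsFuture(digests).isCompletedExceptionally();
    System.out.println("guarded: failed future = " + failed);
  }
}
```

A guard like this only turns the crash into a reportable error. The real fix is still for the shard instance to serve the blobs.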
Nice find. This just makes me want to be definitive about my desire to have the shard cas be a real boy. I'll see about getting something up and in place for this.