ShardInstance does not implement BatchReadBlobs

Open bshashank opened this issue 3 years ago • 4 comments

When I run an action that creates lots of files and directories, the server fails. I haven't found an exact limit: sometimes it fails with 80 directories with one file each, other times with 30. Here's a simple repro. Use this bash script:

$ cat /tmp/f.bash
#! /bin/bash
outdir=path/to/new/directory
mkdir -p $outdir

# Create 300 directories and files with the date in them
for i in {1..300}; do
    mkdir $outdir/$i && \
        date > $outdir/$i/$i.txt;
done

And ask the rexec tool to run this script remotely:

$ bazel-bin/go/cmd/rexec/rexec_/rexec --logtostderr --service_no_security=true \
 --service=localhost:8980 --exec_root=/tmp \
 --inputs=f.bash --output_directories=path/to/new/directory \
 --download_outputs --download_outerr -- /bin/bash f.bash
W0801 11:56:36.162397   64778 client.go:694] Instance name was not specified.
I0801 11:56:36.162437   64778 client.go:699] Connecting to remote execution instance 
I0801 11:56:36.162440   64778 client.go:700] Connecting to remote execution service localhost:8980
Remote execution error: rpc error: code = Unknown desc = retry budget exhausted (6 attempts): .

The server then fails with:

$ bazel run //src/main/java/build/buildfarm:buildfarm-server -- \
   --jvm_flag=-Djava.util.logging.config.file=$(pwd)/examples/debug.logging.properties \
  $(pwd)/examples/shard-server.config.example
[...]
SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@fc5a1c6 [Mon Aug 01 11:56:38 PDT 2022]
java.lang.NullPointerException
        at build.buildfarm.instance.server.AbstractServerInstance.getAllBlobsFuture(AbstractServerInstance.java:404)
        at build.buildfarm.server.ContentAddressableStorageService.batchReadBlobs(ContentAddressableStorageService.java:248)
        at build.buildfarm.server.ContentAddressableStorageService.batchReadBlobs(ContentAddressableStorageService.java:270)
        at build.bazel.remote.execution.v2.ContentAddressableStorageGrpc$MethodHandlers.invoke(ContentAddressableStorageGrpc.java:413)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
        at io.grpc.util.TransmitStatusRuntimeExceptionInterceptor$1.onHalfClose(TransmitStatusRuntimeExceptionInterceptor.java:74)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
        at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:797)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

Creating this here to see if anyone has pointers on where to start poking.

bshashank avatar Aug 01 '22 19:08 bshashank

hmm, I'm realizing now that this might also be a problem with the rexec tool itself, somehow calling the "completed" call and then asking for the next file from the server. Let me pull on this thread.

bshashank avatar Aug 01 '22 19:08 bshashank

Regardless we shouldn't be NPEing in any case. I'll look from my side

werkt avatar Aug 01 '22 22:08 werkt

The problem seems to be that shard.ShardInstance does not implement the BatchReadBlobs API.

From what I can tell, this has nothing to do with the number of directories etc. It's just the API used to download files. I was able to repro the problem by running a Shard Server and calling this API via grpcurl:

$ grpcurl -plaintext -d @ localhost:8980 build.bazel.remote.execution.v2.ContentAddressableStorage.BatchReadBlobs <<EOM  
{
  "digests" : {"hash": "3b3f22ecd9f6382fb7e4897516c30b6b39bbbcd665123b3f699a2d5688cbe8f6", "size_bytes": 233}
}
EOM

The NullPointerException is triggered because AbstractServerInstance.java defines getAllBlobsFuture as follows:

  @Override
  public ListenableFuture<Iterator<Response>> getAllBlobsFuture(Iterable<Digest> digests) {
    return contentAddressableStorage.getAllFuture(digests);
  }

But ShardInstance sets contentAddressableStorage to null in the constructor. I've updated the title/summary to reflect this.
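
To make the failure mode concrete, here is a minimal standalone sketch of the pattern described above. None of this is buildfarm's actual code: the class names (NullCasSketch, Cas, BaseInstance, ShardLikeInstance, GuardedInstance) are hypothetical stand-ins, digests are plain strings, and CompletableFuture is used in place of ListenableFuture so it runs with only the JDK (Java 9+). The guard shown is just one possible way to avoid the NPE, not necessarily the fix the project would choose.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins for illustration only; not buildfarm's real classes.
public class NullCasSketch {
  // Stand-in for the content addressable storage interface.
  interface Cas {
    CompletableFuture<List<String>> getAllFuture(List<String> digests);
  }

  // Mirrors the shape of the base class: it delegates straight to the CAS field.
  static class BaseInstance {
    protected final Cas contentAddressableStorage;

    BaseInstance(Cas cas) {
      this.contentAddressableStorage = cas;
    }

    CompletableFuture<List<String>> getAllBlobsFuture(List<String> digests) {
      // Throws NullPointerException when the field is null.
      return contentAddressableStorage.getAllFuture(digests);
    }
  }

  // Mirrors the reported situation: the subclass passes null for the CAS.
  static class ShardLikeInstance extends BaseInstance {
    ShardLikeInstance() {
      super(null);
    }
  }

  // One possible guard (an assumption): fail the future with a clear error instead.
  static class GuardedInstance extends BaseInstance {
    GuardedInstance() {
      super(null);
    }

    @Override
    CompletableFuture<List<String>> getAllBlobsFuture(List<String> digests) {
      if (contentAddressableStorage == null) {
        return CompletableFuture.failedFuture(
            new UnsupportedOperationException("BatchReadBlobs is not implemented here"));
      }
      return super.getAllBlobsFuture(digests);
    }
  }

  public static void main(String[] args) {
    List<String> digests = List.of("example-hash/233"); // placeholder digest string

    try {
      new ShardLikeInstance().getAllBlobsFuture(digests);
    } catch (NullPointerException e) {
      System.out.println("unguarded call threw NullPointerException");
    }

    boolean failed = new GuardedInstance().getAllBlobsFuture(digests).isCompletedExceptionally();
    System.out.println("guarded call completed exceptionally: " + failed);
  }
}
```

With a guard like this the caller would get a failed future it can map to a proper gRPC status, rather than an uncaught exception on the server thread.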

bshashank avatar Aug 05 '22 05:08 bshashank

Nice find. This just makes me want to be definitive about my desire to have the shard cas be a real boy. I'll see about getting something up and in place for this.

werkt avatar Aug 10 '22 21:08 werkt