
[zstd] Compressed artifacts are corrupted

Open · brujoand opened this issue 2 years ago · 4 comments

Running buildfarm 2.0.0 and bazel 5.3.1 with these flags:

common:remote --remote_retries=3 --remote_timeout=60s --remote_local_fallback --remote_executor="grpc://buildfarm.url:31310"
common:remote --experimental_remote_merkle_tree_cache --experimental_remote_cache_compression --experimental_remote_cache_async
common:remote --remote_default_exec_properties=cores=1 --remote_default_exec_properties=test.cores=2

Produces many instances of this error:

SEVERE: error writing data for uploads/1c09d1a8-69d1-4042-9bbe-8050c8bf4f61/compressed-blobs/zstd/94f7a39e174952273b01ce0107e8fe6c1b7098a6cd396244805552ba93ae0053/1808946688 [Wed Nov 02 07:33:10 GMT 2022]
java.io.IOException: Decompression error: Corrupted block detected
	at com.github.luben.zstd.ZstdInputStreamNoFinalizer.readInternal(ZstdInputStreamNoFinalizer.java:171)
	at com.github.luben.zstd.ZstdInputStreamNoFinalizer.read(ZstdInputStreamNoFinalizer.java:123)
	at com.google.protobuf.ByteString.readChunk(ByteString.java:540)
	at com.google.protobuf.ByteString.readFrom(ByteString.java:517)
	at com.google.protobuf.ByteString.readFrom(ByteString.java:485)
	at build.buildfarm.common.ZstdDecompressingOutputStream.write(ZstdDecompressingOutputStream.java:63)
	at build.buildfarm.cas.cfc.CASFileCache$8.write(CASFileCache.java:2722)
	at build.buildfarm.cas.cfc.WriteOutputStream.write(WriteOutputStream.java:43)
	at build.buildfarm.cas.cfc.WriteOutputStream.write(WriteOutputStream.java:43)
	at build.buildfarm.cas.cfc.CASFileCache$5.write(CASFileCache.java:1039)
	at build.buildfarm.cas.cfc.WriteOutputStream.write(WriteOutputStream.java:43)
	at build.buildfarm.cas.cfc.CASFileCache$UniqueWriteOutputStream.write(CASFileCache.java:772)
	at build.buildfarm.cas.cfc.CASFileCache$UniqueWriteOutputStream.write(CASFileCache.java:764)
	at com.google.protobuf.ByteString$LiteralByteString.writeTo(ByteString.java:1381)
	at build.buildfarm.common.services.WriteStreamObserver.writeData(WriteStreamObserver.java:392)
	at build.buildfarm.common.services.WriteStreamObserver.handleWrite(WriteStreamObserver.java:365)
	at build.buildfarm.common.services.WriteStreamObserver.handleRequest(WriteStreamObserver.java:298)
	at build.buildfarm.common.services.WriteStreamObserver.initialize(WriteStreamObserver.java:244)
	at build.buildfarm.common.services.WriteStreamObserver.onUncommittedNext(WriteStreamObserver.java:123)
	at build.buildfarm.common.services.WriteStreamObserver.onNext(WriteStreamObserver.java:102)
	at build.buildfarm.common.services.WriteStreamObserver.onNext(WriteStreamObserver.java:55)
	at io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309)
	at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292)
	at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:765)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

The builds still mostly work, but are much slower and sometimes fail due to missing blobs. Could this somehow be related to the client timing out and ending the stream prematurely? If not, how could I best proceed to debug this?

— brujoand, Nov 03 '22
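[Editor's note] One way to start debugging, as asked above: replay a captured compressed payload through the same zstd-jni stream class that throws in the stack trace, and verify the digest of the inflated output. A minimal sketch, assuming you have the compressed bytes on disk and the SHA-256 hash from the upload resource name (the class name and argument handling below are illustrative, not buildfarm code):

import com.github.luben.zstd.ZstdInputStreamNoFinalizer;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ZstdBlobCheck {
  public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
    String path = args[0];          // captured compressed payload
    String expectedHash = args[1];  // hash component of the upload resource name
    MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
    long uncompressedSize = 0;
    // Decompress with the same zstd-jni class that fails in the server stack trace.
    try (InputStream in = new ZstdInputStreamNoFinalizer(new FileInputStream(path))) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        sha256.update(buf, 0, n);
        uncompressedSize += n;
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : sha256.digest()) {
      hex.append(String.format("%02x", b));
    }
    System.out.printf("uncompressed=%d digest=%s match=%b%n",
        uncompressedSize, hex, hex.toString().equals(expectedHash));
  }
}

If the payload decompresses cleanly in isolation, the corruption is more likely introduced by truncated or resumed writes on the server side than by the compressor itself.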

On the client I'm seeing this:

WARNING: Remote Cache: reached end of stream after skipping 244341248 bytes; 290881536 bytes expected

— brujoand, Nov 03 '22
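[Editor's note] That warning comes from the client's upload-resume path: on retry the client asks the server how many bytes were committed, then skips that many bytes of its local stream before continuing. A minimal self-contained illustration of the arithmetic, assuming (as the next comment suggests) that the server reports a count in the wrong unit while the client skips through its compressed stream; the class and sizes below are stand-ins, not Bazel's actual code:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SkipMismatchDemo {
  public static void main(String[] args) throws IOException {
    // Stand-in for the client's local compressed stream (shorter than the
    // uncompressed blob whenever zstd actually deflates the data).
    byte[] compressedLocal = new byte[244_341];
    // Stand-in for a committed_size reported in uncompressed bytes.
    long committedReported = 290_881;
    InputStream in = new ByteArrayInputStream(compressedLocal);
    long skipped = in.skip(committedReported);
    if (skipped < committedReported) {
      // Mirrors the shape of the client warning above.
      System.out.printf(
          "reached end of stream after skipping %d bytes; %d bytes expected%n",
          skipped, committedReported);
    }
  }
}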

Deleting my previous recommendation. I believe this was caused by the server not reporting an uncompressed committed size, combined with a client restart. Of note, however, is that the client's skip for the uncompressed completion goes beyond the size of the blob, which may mean that you're actually experiencing inflation rather than deflation with zstd (easily possible if an input blob is already gzipped). I've put up https://github.com/werkt/bazel-buildfarm/tree/zstd-committed-size to correct the reporting violation.

— werkt, Nov 21 '22
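[Editor's note] To make the size discrepancy concrete: for a compressed-blobs upload, the number of bytes on the wire and the number of bytes stored after inflation differ, so which count the server returns as the committed size determines how far the client skips on resume. A minimal sketch using zstd-jni's one-shot API (illustrative only; the linked branch may implement the fix differently):

import com.github.luben.zstd.Zstd;

import java.nio.charset.StandardCharsets;

public class CommittedSizeDemo {
  public static void main(String[] args) {
    byte[] original =
        "some highly repetitive build output ".repeat(10_000)
            .getBytes(StandardCharsets.UTF_8);
    byte[] compressed = Zstd.compress(original);
    // What the client sent over the wire for a compressed-blobs upload:
    long wireBytes = compressed.length;
    // What the server holds after inflating the stream into the CAS:
    long storedBytes = original.length;
    // If the server reports one count while the client tracks the other,
    // a restarted upload will skip the wrong distance through its stream.
    System.out.printf("wire=%d stored=%d%n", wireBytes, storedBytes);
  }
}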

@brujoand have you tried the tree mentioned above? Did it address the issue? If so, I'd like to merge it.

— werkt, Jun 19 '23

@werkt are you running this patch, or are you no longer having the issue? I've hit a few spikes of "Decompression error: Corrupted block detected" recently, and I'm testing this out for a few days to see if it helps. The change seems like it's in the right direction nonetheless.

— jerrymarino, Sep 22 '23