alluxio Add benchmark for ReadResponseMarshaller

What changes are proposed in this pull request?

Added a benchmark for the marshalling/unmarshalling performance of `ReadResponseMarshaller`, compared against the baseline marshaller implementation, `MessageMarshaller`.

Why are the changes needed?

To shed light on the performance characteristics of Alluxio's zero-copy implementation.

Does this PR introduce any user facing changes?

No

Jun 29 '22 23:06 YangchenYe323

(Screenshot: marshalling benchmark results)

A tentative run of this benchmark shows that `ReadResponseMarshaller` actually does worse on the `drainTo` method than the baseline implementation.

It does improve a lot on the baseline implementation's `read` method, which constructs a new `ByteArrayInputStream` over the protobuf entity. However, `drainTo` in the baseline implementation does not make this copy in the first place.
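To make the difference between the two baseline paths concrete, here is a minimal sketch. `FakeMessage` is a hypothetical stand-in for a protobuf message, and neither it nor `ReadVsDrainSketch` is part of Alluxio or grpc-java; the sketch only assumes, as the comment above describes, that the `read` path goes through an intermediate byte array while the `drainTo` path writes straight into the sink.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

final class ReadVsDrainSketch {
  /** Hypothetical stand-in for a protobuf message; not a real Alluxio or grpc-java class. */
  static final class FakeMessage {
    private final byte[] payload;

    FakeMessage(byte[] payload) {
      this.payload = payload.clone();
    }

    /** Serializes into a freshly allocated array, i.e. one full copy of the payload. */
    byte[] toByteArray() {
      return payload.clone();
    }

    /** Writes the payload into the target stream without an intermediate array. */
    void writeTo(OutputStream target) throws IOException {
      target.write(payload);
    }
  }

  /** read-style path: wrap a serialized copy in a ByteArrayInputStream, then pump it out. */
  static int viaRead(FakeMessage msg, OutputStream sink) throws IOException {
    InputStream in = new ByteArrayInputStream(msg.toByteArray()); // the extra copy
    byte[] chunk = new byte[4096];
    int total = 0;
    int n;
    while ((n = in.read(chunk)) != -1) {
      sink.write(chunk, 0, n);
      total += n;
    }
    return total;
  }

  /** drainTo-style path: the message writes itself into the sink in one step. */
  static void viaDrainTo(FakeMessage msg, OutputStream sink) throws IOException {
    msg.writeTo(sink);
  }

  public static void main(String[] args) throws IOException {
    FakeMessage msg = new FakeMessage(new byte[] {1, 2, 3, 4, 5});
    ByteArrayOutputStream fromRead = new ByteArrayOutputStream();
    ByteArrayOutputStream fromDrain = new ByteArrayOutputStream();
    viaRead(msg, fromRead);
    viaDrainTo(msg, fromDrain);
    // Both paths deliver the same bytes; only the number of copies differs.
    System.out.println(Arrays.equals(fromRead.toByteArray(), fromDrain.toByteArray()));
  }
}
```

Both paths produce identical output, so a zero-copy marshaller can only beat the `read` path by removing the intermediate array; against `drainTo` there is no such copy to remove.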

Jun 30 '22 00:06 YangchenYe323

(Screenshot: unmarshalling benchmark results)

Above is the benchmark for unmarshalling performance. Counting all the necessary steps of `ReadResponseMarshaller`, its throughput advantage over the baseline is negligible. Either there is a serious flaw in my benchmark implementation, or we should seriously consider whether this hacky, underdocumented, hard-to-maintain code is doing any good.

Jun 30 '22 00:06 YangchenYe323

I've run some experimental gRPC read benchmarks using https://github.com/TachyonNexus/spark-dfsio on my MacBook. The results suggest that there's no significant performance gain from using zero-copy, either for the worker or for the client.

Hardware: 2.3 GHz 8-Core Intel Core i9, 16 GB 2667 MHz DDR4
Alluxio version: current master
Spark version: 3.3.0
Job command:

./bin/spark-submit \
--class alluxio.benchmarks.TestDFSIO \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.scheduler.maxRegisteredResourcesWaitingTime=60s" \
--conf "spark.executor.extraJavaOptions=-Dalluxio.user.block.size.bytes.default=128MB -Dalluxio.user.file.readtype.default=NO_CACHE -Dalluxio.user.file.writetype.default=MUST_CACHE -Dalluxio.user.short.circuit.enabled=false -Dalluxio.user.streaming.zerocopy.enabled=false" \
benchmarks-1.0.0-SNAPSHOT-jar-with-dependencies.jar -p 4 -s 1000 -o wr -b alluxio://192.168.2.20:19998/testdfsio/

Following are test results under different configurations:

| Short Circuit Enabled | Client ZeroCopy Enabled | Worker ZeroCopy Enabled | Read Throughput (MB/s) | Write Throughput (MB/s) |
| --- | --- | --- | --- | --- |
| Yes | - | - | ~266 | ~591 |
| No | No | No | ~250 | ~455 |
| No | Yes | Yes | ~278 | ~482 |
| No | No | Yes | ~288 | ~496 |

Though this is not a statistically rigorous test, my feeling is that the variations are normal fluctuations rather than effects of the zero-copy implementation.
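For reference, the relative differences can be computed from the approximate figures reported above. The sketch below is only arithmetic on those single-run numbers; `ThroughputDelta` and its method are hypothetical helper names, and without repeated runs and a variance estimate the percentages cannot by themselves separate noise from a real effect.

```java
final class ThroughputDelta {
  /** Relative change of an observed value against a baseline, in percent. */
  static double relativeChangePercent(double baseline, double observed) {
    return (observed - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // Baseline row: short circuit off, zero-copy off on both sides (~250 read, ~455 write).
    double readBase = 250;
    double writeBase = 455;
    // Remaining non-short-circuit rows, in the order they are reported.
    double[][] rows = {{278, 482}, {288, 496}};
    for (double[] row : rows) {
      System.out.printf(
          "read %+.1f%%, write %+.1f%%%n",
          relativeChangePercent(readBase, row[0]),
          relativeChangePercent(writeBase, row[1]));
    }
  }
}
```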

Jul 26 '22 00:07 YangchenYe323

Re-ran with `ByteArrayOutputStream` as the final consumer of the serialized stream. It can be seen that `marshalZeroCopy` is still no better than `marshalBaselineDrain`.

Aug 18 '22 03:08 YangchenYe323

After a second look at the code, I realized I made a big mistake in this benchmark, which is why the performance gain cannot be seen: `DataMessageMarshaller` is designed to work with a specific `OutputStream`, namely `BufferChainOutputStream` in grpc-java's internal package. This stream keeps track of a list of buffer references, and `DataMessageMarshaller` avoids a copy by appending the buffer reference it holds directly to that internal list of `BufferChainOutputStream`.
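The idea can be illustrated with a simplified sketch. `ChainSink` below is a hypothetical stand-in and is NOT grpc-java's `BufferChainOutputStream`; likewise the `instanceof` dispatch in `drainTo` is an assumption made for illustration, not how Alluxio actually reaches the internal buffer list. It only shows why a generic sink such as `ByteArrayOutputStream` forces a copy while a reference-tracking sink does not.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class BufferChainSketch {
  /** Hypothetical sink that tracks a list of buffer references (illustration only). */
  static final class ChainSink extends OutputStream {
    private final List<ByteBuffer> chain = new ArrayList<>();
    private long bytesCopied = 0;

    @Override
    public void write(int b) {
      write(new byte[] {(byte) b}, 0, 1);
    }

    /** Ordinary stream path: the bytes are copied into a new buffer owned by the sink. */
    @Override
    public void write(byte[] src, int off, int len) {
      byte[] copy = new byte[len];
      System.arraycopy(src, off, copy, 0, len);
      chain.add(ByteBuffer.wrap(copy));
      bytesCopied += len;
    }

    /** Zero-copy path: only a reference to the caller's buffer contents is recorded. */
    void append(ByteBuffer buf) {
      chain.add(buf.duplicate()); // shares content, independent position
    }

    List<ByteBuffer> chain() {
      return chain;
    }

    long bytesCopied() {
      return bytesCopied;
    }
  }

  /** Marshaller-side sketch: hand over a reference when the sink supports it, else copy. */
  static void drainTo(ByteBuffer data, OutputStream target) throws IOException {
    if (target instanceof ChainSink) {
      ((ChainSink) target).append(data);
    } else {
      // Generic sink: there is no way to pass a reference, so the bytes are copied.
      byte[] tmp = new byte[data.remaining()];
      data.duplicate().get(tmp);
      target.write(tmp);
    }
  }

  public static void main(String[] args) throws IOException {
    ByteBuffer data = ByteBuffer.wrap(new byte[] {10, 20, 30});
    ChainSink chain = new ChainSink();
    drainTo(data, chain);
    System.out.println("bytes copied into chain sink: " + chain.bytesCopied());

    ByteArrayOutputStream plain = new ByteArrayOutputStream();
    drainTo(data, plain);
    System.out.println("bytes copied into plain sink: " + plain.size());
  }
}
```

Under these assumptions, a benchmark that drains into a `ByteArrayOutputStream` always takes the copying branch, which is consistent with the earlier runs showing no zero-copy advantage.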

Aug 24 '22 13:08 YangchenYe323
