
Add benchmark for ReadResponseMarshaller

Open YangchenYe323 opened this issue 2 years ago • 3 comments

What changes are proposed in this pull request?

Added a benchmark that compares the marshalling/unmarshalling performance of ReadResponseMarshaller against the baseline marshaller implementation, MessageMarshaller.

Why are the changes needed?

Shed light on the performance characteristics of Alluxio's zero-copy implementation.

Does this PR introduce any user facing changes?

No

YangchenYe323, Jun 29 '22 23:06

[Screenshot: Screen Shot 2022-06-29 at 8 31 35 PM]

A tentative run of this benchmark shows that `ReadResponseMarshaller` actually does worse on the `drainTo` method than the baseline implementation.

It does improve a lot over the `read` method of the baseline implementation, which constructs a new ByteArrayInputStream over the protobuf entity. However, `drainTo` in the baseline implementation does not incur this copy in the first place.
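The difference between the two baseline paths can be sketched without gRPC or protobuf on the classpath. The following is a minimal model, not the real grpc-java code: `SketchMessage`, `SketchMessageStream`, and `ReadVsDrainSketch` are hypothetical names introduced only for illustration. It assumes the `read` path first materializes the serialized message into a temporary array, while the `drainTo` path writes directly into the target.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a protobuf message: it knows how to serialize itself. */
interface SketchMessage {
  void writeTo(OutputStream out) throws IOException;
  int serializedSize();
}

/** Simplified model of a baseline marshaller stream (an assumption, not the grpc-java class). */
class SketchMessageStream extends InputStream {
  private final SketchMessage message;
  private ByteArrayInputStream partial; // created lazily, only on the read() path
  int temporaryArrays = 0;              // counts materializations, for illustration only

  SketchMessageStream(SketchMessage message) {
    this.message = message;
  }

  /** read() path: serialize into a temporary array first, then serve bytes from it. */
  @Override
  public int read() throws IOException {
    if (partial == null) {
      ByteArrayOutputStream tmp = new ByteArrayOutputStream(message.serializedSize());
      message.writeTo(tmp);
      partial = new ByteArrayInputStream(tmp.toByteArray());
      temporaryArrays++;
    }
    return partial.read();
  }

  /** drainTo() path: serialize straight into the target, with no temporary array. */
  public int drainTo(OutputStream target) throws IOException {
    message.writeTo(target);
    return message.serializedSize();
  }
}

class ReadVsDrainSketch {
  /** Builds a toy message whose serialized form is just the given payload. */
  static SketchMessage of(byte[] payload) {
    return new SketchMessage() {
      @Override
      public void writeTo(OutputStream out) throws IOException {
        out.write(payload);
      }

      @Override
      public int serializedSize() {
        return payload.length;
      }
    };
  }

  public static void main(String[] args) throws IOException {
    byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);

    SketchMessageStream viaRead = new SketchMessageStream(of(payload));
    ByteArrayOutputStream readSink = new ByteArrayOutputStream();
    for (int b = viaRead.read(); b != -1; b = viaRead.read()) {
      readSink.write(b);
    }

    SketchMessageStream viaDrain = new SketchMessageStream(of(payload));
    ByteArrayOutputStream drainSink = new ByteArrayOutputStream();
    viaDrain.drainTo(drainSink);

    System.out.println("temporary arrays on read path:  " + viaRead.temporaryArrays);
    System.out.println("temporary arrays on drain path: " + viaDrain.temporaryArrays);
  }
}
```

Under this model a zero-copy marshaller can only beat the `read` path, because the `drainTo` path never creates the temporary array it is trying to avoid.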

YangchenYe323, Jun 30 '22 00:06

[Screenshot: Screen Shot 2022-06-29 at 8 49 51 PM]

Above is the benchmark for unmarshalling performance. Counting all the necessary steps of `ReadResponseMarshaller`, its throughput advantage over the baseline is negligible. Either there's a serious flaw in my benchmark implementation, or we should seriously consider whether this piece of hacky, underdocumented, hard-to-maintain code is doing any good.

YangchenYe323, Jun 30 '22 00:06

I've run some experimental gRPC read benchmarks using https://github.com/TachyonNexus/spark-dfsio on my MacBook. The results suggest that there's no significant performance gain from using zero-copy on either the worker or the client.

  • Hardware: 2.3 GHz 8-Core Intel Core i9, 16 GB 2667 MHz DDR4
  • Alluxio version: current master
  • Spark version: 3.3.0
  • Job command:
```
./bin/spark-submit \
--class alluxio.benchmarks.TestDFSIO \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.scheduler.maxRegisteredResourcesWaitingTime=60s" \
--conf "spark.executor.extraJavaOptions=-Dalluxio.user.block.size.bytes.default=128MB -Dalluxio.user.file.readtype.default=NO_CACHE -Dalluxio.user.file.writetype.default=MUST_CACHE -Dalluxio.user.short.circuit.enabled=false -Dalluxio.user.streaming.zerocopy.enabled=false" \
benchmarks-1.0.0-SNAPSHOT-jar-with-dependencies.jar -p 4 -s 1000 -o wr -b alluxio://192.168.2.20:19998/testdfsio/
```

Following are test results under different configurations:

| Short Circuit Enabled | Client ZeroCopy Enabled | Worker ZeroCopy Enabled | Read Throughput (MB/s) | Write Throughput (MB/s) |
|---|---|---|---|---|
| Yes | - | - | ~266 | ~591 |
| No | No | No | ~250 | ~455 |
| No | Yes | Yes | ~278 | ~482 |
| No | No | Yes | ~288 | ~496 |

Though this is not a statistically solid test, my feeling is that the variations are normal fluctuations rather than an effect of the zero-copy implementation.
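To put a number on those variations, the short sketch below computes the relative change of each zero-copy configuration against the row with short circuit and zero-copy all disabled (~250 MB/s read, ~455 MB/s write). The class and method names are hypothetical, and the inputs are the approximate single-run figures from the table above, so the percentages are only as reliable as those figures.

```java
class ThroughputDeltaSketch {
  /** Relative change of a measurement against a baseline, in percent. */
  static double relativeChangePercent(double baseline, double measured) {
    return (measured - baseline) / baseline * 100.0;
  }

  public static void main(String[] args) {
    // Baseline row: short circuit off, client and worker zero-copy off.
    double baselineRead = 250.0;
    double baselineWrite = 455.0;

    // Remaining non-short-circuit rows, as {read MB/s, write MB/s}.
    String[] labels = {"client+worker zero-copy", "worker-only zero-copy"};
    double[][] rows = {{278.0, 482.0}, {288.0, 496.0}};

    for (int i = 0; i < rows.length; i++) {
      System.out.printf(
          "%s: read %+.1f%%, write %+.1f%%%n",
          labels[i],
          relativeChangePercent(baselineRead, rows[i][0]),
          relativeChangePercent(baselineWrite, rows[i][1]));
    }
  }
}
```

With one run per configuration there is no variance estimate, so differences of this size cannot be separated from run-to-run noise without repeating each configuration several times.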

YangchenYe323, Jul 26 '22 00:07

Re-ran with a `ByteArrayOutputStream` as the final consumer of the serialized stream. As the screenshot below shows, `marshalZeroCopy` is still no better than `marshalBaselineDrain`.

[Screenshot: Screen Shot 2022-08-17 at 5 01 29 PM]

YangchenYe323, Aug 18 '22 03:08

After a second look at the code, I realized I made a big mistake in this benchmark, which is why the performance gain could not be seen: DataMessageMarshaller is designed to work with a specific OutputStream, namely BufferChainOutputStream in grpc-java's internal package. This stream keeps track of a list of buffer references, and DataMessageMarshaller avoids the copy by appending the buffer reference it holds directly to that internal list of BufferChainOutputStream.
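A simplified model of that mechanism, using only the JDK, might look like the sketch below. `BufferListSink` and `ZeroCopySinkSketch` are hypothetical stand-ins, not the real `BufferChainOutputStream` or `DataMessageMarshaller`, and the `instanceof` check is only an assumption about how the two paths could be distinguished. The point it illustrates is that the copy is avoided only when the target keeps buffer references; a generic target such as `ByteArrayOutputStream` always takes the copying fallback.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sink that, like the stream described above, tracks a list of buffer references. */
class BufferListSink extends OutputStream {
  final List<ByteBuffer> buffers = new ArrayList<>();

  /** Ordinary path: the bytes are copied into a buffer owned by the sink. */
  @Override
  public void write(byte[] src, int off, int len) {
    ByteBuffer copy = ByteBuffer.allocate(len);
    copy.put(src, off, len);
    copy.flip();
    buffers.add(copy);
  }

  @Override
  public void write(int b) {
    write(new byte[] {(byte) b}, 0, 1);
  }

  /** Zero-copy path: keep a reference to the caller's buffer instead of copying it. */
  void appendReference(ByteBuffer data) {
    buffers.add(data);
  }
}

class ZeroCopySinkSketch {
  /**
   * Marshals by reference when the sink supports it, otherwise falls back to copying.
   * Returns true if no bytes were copied.
   */
  static boolean marshal(ByteBuffer data, OutputStream target) throws IOException {
    if (target instanceof BufferListSink) {
      ((BufferListSink) target).appendReference(data);
      return true;
    }
    // Generic sink: the payload has to be copied out of the buffer and into the stream.
    byte[] tmp = new byte[data.remaining()];
    data.duplicate().get(tmp); // duplicate() leaves the caller's position untouched
    target.write(tmp);
    return false;
  }

  public static void main(String[] args) throws IOException {
    ByteBuffer data = ByteBuffer.wrap(new byte[] {1, 2, 3});

    boolean zeroCopy = marshal(data, new BufferListSink());
    boolean zeroCopyToByteArray = marshal(data, new ByteArrayOutputStream());

    System.out.println("reference-tracking sink avoided copy: " + zeroCopy);
    System.out.println("ByteArrayOutputStream avoided copy:   " + zeroCopyToByteArray);
  }
}
```

In this model a benchmark that drains into a `ByteArrayOutputStream` only ever measures the fallback branch, which is consistent with the earlier runs showing no gain.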

YangchenYe323, Aug 24 '22 13:08