alluxio
Add benchmark for ReadResponseMarshaller
What changes are proposed in this pull request?
Added a benchmark for the marshalling/unmarshalling performance of ReadResponseMarshaller in comparison with the baseline marshaller implementation, MessageMarshaller.
Why are the changes needed?
To shed light on the performance characteristics of Alluxio's zero-copy implementation.
Does this PR introduce any user facing changes?
No

It does improve a lot over the read method of the baseline implementation, which constructs a new ByteArrayInputStream over the protobuf entity. However, drainTo in the baseline implementation does not incur this copy in the first place.
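The difference between the two paths can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the actual grpc-java or Alluxio code: `DrainSketch`, `MessageStream`, and the `intermediateCopies` counter are hypothetical names, the local `Drainable` interface only mirrors the shape of `io.grpc.Drainable`, and a plain `byte[]` stands in for the protobuf entity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class DrainSketch {
  /** Local stand-in with the same shape as io.grpc.Drainable (assumption: no gRPC dependency). */
  public interface Drainable {
    int drainTo(OutputStream target) throws IOException;
  }

  /** Hypothetical baseline stream; a byte[] plays the role of the protobuf entity. */
  public static final class MessageStream extends InputStream implements Drainable {
    private final byte[] message;
    private ByteArrayInputStream partial; // created lazily by read()
    public int intermediateCopies = 0;

    public MessageStream(byte[] message) {
      this.message = message;
    }

    @Override
    public int read() {
      if (partial == null) {
        // read path: materialize a serialized copy and wrap it in a new stream
        partial = new ByteArrayInputStream(message.clone());
        intermediateCopies++;
      }
      return partial.read();
    }

    @Override
    public int drainTo(OutputStream target) throws IOException {
      // drainTo path: write straight to the target, no intermediate array
      target.write(message);
      return message.length;
    }
  }

  /** Consumes the stream byte by byte and reports how many intermediate copies were made. */
  public static int copiesViaRead(byte[] message) {
    MessageStream in = new MessageStream(message);
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    int b;
    while ((b = in.read()) != -1) {
      sink.write(b);
    }
    return in.intermediateCopies;
  }

  /** Drains the stream in one call and reports how many intermediate copies were made. */
  public static int copiesViaDrain(byte[] message) {
    MessageStream in = new MessageStream(message);
    try {
      in.drainTo(new ByteArrayOutputStream());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return in.intermediateCopies;
  }

  public static void main(String[] args) {
    byte[] message = new byte[1024];
    System.out.println("read copies: " + copiesViaRead(message));
    System.out.println("drainTo copies: " + copiesViaDrain(message));
  }
}
```

Under these assumptions, a benchmark that compares a zero-copy marshaller against the baseline `read` path measures the cost of the extra array, while a comparison against the baseline `drainTo` path does not.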

I ran some experimental gRPC read benchmarks using https://github.com/TachyonNexus/spark-dfsio on my MacBook. The results suggest that there is no significant performance gain from using zero-copy on either the worker or the client.
- Hardware: 2.3 GHz 8-Core Intel Core i9, 16 GB 2667 MHz DDR4
- Alluxio version: current master
- Spark version: 3.3.0
- Job command:
./bin/spark-submit \
--class alluxio.benchmarks.TestDFSIO \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.scheduler.maxRegisteredResourcesWaitingTime=60s" \
--conf "spark.executor.extraJavaOptions=-Dalluxio.user.block.size.bytes.default=128MB -Dalluxio.user.file.readtype.default=NO_CACHE -Dalluxio.user.file.writetype.default=MUST_CACHE -Dalluxio.user.short.circuit.enabled=false -Dalluxio.user.streaming.zerocopy.enabled=false" \
benchmarks-1.0.0-SNAPSHOT-jar-with-dependencies.jar -p 4 -s 1000 -o wr -b alluxio://192.168.2.20:19998/testdfsio/
Following are test results under different configurations:
Short Circuit Enabled | Client ZeroCopy Enabled | Worker ZeroCopy Enabled | Read Throughput (MB/s) | Write Throughput (MB/s) |
---|---|---|---|---|
Yes | - | - | ~266 | ~591 |
No | No | No | ~250 | ~455 |
No | Yes | Yes | ~278 | ~482 |
No | No | Yes | ~288 | ~496 |
Though it's not a statistically solid test, my feeling is that the variations are normal fluctuations rather than anything correlated with the zero-copy implementation.
Re-ran with ByteArrayOutputStream as the final consumer of the serialized stream. It can be seen that marshalZeroCopy is still no better than marshalBaselineDrain.

After a second look at the code, I realized I made a big mistake in this benchmark, which is why the performance gain cannot be seen: DataMessageMarshaller is designed to work with a specific OutputStream, i.e., BufferChainOutputStream in grpc-java's internal package. This stream keeps track of a list of buffer references, and DataMessageMarshaller avoids the copy by appending the buffer reference it contains directly to that internal list of BufferChainOutputStream.
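To make the mechanism concrete, here is a minimal stdlib-only sketch of the idea. It is not the grpc-java or Alluxio implementation: `BufferChainSketch`, `ChainOutputStream`, `append`, and `marshal` are hypothetical names, and the `instanceof` check is only an assumed way of recognizing the special stream.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class BufferChainSketch {
  /** Hypothetical stand-in for a buffer-chain stream: it tracks buffer references. */
  public static final class ChainOutputStream extends OutputStream {
    public final List<ByteBuffer> buffers = new ArrayList<>();

    @Override
    public void write(int b) {
      write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] src, int off, int len) {
      // Generic path: the caller may reuse src, so the bytes have to be copied.
      byte[] copy = new byte[len];
      System.arraycopy(src, off, copy, 0, len);
      buffers.add(ByteBuffer.wrap(copy));
    }

    /** Zero-copy path: keep a reference to the caller's buffer instead of its bytes. */
    public void append(ByteBuffer buffer) {
      buffers.add(buffer.duplicate()); // duplicate() shares the underlying memory
    }
  }

  /** Writes payload to target and returns how many bytes had to be copied. */
  public static long marshal(ByteBuffer payload, OutputStream target) {
    if (target instanceof ChainOutputStream) {
      // Assumed detection of the special stream: hand over the reference only.
      ((ChainOutputStream) target).append(payload);
      return 0;
    }
    // Any other stream (e.g. ByteArrayOutputStream) forces a real copy.
    byte[] tmp = new byte[payload.remaining()];
    payload.duplicate().get(tmp);
    try {
      target.write(tmp);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return tmp.length;
  }
}
```

In this sketch, a benchmark that drains into a ByteArrayOutputStream always takes the fallback branch, so it cannot show any benefit from the reference-appending path.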