Significant Read Performance Improvements
I recently noticed that asynchronous-codec was performing slower than anticipated in a project I'm working on, particularly for read operations. To get a better sense of things, I compared its performance against tokio-codec, which delivered the throughput I had been expecting for my workload.
After some digging and experimentation, I've ended up with two changes that, when combined, appear to bring asynchronous-codec's read performance effectively on par with tokio-codec.
Here's a quick overview of the changes:
- **Configurable Read Buffer Capacity:** I've introduced a new constructor, `FramedRead::with_capacity`, which allows users to specify the initial size of the internal read buffer. Previously, this was hardcoded to 8 KiB. This is similar to what `tokio-codec` does and allows fine-tuning for different workloads.
- **Zero-Copy Reads:** The internal read mechanism has been optimized to avoid an unnecessary data copy. Data is now read directly from the underlying `AsyncRead` source into `FramedRead`'s internal `BytesMut` buffer. This eliminates an intermediate allocation and copy for each read operation. However, it does require some `unsafe` Rust, and it relies on the de-facto contract that `futures::io::AsyncRead` implementations will only write to the provided buffer and never read from its potentially uninitialized parts.
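To illustrate the idea, here is a minimal, std-only sketch of the uninitialized-buffer technique. This is *not* the actual `FramedRead` internals (the helper name is made up, and it uses a plain `Vec<u8>` and blocking `Read` instead of `BytesMut` and `AsyncRead`); it just shows the same pattern: reserve spare capacity, hand that uninitialized region to the reader, and only afterwards mark the written bytes as initialized.

```rust
use std::io::Read;

/// Read directly into the uninitialized spare capacity of `buf`,
/// avoiding a separate intermediate buffer and copy.
fn read_into_spare_capacity<R: Read>(
    reader: &mut R,
    buf: &mut Vec<u8>,
    additional: usize,
) -> std::io::Result<usize> {
    buf.reserve(additional);
    let len = buf.len();
    // SAFETY: this relies on the same de-facto contract discussed above:
    // the reader only *writes* into the slice and never reads its
    // (uninitialized) contents.
    let spare =
        unsafe { std::slice::from_raw_parts_mut(buf.as_mut_ptr().add(len), additional) };
    let n = reader.read(spare)?;
    // Only the `n` bytes the reader actually wrote are now initialized.
    unsafe { buf.set_len(len + n) };
    Ok(n)
}

fn main() -> std::io::Result<()> {
    let mut src: &[u8] = b"hello zero-copy";
    let mut buf = Vec::new();
    let n = read_into_spare_capacity(&mut src, &mut buf, 64)?;
    println!("{} {}", n, String::from_utf8_lossy(&buf));
    Ok(())
}
```

The safe alternative would be to zero-fill the spare capacity before every read, which is exactly the per-read memset cost the optimization avoids.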
Some benchmarks to illustrate the impact of these changes:
Performance Benchmarks
The benchmark involves reading a 3 GiB file from fast, local NVMe storage. `BytesCodec` is used to frame the data.
Debug Build Benchmarks
| Version | Buffer Size | Average Throughput (MiB/s) | Diff vs. Unmodified (%) | Diff vs. tokio-codec (64KiB) (%) |
|---|---|---|---|---|
| async-codec (unmodified) | 8KiB | 475.76 | 0.0% | -83.8% |
| async-codec (configurable buffer) | 8KiB | 500.92 | 5.3% | -82.9% |
| async-codec (configurable buffer) | 64KiB | 2297.17 | 382.8% | -21.7% |
| async-codec (configurable + zero copy) | 8KiB | 536.70 | 12.8% | -81.7% |
| async-codec (configurable + zero copy) | 64KiB | 2932.17 | 516.3% | 0.0% |
| tokio-codec | 8KiB | 541.91 | 13.9% | -81.5% |
| tokio-codec | 64KiB | 2932.17 | 516.3% | 0.0% |
Release Build Benchmarks
| Version | Buffer Size | Average Throughput (MiB/s) | Diff vs. Unmodified (%) | Diff vs. tokio-codec (64KiB) (%) |
|---|---|---|---|---|
| async-codec (unmodified) | 8KiB | 841.58 | 0.0% | -78.0% |
| async-codec (configurable buffer) | 8KiB | 871.21 | 3.5% | -77.2% |
| async-codec (configurable buffer) | 64KiB | 3266.56 | 288.1% | -14.5% |
| async-codec (configurable + zero copy) | 8KiB | 886.15 | 5.3% | -76.8% |
| async-codec (configurable + zero copy) | 64KiB | 3966.25 | 371.3% | 3.8% |
| tokio-codec | 8KiB | 906.72 | 7.7% | -76.3% |
| tokio-codec | 64KiB | 3819.52 | 353.8% | 0.0% |
For my workload, reads are now around 500% faster than before. Overall, asynchronous-codec's read performance is now on par with tokio-codec.
Thank you for the work @rrauch.
I no longer use asynchronous-codec myself, and thus don't actively maintain it.
Maybe you want to create a fork. Happy to link to the various alternatives and archive this project.
// CC @jxs since you are using asynchronous-codec as well:
https://github.com/libp2p/rust-libp2p/blob/70082df7e6181722630eabc5de5373733aac9a21/Cargo.lock#L310-L321
Hi @mxinden, thanks for the ping! Can we then move this repo to the libp2p org, and can you grant publishing rights?
@jxs done. Made you an owner and transferred to libp2p GitHub organization.
Hi, thanks for looking into this! Left a comment. Can you also share the benchmark code? Cheers!
Sorry, it was just some throwaway code that I didn't keep.
Here is roughly what it did:
```rust
use std::path::Path;
use std::time::SystemTime;

use asynchronous_codec::{BytesCodec, FramedRead};
use futures::TryStreamExt;
use tokio_util::compat::TokioAsyncReadCompatExt;

pub async fn benchmark(path: impl AsRef<Path>) -> anyhow::Result<()> {
    let path = path.as_ref().to_path_buf();
    for i in 1..=10 {
        println!("iteration {}", i);
        let mut file = tokio::fs::File::open(&path).await?;
        let file_size = file.metadata().await?.len();
        let buf_size = 64 * 1024usize;
        file.set_max_buf_size(buf_size);
        let start_time = SystemTime::now();
        let mut reader = FramedRead::with_capacity(file.compat(), BytesCodec, buf_size);
        let mut bytes_read = 0u64;
        while let Some(chunk) = reader.try_next().await? {
            bytes_read += chunk.len() as u64;
        }
        let duration = SystemTime::now().duration_since(start_time)?;
        println!("read {} bytes in {} ms", bytes_read, duration.as_millis());
        if file_size != bytes_read {
            panic!("incorrect number of bytes read");
        }
        println!();
    }
    Ok(())
}
```
I made some additional changes to this PR:
- the buffer allocation logic is now simpler and cleaner
- I added buffer initialization to externally supplied buffers. This was an oversight in the prior version.
- added the `set_capacity` method to allow changing the buffer size at any time
- added the unsafe `disable_buffer_initialization` option I suggested above
You can now easily compare the performance difference between using uninitialized and initialized buffers. The outcome will depend heavily on the workload, hardware, etc.
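To make that trade-off concrete, here is a std-only sketch of the *initialized* path, i.e. what the default (safe) behavior corresponds to. Again, this is not the crate's actual internals; it just shows that zero-filling the spare capacity before each read costs a memset per call, which is precisely what an option like `disable_buffer_initialization` would skip in exchange for `unsafe`.

```rust
use std::io::Read;

/// Safe read into a `Vec`'s spare capacity: zero-initialize first,
/// then read, then trim back to the bytes actually written.
fn read_initialized<R: Read>(
    reader: &mut R,
    buf: &mut Vec<u8>,
    additional: usize,
) -> std::io::Result<usize> {
    let len = buf.len();
    // The memset cost is paid here on every call.
    buf.resize(len + additional, 0);
    let n = reader.read(&mut buf[len..])?;
    buf.truncate(len + n);
    Ok(n)
}

fn main() -> std::io::Result<()> {
    let mut src: &[u8] = b"abc";
    let mut buf = Vec::new();
    let n = read_initialized(&mut src, &mut buf, 16)?;
    println!("{} {:?}", n, buf);
    Ok(())
}
```

Whether the memset shows up in profiles depends on read size and frequency, which is why having both modes available for benchmarking is useful.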