io_uring-based implementation of b3sum
Hi all! I wrote an io_uring-based implementation of b3sum here: https://github.com/1f604/liburing_b3sum
I wrote two versions: a single-threaded version in C and a multi-threaded version in C++. On my system, the single-threaded version is around 25% faster than the official Rust b3sum and slightly faster than both cat to /dev/null and fio. It hashes a 10 GiB file in 2.899s, which works out to around 3533 MiB/s, roughly the advertised read speed of my NVMe drive ("3500MB/s"). The multi-threaded implementation is around 1% slower than the single-threaded one.
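To give a rough idea of the approach, here is a minimal sketch (not the actual code from the repository; the 1 MiB buffer size and queue depth of 8 are arbitrary choices) that reads a file sequentially through io_uring and feeds each completed buffer into the BLAKE3 C hasher:

```c
/* Simplified sketch: sequential io_uring reads feeding the BLAKE3 C hasher.
 * Only one read is kept in flight here for clarity. */
#include <liburing.h>
#include "blake3.h"
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20) /* 1 MiB per read; an arbitrary choice */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) { perror("io_uring_queue_init"); return 1; }

    unsigned char *buf = malloc(BUF_SIZE);
    blake3_hasher hasher;
    blake3_hasher_init(&hasher);

    off_t offset = 0;
    for (;;) {
        /* Queue one sequential read at the current file offset. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BUF_SIZE, offset);
        io_uring_submit(&ring);

        /* Wait for the completion and check how many bytes were read. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        int n = cqe->res;
        io_uring_cqe_seen(&ring, cqe);

        if (n < 0) { fprintf(stderr, "read failed: %d\n", n); return 1; }
        if (n == 0) break; /* EOF */

        /* Hash the bytes just read; BLAKE3's incremental hasher keeps state. */
        blake3_hasher_update(&hasher, buf, (size_t)n);
        offset += n;
    }

    uint8_t out[BLAKE3_OUT_LEN];
    blake3_hasher_finalize(&hasher, out, BLAKE3_OUT_LEN);
    for (int i = 0; i < BLAKE3_OUT_LEN; i++) printf("%02x", out[i]);
    printf("  %s\n", argv[1]);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return 0;
}
```

Written like this, with only one read in flight, it has no advantage over plain read(); the gain comes from keeping the submission queue full so the drive keeps streaming data while the CPU hashes the buffers it has already received.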
Benchmarks
For these tests, I used the same 1 GiB (or 10 GiB) input file and always flushed the page cache before each test, thus ensuring that the programs are always reading from disk. Each command was run 10 times and I used the "real" result from time to calculate the statistics. I ran these commands on a Debian 12 system (uname -r returns "6.1.0-9-amd64") using ext4 without disk encryption and without LVM.
| Command | Min | Median | Max |
|---|---|---|---|
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 1 | 0.404s | 0.4105s | 0.416s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 2 | 0.474s | 0.4755s | 0.481s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 3 | 0.44s | 0.4415s | 0.451s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 4 | 0.443s | 0.4475s | 0.452s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 5 | 0.454s | 0.4585s | 0.462s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 6 | 0.456s | 0.4605s | 0.463s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 7 | 0.461s | 0.4635s | 0.468s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --num-threads 8 | 0.461s | 0.464s | 0.47s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time b3sum 1GB.txt --no-mmap | 0.381s | 0.386s | 0.394s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./b3sum_linux 1GB.txt --no-mmap | 0.379s | 0.39s | 0.404s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 1GB.txt \| ./example | 0.364s | 0.3745s | 0.381s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 1GB.txt > /dev/null | 0.302s | 0.302s | 0.303s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K \| ./example | 0.338s | 0.341s | 0.348s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=64K of=/dev/null | 0.303s | 0.306s | 0.308s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M \| ./example | 0.538s | 0.5415s | 0.544s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time dd if=1GB.txt bs=2M of=/dev/null | 0.302s | 0.303s | 0.304s |
| fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=1g --blocksize=512k --ioengine=io_uring --fsync=10000 --iodepth=2 --direct=1 --numjobs=1 --runtime=60 --group_reporting | 0.302s | 0.3025s | 0.303s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 512 2 1 0 2 0 0 | 0.301s | 0.301s | 0.302s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_multithread 1GB.txt 512 2 1 0 2 0 0 | 0.303s | 0.304s | 0.305s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 1GB.txt 128 20 0 0 8 0 0 | 0.375s | 0.378s | 0.384s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_multithread 1GB.txt 128 20 0 0 8 0 0 | 0.304s | 0.305s | 0.307s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time xxhsum 1GB.txt | 0.318s | 0.3205s | 0.325s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time cat 10GB.txt > /dev/null | 2.903s | 2.904s | 2.908s |
| echo 1 > /proc/sys/vm/drop_caches; sleep 1; time ./liburing_b3sum_singlethread 10GB.txt 512 4 1 0 4 0 0 | 2.898s | 2.899s | 2.903s |
In the table above, liburing_b3sum_singlethread and liburing_b3sum_multithread are my own io_uring-based implementations of b3sum (more details below), and I verified that my b3sum implementations always produced the same BLAKE3 hash output as the official b3sum implementation. The 1GB.txt file was generated using this command:
dd if=/dev/urandom of=1GB.txt bs=1G count=1
I installed b3sum using this command:
cargo install b3sum
$ b3sum --version
b3sum 1.4.1
I downloaded the b3sum_linux program from the BLAKE3 Github Releases page (it was the latest Linux binary):
$ ./b3sum_linux --version
b3sum 1.4.1
I compiled the example program from the example.c file in the C directory of the BLAKE3 repository, as per the instructions there:
gcc -O3 -o example example.c blake3.c blake3_dispatch.c blake3_portable.c \
    blake3_sse2_x86-64_unix.S blake3_sse41_x86-64_unix.S blake3_avx2_x86-64_unix.S \
    blake3_avx512_x86-64_unix.S
I installed xxhsum using this command:
apt install xxhash
$ xxhsum --version
xxhsum 0.8.1 by Yann Collet
compiled as 64-bit x86_64 autoVec little endian with GCC 11.2.0
Note
Note that, as the table above shows, the single-threaded version needs O_DIRECT in order to be fast (the flag that controls whether or not to use O_DIRECT is the third number after the filename in the command-line arguments). The multi-threaded version is fast even without O_DIRECT: as the table shows, it hashes a 1 GiB file in 0.304s with O_DIRECT and 0.305s without it. For more details, see article.md in the repository; the same article is also mirrored elsewhere with somewhat nicer formatting than GitHub.
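For readers unfamiliar with O_DIRECT: it makes the kernel bypass the page cache, but the buffer address, read length, and file offset must all be aligned to the device's logical block size, so the buffer has to come from something like posix_memalign rather than plain malloc. Below is a small sketch of that setup, not code from the repository; the 4096-byte alignment and 512 KiB buffer size are illustrative assumptions:

```c
/* Sketch: open a file with O_DIRECT and allocate an aligned read buffer.
 * The 4096-byte alignment is an assumption; the real requirement is the
 * logical block size of the device being read. */
#define _GNU_SOURCE            /* needed for O_DIRECT on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

enum { ALIGNMENT = 4096, BUF_SIZE = 512 * 1024 };

static int open_for_direct_io(const char *path, void **buf_out)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open(O_DIRECT)"); return -1; }

    /* posix_memalign returns an address that is a multiple of ALIGNMENT,
     * which plain malloc does not guarantee. */
    if (posix_memalign(buf_out, ALIGNMENT, BUF_SIZE) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        close(fd);
        return -1;
    }
    return fd;
}
```

Presumably this is why O_DIRECT matters so much for the single-threaded case: with buffered reads, the same thread has to both copy data out of the page cache and hash it, whereas direct I/O hands the data straight to the user buffer.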
I should also mention that my implementation does sequential reads from disk and uses the BLAKE3 C library, which only exposes a single-threaded hasher, so it isn't capable of hashing on multiple cores.
I would very much appreciate any feedback!