Performance
@oyvindln, can you please give a link for your performance tests?
- adler32(crc) vectorization. I rewrote the shortest version.
- Also, when rewriting I skipped copying fields from the compressor/decompressor onto the stack, as miniz does. Check whether that matters.
- In find_match, check whether bit magic with u64 (instead of u16) can boost speed by finding the first differing byte in the XOR result via a leading-zeros count (see the sketch below).
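A rough sketch of what that last idea could look like (a hypothetical helper, not actual miniz_oxide code; note that with little-endian u64 loads the equivalent trick is a trailing-zeros count on the XOR):

```rust
// Hypothetical sketch: compare eight bytes of the dictionary against the
// current position at once and locate the first mismatch via XOR.
// With little-endian u64 loads the lowest byte is the earliest position,
// so trailing zeros (rather than leading zeros) give the first differing byte.
fn first_mismatch_u64(a: &[u8; 8], b: &[u8; 8]) -> usize {
    let x = u64::from_le_bytes(*a);
    let y = u64::from_le_bytes(*b);
    let diff = x ^ y;
    if diff == 0 {
        8 // all eight bytes match; the caller advances and compares the next word
    } else {
        (diff.trailing_zeros() / 8) as usize // index of the first differing byte
    }
}
```

Whether this actually beats the current u16 probes would have to be benchmarked.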
The benchmarks I posted in the flate2 issue were [these], from my deflate encoder. I simply replaced flate2 with flate2-oxide. (I don't think there is an easy way to use two versions of the same crate at the same time without renaming one of them.)
I also have this project, which I've used. We could add miniz-oxide to that, though for decompression it would probably be nice to alter it to do several runs per file and take the average, as the noise tends to be a bit high.
For adler32, there is the adler32 crate we could use.
It turns out that comparing directly with miniz-backed flate2 is a bit complicated, as the compiler doesn't like two crates linking to two versions of a C library with the same name. Maybe we could hide the parts that still link to C code behind a feature.
Though at least it's quite simple to compare against other pure-Rust libraries with the compression-tester binary by simply swapping flate2 for flate2-oxide.
I faced the same issue when I was setting up fuzzing; I solved it with objcopy, adding a c_ prefix to every symbol in one version of miniz. See: https://github.com/Frommi/miniz_oxide/blob/master/build_fuzz.sh
Ah, that's nice. Then we could do the same for performance comparisons.
I managed to improve match copying in tinfl_oxide a bit. Loading/storing u64 and other larger types didn't seem to help, but adding a special case for length-3 matches helped a little. Using a local variable as the loop counter for overlapping matches also helped a bit, so I think it might be worth making local copies of more values, as the C code does, so the compiler can keep them in registers when it finds that beneficial.
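For illustration, the shape of that change is roughly this (a simplified sketch, not the actual tinfl_oxide code; it assumes matches are copied from the already-written output):

```rust
// Simplified sketch of the match-copy idea: special-case the very common
// length-3 match and use a plain local counter for the general copy loop
// so the compiler can keep it in a register.
fn copy_match(out: &mut Vec<u8>, dist: usize, len: usize) {
    debug_assert!(dist >= 1 && dist <= out.len());
    let start = out.len() - dist;
    if len == 3 && dist >= 3 {
        // Non-overlapping length-3 match: three straight byte copies,
        // no loop setup at all.
        let (b0, b1, b2) = (out[start], out[start + 1], out[start + 2]);
        out.push(b0);
        out.push(b1);
        out.push(b2);
        return;
    }
    // General (and overlapping) case: byte by byte, so matches with
    // dist < len correctly repeat bytes written earlier in this same copy.
    let mut i = 0; // local counter instead of re-reading a struct field
    while i < len {
        let b = out[start + i];
        out.push(b);
        i += 1;
    }
}
```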
decompress_oxide's main loop isn't very optimal: looking at the asm, there are, for example, two table jumps for Action::None with some other instructions sprinkled in between, when only one jump to the start of the state is needed, and in some cases that jump could land even further into the state's code.
This loop improves performance noticeably, since it is essentially a for loop, but it reduces readability.
The main path of DECODE_LITLEN --> HUFF_DECODE_OUTER_LOOP1 --> [READ_EXTRA_BITS_LITLEN] --> DECODE_DISTANCE --> [READ_EXTRA_BITS_DISTANCE] --> HUFF_DECODE_OUTER_LOOP2 certainly has huge overhead for small matches. The main literal writing in DECODE_LITLEN suffers as well.
Yeah, C code can jump to arbitrary places using switch and goto, something miniz and many other C/C++ decoders make use of. We need to find some way around that in Rust.
We could change Action::None to continue, Action::Next to break in the loops over states, and Action::End to break 'outer, if it weren't for the closures. So maybe change them to macros?
Yeah we could probably abuse loops a bit.
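To make that concrete, here is a heavily simplified illustration (not the real miniz_oxide state machine) of emulating C's switch + goto with labeled loops: continue re-dispatches on the current state, break advances to the per-iteration bookkeeping, and break 'outer leaves the decoder entirely:

```rust
enum State {
    DecodeLitlen,
    DecodeDistance,
    Done,
}

fn run(mut state: State) {
    'outer: loop {
        // The inner loop plays the role of the C switch that gotos jump back into.
        loop {
            match state {
                State::DecodeLitlen => {
                    // ... decode a literal or a length code ...
                    state = State::DecodeDistance;
                    continue; // jump straight to the next state, no Action round-trip
                }
                State::DecodeDistance => {
                    // ... decode the distance and copy the match ...
                    state = State::Done;
                    break; // fall out to the outer loop's bookkeeping
                }
                State::Done => break 'outer,
            }
        }
        // ... per-iteration bookkeeping: refill input, flush output, etc. ...
    }
}

fn main() {
    run(State::DecodeLitlen);
}
```

Macros would mostly just hide the continue/break plumbing; the generated control flow should end up the same.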
Seems like we're still a bit behind on 32-bit (tested on an old first-generation Core Duo laptop):
Running target/release/deps/bench-c6ee7e382af50fa6
running 13 tests
test compress_default ... bench: 38,803,194 ns/iter (+/- 167,060)
test compress_fast ... bench: 10,655,691 ns/iter (+/- 21,273)
test compress_high ... bench: 68,385,352 ns/iter (+/- 125,997)
test compress_mem_to_heap_default_miniz ... bench: 30,819,206 ns/iter (+/- 83,438)
test compress_mem_to_heap_default_oxide ... bench: 38,946,724 ns/iter (+/- 128,007)
test compress_mem_to_heap_fast_miniz ... bench: 11,878,490 ns/iter (+/- 101,661)
test compress_mem_to_heap_fast_oxide ... bench: 15,257,433 ns/iter (+/- 168,803)
test compress_mem_to_heap_high_miniz ... bench: 51,414,234 ns/iter (+/- 154,822)
test compress_mem_to_heap_high_oxide ... bench: 68,431,216 ns/iter (+/- 186,737)
test decompress ... bench: 4,383,233 ns/iter (+/- 11,595)
test decompress_mem_to_heap_miniz ... bench: 2,684,700 ns/iter (+/- 25,511)
test zlib_compress_fast ... bench: 10,932,038 ns/iter (+/- 42,275)
test zlib_decompress ... bench: 4,592,048 ns/iter (+/- 13,105)
Well, yes. We use 64-bit buffers and even 64-bit bit magic regardless of the machine's bitness. There are ifdefs in miniz to fall back to byte-by-byte style functions on 32-bit.
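In Rust, the equivalent of those ifdefs would be a cfg-selected type, along these lines (the name here is hypothetical, not the current miniz_oxide code):

```rust
// Pick the bit-buffer width based on the target's pointer width so that
// 32-bit machines don't pay for 64-bit shifts and loads on every refill.
#[cfg(target_pointer_width = "64")]
type BitBuffer = u64;
#[cfg(not(target_pointer_width = "64"))]
type BitBuffer = u32;
```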
I don't think decompression uses 64-bit buffers much, so I'm not sure why the difference is so massive there. Compression is more in line with expectations (actually, I expected more of a difference there, given that we use some 64-bit stuff).
Also interesting: using RUSTFLAGS="-C target-cpu=native" made a significant difference for the default/high compression levels (from about 68M ns to 54M ns for compress_mem_to_heap_high_oxide), but not a huge difference for the other tests.
Added some loops and local vars. That closes about half the gap to miniz's decompression time.
Nice work! It helped a fair bit on the 32-bit machine too:
running 3 tests
test decompress ... bench: 3,832,808 ns/iter (+/- 18,652)
test decompress_mem_to_heap_miniz ... bench: 2,696,180 ns/iter (+/- 33,146)
test zlib_decompress ... bench: 4,079,228 ns/iter (+/- 23,350)
I don't think decompression uses much 64-bit buffers
Actually, I just realized the Action enum is 64 bits wide (since the status and state are 32-bit values). I don't know whether this has an impact on 32-bit or not. EDIT: Changing the data types to make it smaller didn't seem to make any difference on 32-bit.
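For reference, the size is easy to confirm with size_of; assuming the variants carry 32-bit state/status payloads (the exact shape below is an assumption), the enum comes out at 8 bytes:

```rust
use std::mem::size_of;

// Assumed shape for illustration: a unit variant plus two variants
// carrying 32-bit payloads, as described above.
#[allow(dead_code)]
enum Action {
    None,
    Jump(u32), // stand-in for a 32-bit state value
    End(u32),  // stand-in for a 32-bit status value
}

fn main() {
    // Tag + 32-bit payload + padding => 8 bytes on typical targets.
    println!("size_of::<Action>() = {}", size_of::<Action>());
}
```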
With my last commit, the decompression benchmark is now faster than miniz on my machine!
running 4 tests
test decompress ... bench: 740,342 ns/iter (+/- 50,835)
test decompress_mem_to_heap_miniz ... bench: 822,743 ns/iter (+/- 34,919)
test decompress_mem_to_heap_oxide ... bench: 739,122 ns/iter (+/- 38,805)
test zlib_decompress ... bench: 838,384 ns/iter (+/- 12,919)
It also helped a fair bit on my rpi3 (though it's still not quite as fast as miniz there):
running 4 tests
test decompress ... bench: 4,280,570 ns/iter (+/- 10,271)
test decompress_mem_to_heap_miniz ... bench: 3,249,473 ns/iter (+/- 10,988)
test decompress_mem_to_heap_oxide ... bench: 4,309,158 ns/iter (+/- 25,855)
test zlib_decompress ... bench: 4,597,016 ns/iter (+/- 30,478)
Have you looked at libdeflate? It has the fastest decompression that I've seen: https://quixdb.github.io/squash-benchmark/unstable
This is somewhat off topic, but an impl period is looming (next Monday), and I think miniz-oxide could be a perfect crate to contribute to: it should be possible to write comprehensive tests and benchmarks, so that cargo test and cargo bench guarantee that it works correctly and fast, and then you could easily throw the whole Rust community at optimizing the bench numbers while keeping the tests green.
Unfortunately, I don't have time to do the work required to make this crate super contributor-friendly myself :(
@jrmuizel I am aware of the library, it's probably worth looking into for potential ideas for improving performance further. Granted, libdeflate doesn't support streaming, so the use case is a bit different.
@matklad Yeah, that sounds like an idea; or at the very least, we could put something in the looking-for-contributors thread on the forums.
@oyvindln these are very cool results! I will have time in the near future, so I will be able to hop back in and help. I have quite a bit of change history to go through first, though :)
Sadly, benches on Travis CI are very unstable. For example, x 1.52 changed to x 1.21 even though decompression wasn't touched, while local benches vary only within a 2-3% range for me.
The last few commits improved decompression performance by another ~5%; here is the current comparison against miniz:
name miniz:: ns/iter oxide:: ns/iter diff ns/iter diff % speedup
compress_bin_lvl_1 1,578,045 1,634,729 56,684 3.59% x 0.97
compress_bin_lvl_6 9,193,429 9,616,389 422,960 4.60% x 0.96
compress_bin_lvl_9 21,811,997 21,582,578 -229,419 -1.05% x 1.01
compress_code_lvl_1 1,199,268 1,129,718 -69,550 -5.80% x 1.06
compress_code_lvl_6 5,185,837 5,340,112 154,275 2.97% x 0.97
compress_code_lvl_9 7,042,597 7,159,632 117,035 1.66% x 0.98
compress_compressed_lvl_1 240,810 281,148 40,338 16.75% x 0.86
compress_compressed_lvl_6 1,448,752 1,569,350 120,598 8.32% x 0.92
compress_compressed_lvl_9 3,139,587 3,224,519 84,932 2.71% x 0.97
compress_short_lvl_1 4,552 20,976 16,424 360.81% x 0.22
compress_short_lvl_6 4,639 21,112 16,473 355.10% x 0.22
decompress_bin_lvl_1 970,568 806,123 -164,445 -16.94% x 1.20
decompress_bin_lvl_6 810,642 648,963 -161,679 -19.94% x 1.25
decompress_bin_lvl_9 802,212 637,111 -165,101 -20.58% x 1.26
decompress_code_lvl_1 732,376 610,690 -121,686 -16.62% x 1.20
decompress_code_lvl_6 521,172 395,393 -125,779 -24.13% x 1.32
decompress_code_lvl_9 519,140 392,498 -126,642 -24.39% x 1.32
decompress_compressed_lvl_1 169,651 120,782 -48,869 -28.81% x 1.40
decompress_compressed_lvl_6 240,749 193,696 -47,053 -19.54% x 1.24
decompress_compressed_lvl_9 241,462 193,670 -47,792 -19.79% x 1.25
decompress_short_lvl_1 5,211 3,376 -1,835 -35.21% x 1.54
The main change is that decompress_fast no longer uses read_bytes/read_bits to check for the end of input; instead it checks at the start that at least the theoretical maximum number of bytes needed is available, and then fills the bit_buffer when needed without those checks. There is also a special case for matches of length 3.
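Roughly, the pattern is the one below (a sketch with made-up names, not the actual decompress_fast): one availability check up front, then a refill helper that assumes enough input remains:

```rust
// Assumed worst case for the input consumed by one fast-loop iteration.
const MAX_BYTES_PER_ITERATION: usize = 14;

struct BitReader<'a> {
    input: &'a [u8],
    pos: usize,
    bit_buffer: u64,
    num_bits: u32,
}

impl<'a> BitReader<'a> {
    /// True when a whole fast-path iteration can run without any further
    /// end-of-input handling.
    fn has_bytes_for_fast_loop(&self) -> bool {
        self.input.len() - self.pos >= MAX_BYTES_PER_ITERATION
    }

    /// Refill the bit buffer byte by byte with no end-of-input checks;
    /// only valid after has_bytes_for_fast_loop() has returned true.
    fn refill_unchecked(&mut self) {
        while self.num_bits < 56 {
            self.bit_buffer |= u64::from(self.input[self.pos]) << self.num_bits;
            self.pos += 1;
            self.num_bits += 8;
        }
    }
}
```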
I have an idea for the next improvement, though: right now, up to 14 bytes are read from the input buffer per decompress_fast cycle. They are read as 32-bit words (or 16-bit on 32-bit systems), for a maximum of 4 (or 7) reads. Maybe if we read 16 (a constant) bytes with copy_from_slice, it would optimize to a single 128-bit read from the buffer (probably on the heap) straight onto the stack, where the bit magic can be applied. Then there would also be no if-checks when filling the bit_buffer, but there would be a bit of a dance to return the unused bits.
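Something along these lines, perhaps (an untested sketch with hypothetical names; how many of the 16 bytes were actually consumed still has to be tracked by the caller, which is the "dance to return unused bits"):

```rust
// Copy a fixed 16 bytes from the input onto the stack, hoping the compiler
// turns the copy into a single wide load, then fill the 64-bit buffer from
// the local array with no per-byte checks against the input length.
fn refill_from_stack(input: &[u8], pos: usize, bit_buffer: &mut u64, num_bits: &mut u32) {
    debug_assert!(*num_bits < 64);
    let mut chunk = [0u8; 16];
    chunk.copy_from_slice(&input[pos..pos + 16]); // one bounds check for all 16 bytes

    let mut first8 = [0u8; 8];
    first8.copy_from_slice(&chunk[..8]);
    let word = u64::from_le_bytes(first8);

    // Bits shifted out of the top here correspond to bytes that were read
    // but not yet used; they are the "unused bits" to hand back afterwards.
    *bit_buffer |= word << *num_bits;
    *num_bits = 64;
}
```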
Neat.
I added some extra checks to avoid some overflow/out-of-bounds issues that showed up in fuzzing and from people using this library after zip-rs adopted it as the default. Some of them may have been redundant, though I was focused on fixing the bugs first. I was worried that they might have impacted performance, but it seems not.
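As a trivial example of the kind of guard involved (not a specific line from the actual fix), validating a back-reference distance before subtracting keeps a corrupt stream from wrapping the index:

```rust
// Return the start index of a match in the output, or None if the stream
// claims a distance reaching before the beginning of the output buffer.
fn match_start(out_pos: usize, dist: usize) -> Option<usize> {
    out_pos.checked_sub(dist) // None => corrupt data, bail out with an error status
}
```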
As for optimizing more, I guess you just have to try it out; it may help.
Hey, so I'm writing a library that parses some compressed XML files, and I noticed that miniz_oxide is responsible for the great majority of the runtime of my test-case program. This is pretty surprising, as all of the other processing does a fair amount of work too (and plenty of allocations; it's not really optimized much at all). I don't have a lot to compare against, so maybe it's not fully avoidable (it could just be down to the algorithm rather than the implementation), but here's some performance data I collected.
miniz_oxide::compress::core::compress_inner is responsible for 90% of the L1 cache misses in the program.

And for 32% of the branch mispredictions (there are a lot of branches in my own code too, but nothing else comes close in proportion).

I just recompiled my application with flate2's "cloudflare_zlib" feature instead of the miniz_oxide backend, and it knocked nearly 30% off the runtime. Possibly a good place to mine for optimizations.
The system zlib backend was ~15% faster, but this is measuring my whole application, including all the XML parsing, so the actual differences are probably bigger.