flate2-rs
Extremely slow performance in debug mode with default backend
Hi!
In a particular project, I use flate2 to decompress a ~50MB gzipped tarfile.
Whilst in production the project will be built in release mode, the integration tests are performed using debug builds, and when iterating locally when developing, I use debug builds too.
In addition, due to the nature of the project (a Cloud Native Buildpack that's targeting x86_64 Linux), these integration tests and any manual testing have to run inside an x86_64 Docker container. After recently obtaining a new MacBook Pro M1 Max (which has to use Docker's qemu emulation for x86_64 Docker images), I was surprised to see the integration tests take considerably longer than they used to on my much older machine.
Investigating, it turns out that when using the default flate2 backend of miniz_oxide and the below testcase:
- debug builds are ~30x slower than release builds, when run on ARM64 natively
- debug builds are ~60x slower than release builds, when run under Docker's qemu emulation
In contrast, when using the zlib or zlib-ng-compat backends, debug builds are only 2-4x slower than release builds.
Whilst debug builds are expected to be slower than release builds, I was quite surprised that they were 30-60x slower for this crate using the default backend.
I'm presuming there's not much that can be done to improve the performance of miniz_oxide in debug builds; however, I was wondering if it would be worth mentioning the performance difference in this crate's docs, particularly given that:
(a) switching backends makes such a difference here, and
(b) the docs currently suggest that the default backend is mostly "good enough" (otherwise I would have tried another backend sooner):
> There’s various tradeoffs associated with each implementation, but in general you probably won’t have to tweak the defaults.

(from https://docs.rs/flate2/latest/flate2/#implementation)
It was only later that I noticed this section in the README (which isn't on docs.rs) that seemed to imply the zlib-ng backend was actually faster:
https://github.com/rust-lang/flate2-rs#backends
Testcase:

```rust
use flate2::read::GzDecoder;
use std::fs::File;

fn main() -> Result<(), std::io::Error> {
    // Archive is from:
    // https://heroku-buildpack-python.s3.amazonaws.com/heroku-20/runtimes/python-3.10.3.tar.gz
    let archive = File::open("python-3.10.3.tar.gz")?;
    let mut destination = tempfile::tempfile()?;
    let mut decoder = GzDecoder::new(archive);
    std::io::copy(&mut decoder, &mut destination)?;
    Ok(())
}
```
```toml
[package]
name = "testcase-flate2-debug"
version = "0.1.0"
edition = "2021"

[dependencies]
# For the default backend
flate2 = "1.0.22"
# For the alternate backends
# flate2 = { version = "1.0.22", features = ["zlib-ng-compat"], default-features = false }
# flate2 = { version = "1.0.22", features = ["zlib"], default-features = false }
tempfile = "3.3.0"
```
Results:

| Backend | Architecture | Wall time (release build) | Wall time (debug build) | Debug slowdown |
|---|---|---|---|---|
| miniz_oxide (default) | Native ARM64 | 0.69s | 21.55s | 31x |
| miniz_oxide (default) | AMD64 under qemu | 3.41s | 207s | 60x |
| zlib | Native ARM64 | 0.65s | 1.26s | 1.9x |
| zlib | AMD64 under qemu | 2.19s | 9.22s | 4.2x |
| zlib-ng-compat | Native ARM64 | 0.55s | 1.43s | 2.6x |
| zlib-ng-compat | AMD64 under qemu | ??? | ??? | ??? |
(The missing timings for zlib-ng-compat under qemu are due to cross-compilation of zlib-ng currently failing: https://github.com/rust-lang/libz-sys/issues/93)
Yeah, Rust in debug mode is going to be much, much slower than anything written in C, due to the nature of the languages. (And I'm not sure whether the system zlib will even be used in debug/no-optimization mode.)
Turning on the first level of optimizations in debug mode may help a fair bit; there may also be other workarounds, such as avoiding compiling all deps in debug mode or using different optimization levels for the main project and its deps, but I'm not sure.
I wasn't able to get perf working inside a QEMU'd Docker container (due to "PERF_FLAG_FD_CLOEXEC not implemented" errors), so unfortunately I couldn't profile the chronic 207s case.
However, this is a flamegraph for a native ARM64 debug build (the 21.55s entry in the table above). (It has to be downloaded for the interactivity to work; GitHub hosting disables it.)
As can be seen, 77% of the profile is in Adler32::compute():
https://github.com/jonas-schievink/adler/blob/a94f525f62698d699d1fb3cc9112db8c35662b16/src/algo.rs#L5-L107
With 60% of the total profile within the implementation of AddAssign<Self> for U32X4 (used from Adler32::compute()):
https://github.com/jonas-schievink/adler/blob/a94f525f62698d699d1fb3cc9112db8c35662b16/src/algo.rs#L124-L130
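For context, the hot AddAssign the flamegraph points at operates on a small four-lane accumulator. Below is a minimal, illustrative sketch of that pattern (not the adler crate's actual code; the names mirror it for readability). The point is that in a release build the per-lane loop compiles down to a handful of SIMD-friendly instructions, while in a debug build each lane update remains separate, un-inlined method calls, which is consistent with this path dominating the unoptimized profile.

```rust
// Illustrative sketch of a U32X4-style accumulator (assumed shape, not the
// adler crate's real implementation).
#[derive(Copy, Clone, Debug, PartialEq)]
struct U32X4([u32; 4]);

impl std::ops::AddAssign for U32X4 {
    fn add_assign(&mut self, other: Self) {
        // Four independent lane additions: trivially vectorizable when
        // optimized, but a chain of un-inlined calls at opt-level 0.
        for (lane, rhs) in self.0.iter_mut().zip(other.0.iter()) {
            *lane = lane.wrapping_add(*rhs);
        }
    }
}

fn main() {
    let mut acc = U32X4([1, 2, 3, 4]);
    acc += U32X4([10, 20, 30, 40]);
    assert_eq!(acc, U32X4([11, 22, 33, 44]));
    println!("{:?}", acc);
}
```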
You can override opt-level for certain crates in debug mode (see https://doc.rust-lang.org/cargo/reference/profiles.html#overrides); adding the following to Cargo.toml should make it faster:
```toml
[profile.dev.package.miniz_oxide]
opt-level = 3
```
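If miniz_oxide's own dependencies (such as the adler crate seen in the flamegraph) also need optimizing, Cargo accepts a wildcard override that raises the opt-level of all dependencies while leaving the project's own code unoptimized. A possible variant:

```toml
# Optimize every dependency in dev builds, but keep the project's own
# crates at the default debug opt-level for fast compiles and easy debugging.
[profile.dev.package."*"]
opt-level = 3
```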
Closing as this is more of a Rust issue rather than a flate2-specific one.