
Benchmark stdlib compression code

Open emmatyping opened this issue 7 months ago • 5 comments

At PyCon US, I was chatting with @gpshead about adding compression benchmarks. While a lot of the "heavy lifting" of compression happens in the libraries CPython binds (zlib, liblzma, etc.), the handling of output buffers in CPython has a significant impact on performance, and it is an area we don't have much visibility into.

One of the better-known cross-algorithm compression benchmarks I'm aware of is lzbench, which tests compression performance across many algorithms on the Silesia compression corpus. I figure running compression benchmarks at varied settings on Silesia would provide a good starting point for benchmarking the output buffer and other binding code.
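Concretely, I'm picturing something along these lines, using pyperf the way pyperformance benchmarks do; the corpus path and the specific calls here are placeholders rather than a settled design:

```python
# Rough sketch, not a settled design: time stdlib compression over one Silesia
# file (path is a placeholder) using pyperf, as pyperformance benchmarks do.
import bz2
import lzma
import zlib

import pyperf


def bench_compress(loops, compress, data):
    t0 = pyperf.perf_counter()
    for _ in range(loops):
        compress(data)
    return pyperf.perf_counter() - t0


if __name__ == "__main__":
    runner = pyperf.Runner()
    with open("silesia/dickens", "rb") as f:  # placeholder corpus file
        data = f.read()
    runner.bench_time_func("compress_zlib", bench_compress, zlib.compress, data)
    runner.bench_time_func("compress_bz2", bench_compress, bz2.compress, data)
    runner.bench_time_func("compress_lzma", bench_compress, lzma.compress, data)
```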

emmatyping avatar May 22 '25 21:05 emmatyping

We shouldn't care about the underlying compression library's own performance, but running them all at their least compression/fastest modes would give us an idea of our own overhead.

Some may have a "0" mode for no compression - in others "0" means "default" or is an error IIRC - but otherwise they all have a concept of fast, just universally using a level of "1" regardless of algorithm is probably sufficient enough that we shouldn't overthink it.

gpshead avatar May 22 '25 22:05 gpshead

indygreg's zstandard exposes a richer API than pyzstd, so it could be interesting to see what perf you can get out of it compared to compression.zstd (maybe a bit of a grey area between CPython overhead and underlying library performance).
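e.g. something shaped like this (compression.zstd is 3.14+, zstandard is the third-party package, and the input file and level here are arbitrary picks, so treat it as a sketch):

```python
# Shape of the comparison only: compression.zstd (stdlib, Python 3.14+) vs. the
# third-party zstandard package; input path and level are arbitrary choices.
import time

from compression import zstd  # stdlib, Python 3.14+
import zstandard              # third-party binding to libzstd


def time_it(fn, data, repeat=10):
    start = time.perf_counter()
    for _ in range(repeat):
        fn(data)
    return (time.perf_counter() - start) / repeat


with open("silesia/dickens", "rb") as f:  # placeholder input
    data = f.read()

stdlib_time = time_it(lambda d: zstd.compress(d, level=3), data)
thirdparty_time = time_it(zstandard.ZstdCompressor(level=3).compress, data)
print(f"compression.zstd: {stdlib_time:.3f}s  zstandard: {thirdparty_time:.3f}s")
```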

hauntsaninja avatar May 22 '25 23:05 hauntsaninja

We shouldn't care about the underlying compression library's own performance

Absolutely agree. I'm imagining we'd want to keep the underlying compression library versions as consistent as possible to avoid benchmarking the libraries themselves. Then we can compare across changes, which I think is the main benefit. One area I'd like to make sure doesn't regress (and potentially look at tweaking/improving) is the output buffer code, https://github.com/python/cpython/blob/main/Include/internal/pycore_blocks_output_buffer.h, which can have a significant impact on performance.
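For a rough sense of the workload that stresses that code path: decompressing a small, highly compressible payload forces the blocks output buffer to grow many times. A minimal sketch:

```python
# Rough illustration: a tiny compressed payload that expands to 64 MiB makes
# the blocks output buffer grow repeatedly during decompression.
import zlib

payload = zlib.compress(b"\x00" * (64 * 1024 * 1024), level=9)
assert len(zlib.decompress(payload)) == 64 * 1024 * 1024
```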

running them all at their least compression/fastest modes would give us an idea of our own overhead.

I expect this may not be a representative benchmark, because the amount of output matters here, so choosing very low compression levels will probably exaggerate our overhead. It also would not properly exercise the output buffer code, as the buffer sizes could be significantly larger than in real-world scenarios.

What I'd like to see is benchmarking where the library version stays the same and we vary the size of the data, the compression level, and potentially some compression flags. The point is not to compare among these configurations, but rather to compare changes in stdlib code across several different usage scenarios.
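Roughly what I have in mind is a matrix like the sketch below; the sizes, levels, and data source are placeholders, and the point is running the same matrix before and after a stdlib change:

```python
# Placeholder matrix: hold the library versions fixed, vary input size and
# level, and compare the same matrix across CPython/stdlib changes.
import itertools
import zlib

import pyperf

SIZES = (16 * 1024, 1024 * 1024, 16 * 1024 * 1024)  # placeholder sizes
LEVELS = (1, 6, 9)                                   # placeholder levels


def bench_roundtrip(loops, data, level):
    t0 = pyperf.perf_counter()
    for _ in range(loops):
        zlib.decompress(zlib.compress(data, level=level))
    return pyperf.perf_counter() - t0


if __name__ == "__main__":
    runner = pyperf.Runner()
    with open("silesia/dickens", "rb") as f:  # placeholder data source
        corpus = f.read()
    for size, level in itertools.product(SIZES, LEVELS):
        name = f"zlib_roundtrip_{size}B_level{level}"
        runner.bench_time_func(name, bench_roundtrip, corpus[:size], level)
```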

indygreg's zstandard exposes a richer API than pyzstd, so it could be interesting to see what perf you can get out of it compared to compression.zstd (maybe a bit of a grey area between CPython overhead and underlying library performance).

I think that could also be interesting, but zstandard uses unstable libzstd APIs, which I do not want the stdlib to use, and it builds against the latest version of libzstd. I expect comparisons there might be tricky.

emmatyping avatar May 23 '25 00:05 emmatyping

Also, I just re-read my original message and realized it could be read to mean that I want to run lzbench, or tests like it, to check the performance of the underlying compression libraries. That's not what I want! Sorry for any confusion. I was merely calling it out as prior art/inspiration and noting that the Silesia corpus is probably a good dataset to use when we write our own benchmarks.

emmatyping avatar May 23 '25 00:05 emmatyping

Revisiting this, I guess one tricky thing to figure out is how to handle benchmark data. Even xz-compressed, the Silesia corpus is 47 MB (65 MB zipped), which seems somewhat large to check into git. I suppose we can start with benchmarking smaller amounts of data, but the Silesia corpus is the most commonly used and has a good data mix for benchmarking compression performance.
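One option, purely as a sketch (the URL is the mirror I'm aware of and the cache location is arbitrary), would be to download and cache the corpus at setup time rather than vendoring it:

```python
# Sketch: fetch and cache the Silesia corpus at setup time instead of
# committing it to the repo. URL and cache path are assumptions, not decisions.
import urllib.request
import zipfile
from pathlib import Path

SILESIA_URL = "http://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip"  # assumed mirror
CACHE_DIR = Path.home() / ".cache" / "pyperformance-silesia"       # arbitrary location


def get_silesia() -> Path:
    """Return the directory containing the extracted Silesia corpus files."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    archive = CACHE_DIR / "silesia.zip"
    if not archive.exists():
        urllib.request.urlretrieve(SILESIA_URL, archive)
    extracted = CACHE_DIR / "silesia"
    if not extracted.exists():
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(extracted)
    return extracted
```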

emmatyping avatar Sep 01 '25 18:09 emmatyping