image-png
Performance: Require `BufRead` instead of just `Read` for inputs.
PTAL?
This PR changes the decoder from pulling fixed-size 32KB chunks of input data to having the caller push arbitrarily sized chunks of data into the decoder. The default for `BufReader` is 8KB, so that'll likely be a common choice for users who don't have the full image data already loaded into memory. That makes me a bit nervous, since it adds a performance parameter outside our control and means there will be edge cases around small buffers that are hard to handle.
From testing locally on the QOI benchmark suite, I'm seeing a 1-2% performance regression with a (default-sized) `BufReader<&[u8]>` over a plain `&[u8]`. I don't expect anyone would actually use that type, but `BufReader<File>` is probably pretty common and may approximate it.
I wonder if the performance gains you've seen on this were caused by avoiding the 32KB allocation+memset on very small images? If so, we could think about changes to control the size of the buffer based on the image size (likely by switching away from internally using a `BufReader`, and instead having a standalone buffer for reading IDAT chunks).
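One way to picture the "size the buffer based on the image" idea is a small std-only sketch. The heuristic, thresholds, and `sized_reader` helper below are hypothetical illustrations, not the crate's actual code; the point is just that `BufReader::with_capacity` lets the capacity be chosen per input instead of fixed at 32KB:

```rust
use std::io::BufReader;

// Hypothetical heuristic: size the read buffer to the input length,
// clamped between std's 8 KiB default and the old 32 KiB internal default.
fn sized_reader<R: std::io::Read>(inner: R, input_len: usize) -> BufReader<R> {
    let cap = input_len.clamp(8 * 1024, 32 * 1024);
    BufReader::with_capacity(cap, inner)
}

fn main() {
    let small = vec![0u8; 100];
    let r = sized_reader(&small[..], small.len());
    // A tiny image gets the minimum buffer, avoiding a large allocation+memset.
    assert_eq!(r.capacity(), 8 * 1024);

    let big = vec![0u8; 1 << 20];
    let r = sized_reader(&big[..], big.len());
    // A large image is capped at the old default.
    assert_eq!(r.capacity(), 32 * 1024);
    println!("ok");
}
```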
> This PR changes the decoder from pulling fixed-size 32KB chunks of input data to having the caller push arbitrarily sized chunks of data into the decoder. The default for `BufReader` is 8KB, so that'll likely be a common choice for users who don't have the full image data already loaded into memory. That makes me a bit nervous since it adds a performance parameter out of our control and means there will be edge cases around small buffers that are hard to handle
I wonder if adding a benchmark that uses BufReader would lower the risk a little bit? For example, we could add one additional benchmark testcase: noncompressed-128x128-with-default-bufreader.png. WDYT?
> From testing locally on the QOI benchmark suite, I'm seeing a 1-2% performance regression with a (default-sized) `BufReader<&[u8]>` over a plain `&[u8]`. I don't expect anyone would actually use that type, but `BufReader<File>` is probably pretty common and may approximate it.
I think your experiment supports merging this PR:

- Your experiment shows that an extra `BufReader<...>` leads to a performance regression (you say that when you compare `BufReader<&[u8]>` vs `&[u8]`, the version with `BufReader` is slower by 1-2%)
- Your experiment does not approximate what would happen for `BufReader<File>` (since there is no `BufReader<...>` in `&[u8]`). For such an approximation, I think you can try comparing `BufReader<&[u8]>` with 32kB of buffer (the old default) vs `BufReader<&[u8]>` with 8kB of buffer (the new default from `std`).
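The suggested comparison can be roughed out without Criterion. This is only an illustrative stand-in harness (the real benchmarks use Criterion and actual PNG decoding); it reads the same in-memory input through a 32 KiB vs an 8 KiB `BufReader` and times just the read loop:

```rust
use std::io::{BufReader, Read};
use std::time::Instant;

// Drain `data` through a BufReader with the given capacity, pulling
// fixed 4 KiB chunks the way a consumer might. Returns total bytes read.
fn drain(cap: usize, data: &[u8]) -> usize {
    let mut reader = BufReader::with_capacity(cap, data);
    let mut chunk = [0u8; 4096];
    let mut total = 0;
    let start = Instant::now();
    loop {
        let n = reader.read(&mut chunk).unwrap();
        if n == 0 {
            break;
        }
        total += n;
    }
    println!("cap {:>5}: {:?}", cap, start.elapsed());
    total
}

fn main() {
    let data = vec![0xABu8; 8 << 20]; // 8 MiB of fake input
    assert_eq!(drain(32 * 1024, &data), data.len()); // old internal default
    assert_eq!(drain(8 * 1024, &data), data.len()); // std's default
}
```

Timing in-memory reads like this only approximates the buffering overhead, not decompression or I/O, so treat any numbers as directional.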
> I wonder if the performance gains you've seen on this were caused by avoiding the 32KB allocation+memset on very small images?
I think the data from the commit message suggests the opposite - the performance gains were much more pronounced for bigger images. I speculate that this is because the savings come from avoiding copying all of the image data, rather than from the initial, fixed-size allocation or memset. According to the commit message, the savings looked as follows:
- noncompressed-8x8.png:
  - [-2.2881% -1.4801% -0.4110%] (p = 0.00 < 0.05)
  - [-7.5687% -7.2013% -6.8838%] (p = 0.00 < 0.05)
- noncompressed-128x128.png:
  - [-12.495% -12.132% -11.760%] (p = 0.00 < 0.05)
  - [-10.597% -10.230% -9.8399%] (p = 0.00 < 0.05)
Sorry, I should have been more clear. The performance regression I found only happens with this PR but not on the main branch. That was what concerned me.
Running a different experiment, I tried making this patch to the benchmarks:
```diff
--- a/benches/decoder.rs
+++ b/benches/decoder.rs
@@ -48,7 +48,7 @@ fn bench_file(c: &mut Criterion, data: Vec<u8>, name: String) {
     group.throughput(Throughput::Bytes(info.buffer_size() as u64));
     group.bench_with_input(name, &data, |b, data| {
         b.iter(|| {
-            let decoder = Decoder::new(data.as_slice());
+            let decoder = Decoder::new(std::io::BufReader::new(data.as_slice()));
             let mut decoder = decoder.read_info().unwrap();
             decoder.next_frame(&mut image).unwrap();
         })
```
I then ran each benchmark against the main branch and again against this PR:
```
decode/kodim17.png
        time:   [+1.6160% +1.8703% +2.1665%] (p = 0.00 < 0.05)
        thrpt:  [-2.1206% -1.8359% -1.5903%]
decode/kodim07.png
        time:   [+1.3616% +1.6080% +1.9906%] (p = 0.00 < 0.05)
        thrpt:  [-1.9517% -1.5826% -1.3433%]
decode/kodim02.png
        time:   [+2.9415% +3.2626% +3.5050%] (p = 0.00 < 0.05)
        thrpt:  [-3.3863% -3.1595% -2.8574%]
decode/kodim23.png
        time:   [-0.3615% +0.5706% +1.3460%] (p = 0.21 > 0.05)
        thrpt:  [-1.3281% -0.5674% +0.3629%]
decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png
        time:   [+6.5478% +6.8271% +7.1205%] (p = 0.00 < 0.05)
        thrpt:  [-6.6472% -6.3908% -6.1454%]
decode/Transparency.png
        time:   [+1.7512% +2.0166% +2.3704%] (p = 0.00 < 0.05)
        thrpt:  [-2.3156% -1.9768% -1.7211%]
decode/generated-png:noncompressed-8x8.png
        time:   [-0.0576% +0.1127% +0.2824%] (p = 0.20 > 0.05)
        thrpt:  [-0.2816% -0.1125% +0.0576%]
decode/generated-png:noncompressed-128x128.png
        time:   [+21.104% +21.693% +22.380%] (p = 0.00 < 0.05)
        thrpt:  [-18.287% -17.826% -17.427%]
```
I'll also note I'm somewhat biased here. This API change would have some ripple effects for parts of the main image crate's API, probably requiring that crate to either unconditionally wrap PNG input streams in a `BufReader` or require all users of all decoders to adhere to the `BufRead` trait bound. If possible, I'd rather find a way to avoid that without sacrificing performance.
@fintelia - could you please upload your benchmarking setup to a github repo + point out the hashes of the before/after commits? This would help me understand what is being compared in your experiment.
Is it maybe like this?

- Before: origin/master (i.e. f80dfe93bc67e7a515aea9216af2a9feeeb696df) + `benches/decoder.rs` changes from your comment above (resulting in 2 `BufReader`s: first 8kB buffering and then 32kB buffering)
- After: this PR (i.e. f24d4ffe43ce929dfcc1759d419e20aca37203c5) + `benches/decoder.rs` changes from your comment above (only 1 `BufReader` with 8kB buffer)
> This API change would have some ripple effects for parts of the main image crate's API, probably requiring that crate to either unconditionally wrap PNG input streams in a `BufReader` or require all users of all decoders to adhere to the `BufRead` trait bound.
Yes, this is true. I think I'd prefer to require all users of `image::codecs::png::PngDecoder` to adhere to the `BufRead` trait bound (but not necessarily "all decoders").
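The two options can be sketched with a toy decoder standing in for `png::Decoder` after this PR (names here are illustrative, not the real crates' APIs): the wrapping crate either keeps accepting any `Read` and buffers internally, or forwards the `BufRead` bound to its own users.

```rust
use std::io::{BufRead, BufReader, Read};

// Toy stand-in for a decoder that, after this PR, requires BufRead.
struct ToyDecoder<R: BufRead> {
    inner: R,
}

impl<R: BufRead> ToyDecoder<R> {
    fn new(inner: R) -> Self {
        ToyDecoder { inner }
    }
    fn first_byte(&mut self) -> std::io::Result<u8> {
        let buf = self.inner.fill_buf()?;
        let b = buf[0];
        self.inner.consume(1);
        Ok(b)
    }
}

// Option 1: the wrapping crate accepts any Read and wraps it itself,
// paying for an extra buffer even when the input is already in memory.
fn decode_any<R: Read>(r: R) -> std::io::Result<u8> {
    ToyDecoder::new(BufReader::new(r)).first_byte()
}

// Option 2: the wrapping crate forwards the BufRead bound to its users.
fn decode_buffered<R: BufRead>(r: R) -> std::io::Result<u8> {
    ToyDecoder::new(r).first_byte()
}

fn main() {
    let data: &[u8] = b"\x89PNG";
    assert_eq!(decode_any(data).unwrap(), 0x89);
    // &[u8] implements BufRead, so in-memory inputs need no extra buffer.
    assert_eq!(decode_buffered(data).unwrap(), 0x89);
    println!("ok");
}
```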
I am also biased - the speed of decoding in-memory PNGs is most important to me, because this is the form of PNG input in Chromium's //ui/gfx and in Chromium's Blink (in fact, if you squint, blink::SegmentReader::GetSomeData is a little bit like `BufReader::fill_buf`, although there is no equivalent of `BufReader::consume` and blink::PngImageReader has to track "consumption" by managing the position argument / read_offset_ field).
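For readers unfamiliar with the `fill_buf`/`consume` pair being compared to Blink's API: `fill_buf` exposes the reader's internal buffer without consuming it, and `consume(n)` marks `n` bytes as read. A minimal demonstration using the `BufRead` impl on `&[u8]`:

```rust
use std::io::BufRead;

fn main() {
    // &[u8] implements BufRead: fill_buf returns the remaining slice,
    // consume(n) advances past n bytes. Repeated fill_buf calls without
    // an intervening consume return the same data.
    let mut input: &[u8] = b"IDAT chunk bytes";
    assert_eq!(input.fill_buf().unwrap(), b"IDAT chunk bytes");
    input.consume(5); // skip past "IDAT "
    assert_eq!(input.fill_buf().unwrap(), b"chunk bytes");
    println!("ok");
}
```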
I think that understanding the performance impact of this PR is important, so let's continue the discussion here. At the same time, maybe the 2 PRs at https://github.com/image-rs/image-png/pull/429 and https://github.com/image-rs/image-png/pull/428 are less controversial and maybe this is where we can focus the review efforts for now?
Landing those other 2 PRs first may be desirable to:

- Magnify the impact of the `BufReader` changes (and make it more easily measurable/detectable) by landing the other improvements first. This should help to magnify any negative as well as positive impact.
- Bundle the breaking changes together, as you've suggested elsewhere
BTW, one additional argument for not merging this PR yet is that so far Chromium has not landed any code that uses the png crate. Hopefully this will happen soon, but I can understand the desire to prioritize the experience of existing consumers of the png crate (over the hypothetical, future experience of aspirational consumers - like Chromium).
Still, I am curious about the benchmarking results above, and would like to debug and understand them better. (Although as I said before, this seems lower priority than discussing the other 2, less controversial PRs.)
For full transparency, let me share some recent observations. I have rebased this PR on top of the latest changes, but when rerunning the benchmarks I realized that for some testcases (e.g. a noncompressed 2048x2048 image, split across 64kB IDAT chunks) I observe a significant regression. The binary size and the number of instructions go down (6550545993 instructions => 6358944886 instructions, a 3% improvement), but the stalled backend cycles increase from 5.37% backend cycles idle to 40.61% backend cycles idle (which translates into a regression of the overall runtime).

I think that getting rid of the intermediate `BufReader` means that decompression can consume much bigger chunks of input at a time (a whole 64kB IDAT, instead of the 32kB size of the intermediate `BufReader`), and I am guessing that this change is somehow negatively impacting hardware-driven memory prefetching. I don't yet understand what is happening exactly, what actions we can take, or what it means for this PR.
Any chance your CPU has a 64KB L1 cache? The current approach does three copies with size=32KB: input -> BufReader -> out_buffer -> data_stream, which should stay inside the L1 cache.

This PR changes it to two copies, input -> out_buffer -> data_stream, but in the process makes the copy size depend on the underlying image's IDAT sizes. With 64KB IDATs, that means the working set for each copy becomes 128KB.
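The working-set arithmetic above can be made concrete with a toy chunked copy (the helper below is illustrative, not the decoder's code): each `copy_from_slice` touches one source chunk plus one destination chunk, so the per-copy working set is roughly 2x the chunk size - about 64KB for 32KB chunks, but about 128KB for 64KB IDAT-sized chunks, which would overflow a 64KB L1 data cache.

```rust
// Copy src into dst in fixed-size chunks. The chunk size determines the
// working set of each inner copy: src chunk + dst chunk must both be hot
// in cache for the copy to avoid stalls.
fn copy_in_chunks(src: &[u8], dst: &mut [u8], chunk: usize) {
    for (s, d) in src.chunks(chunk).zip(dst.chunks_mut(chunk)) {
        d.copy_from_slice(s);
    }
}

fn main() {
    let src = vec![1u8; 256 * 1024];
    let mut dst = vec![0u8; 256 * 1024];
    copy_in_chunks(&src, &mut dst, 32 * 1024); // ~64KB working set per copy
    copy_in_chunks(&src, &mut dst, 64 * 1024); // ~128KB working set per copy
    assert_eq!(src, dst);
    println!("ok");
}
```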
Let me convert this PR to a "draft", so that it won't get accidentally merged before we understand the performance impact better. There are multiple factors at play, so I think that (instead of continuing the discussion here) I'll try to post something to https://github.com/image-rs/image-png/discussions/416.