RFE: encoder: add stride & various raw input formats support
The input format is not necessarily RGB or RGBA, and doing a pre-pass conversion can be quite costly (adding about 15-20% to the total time in my empirical tests).
For simplicity reasons, I made sure not to break the existing API.
Those changes don't seem to affect the encoder performance in a significant way.
See also: https://github.com/Devolutions/IronRDP/pull/670
@aldanor hi, wdyt?
@aldanor ping :thanks:
@aldanor are you still maintaining the crate? thanks
The API feels kind of confusing to me. It would be cool to control the input format and output format independently. My use case: I have an image with RGBA data where I know the image has no translucency, so I want to save it as RGB. I don't know how I would do this with this PR, which is why I created #15.
Would be cool if it was possible to control the output format independently like in my pr.
And I don't really understand the stride argument. That should maybe be documented better.
> The API feels kind of confusing to me. It would be cool to control the input format and output format independently. My use case: I have an image with RGBA data where I know the image has no translucency, so I want to save it as RGB. I don't know how I would do this with this PR, which is why I created #15.
You can use `RawChannels::Rgbx`, `Bgrx`, `Xbgr`, etc. from this PR for that.
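To illustrate what an `Rgbx`-style layout means in practice (a sketch only; the `RawChannels` variants come from this PR and the helper below is hypothetical, not crate API): each pixel is 4 bytes where the fourth byte is padding, so the encoder can read the first three and skip the fourth without a separate RGBA-to-RGB conversion pass.

```rust
// Sketch: byte layout implied by an RGBX input. Each pixel is 4 bytes
// (R, G, B, padding); only the first three carry color information.
fn rgbx_to_rgb(rgbx: &[u8]) -> Vec<u8> {
    assert_eq!(rgbx.len() % 4, 0, "RGBX data must be a multiple of 4 bytes");
    rgbx.chunks_exact(4)
        .flat_map(|px| px[..3].iter().copied()) // drop the padding byte
        .collect()
}

fn main() {
    // One red and one green pixel, padding byte set to 0xff.
    let rgbx = [255, 0, 0, 0xff, 0, 255, 0, 0xff];
    assert_eq!(rgbx_to_rgb(&rgbx), vec![255, 0, 0, 0, 255, 0]);
}
```

The point of the PR is that the encoder does this skipping inline while encoding, rather than materializing an intermediate RGB buffer like the function above.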
> Would be cool if it was possible to control the output format independently like in my PR.
Well, you can't really mix input and output formats freely: you need alpha-channel input for alpha-channel output, and QOI has only two output formats, RGB and RGBA...
> And I don't really understand the stride argument. That should maybe be documented better.
https://learn.microsoft.com/en-us/windows/win32/medfound/image-stride
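In short, the stride is the number of bytes from the start of one row to the start of the next, which can exceed `width * bytes_per_pixel` when rows are padded for alignment. A minimal sketch (the function name is illustrative, not crate API):

```rust
// Sketch: mapping a pixel coordinate (x, y) to a byte offset in a raw
// buffer whose rows are `stride` bytes apart. With padded rows,
// stride > width * bytes_per_pixel.
fn pixel_offset(x: usize, y: usize, stride: usize, bytes_per_pixel: usize) -> usize {
    y * stride + x * bytes_per_pixel
}

fn main() {
    // A 3-pixel-wide RGB image padded to 16-byte rows: stride = 16, not 9.
    let (stride, bpp) = (16, 3);
    assert_eq!(pixel_offset(0, 0, stride, bpp), 0);
    // Second row starts at byte 16; pixel x=2 sits 6 bytes into it.
    assert_eq!(pixel_offset(2, 1, stride, bpp), 22);
}
```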
> You can use `RawChannels::Rgbx`, `Bgrx`, `Xbgr`, etc. from this PR for that.
True, didn't think about that. But what if someone wants to add an alpha channel?
> You need alpha-channel input for alpha-channel output.
Adding an alpha channel with a default value of 255 would IMHO be fine and not unexpected.
> > You need alpha-channel input for alpha-channel output.
>
> Adding an alpha channel with a default value of 255 would IMHO be fine and not unexpected.
Well, this use case goes beyond typical simple encoding, IMO.
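For reference, the expansion being discussed (not part of this PR; the helper name is illustrative) is a trivial pre-pass: widen each RGB pixel to RGBA with an opaque alpha of 255.

```rust
// Sketch: expanding RGB input to RGBA by filling the alpha channel
// with an opaque default of 255, as suggested above.
fn rgb_to_rgba(rgb: &[u8]) -> Vec<u8> {
    assert_eq!(rgb.len() % 3, 0, "RGB data must be a multiple of 3 bytes");
    rgb.chunks_exact(3)
        .flat_map(|px| [px[0], px[1], px[2], 255]) // append opaque alpha
        .collect()
}

fn main() {
    assert_eq!(rgb_to_rgba(&[10, 20, 30]), vec![10, 20, 30, 255]);
}
```

Done as a pre-pass like this, it costs an extra buffer and copy, which is exactly the overhead this PR tries to avoid for the formats it does support inline.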
Why is `RawChannels::bytes_per_pixel` not public? There should either be another `Encoder` constructor without a stride argument, or `bytes_per_pixel` should be public. Otherwise the `Encoder::new_raw` API is IMHO too clunky to use if you have no stride.
> Why is `RawChannels::bytes_per_pixel` not public? There should either be another `Encoder` constructor without a stride argument, or `bytes_per_pixel` should be public. Otherwise the `Encoder::new_raw` API is IMHO too clunky to use if you have no stride.
We could make `stride` an `Option`. Alternatively, we could have an `EncoderBuilder`, but this might be overkill.
Stride could even be `Option<NonZeroUsize>`, but `Option<usize>` would be fine as well.
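The `Option<usize>` idea boils down to defaulting to the tightly packed row size when the caller has no padding (a sketch; the function is illustrative, not crate API):

```rust
// Sketch: an optional stride that defaults to the tightly packed row
// size, so callers without row padding don't have to compute it.
fn effective_stride(stride: Option<usize>, width: usize, bytes_per_pixel: usize) -> usize {
    stride.unwrap_or(width * bytes_per_pixel)
}

fn main() {
    // No padding: 640 RGBA pixels per row -> 2560 bytes.
    assert_eq!(effective_stride(None, 640, 4), 2560);
    // Explicit padded stride wins over the computed default.
    assert_eq!(effective_stride(Some(2600), 640, 4), 2600);
}
```

With `Option<NonZeroUsize>`, `Some(0)` becomes unrepresentable and `Option<NonZeroUsize>` is the same size as `usize`, at the cost of a slightly noisier call site.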
@elmarco Added some comments; please let me know whether you want to continue working on this PR.
Also, you'd need to rebase your branch after removing commits from this branch, as lots of fixes are already on main:
- bench cli fix (plus --stream mode addon)
- fuzz package fix
- large buffer fix
- run-starting edgecase fix
- there are also commits in this PR that seemingly belong to the fork, like renaming the package
There's also a few questions:
- can you run the benchmark before/after on the encoder and post the results, to make sure that double nested loop in the encoder isn't affecting things much?
- should there be any tests with the streaming mode? (it should just work, but still)
@aldanor thanks for the review
> * can you run the benchmark before/after on the encoder and post the results, to make sure that double nested loop in the encoder isn't affecting things much?
This is the biggest problem; it seems I didn't benchmark properly. I see a -33% encoding performance regression. I am trying to understand where it comes from and how to fix it. Thanks.
> I am trying to understand where it comes from and how to fix it
It might come from the double loop, with a now-unknown number of iterations in each, preventing some common loop optimizations? I guess you can godbolt something equivalent.
Yea, -33% is pretty bad for sure... -3% would still be not nice but within noise bounds.
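To make the loop-shape concern concrete, here is a sketch (illustrative function, not the PR's actual encoder loop) of the strided two-level walk in question, written so the inner loop has a fixed, compiler-visible chunk size:

```rust
// Sketch: a per-row walk over a strided raw buffer. Using chunks_exact
// with a constant chunk size gives the optimizer an exact, remainder-free
// inner trip count, which helps bounds-check elision and vectorization;
// a hand-rolled inner loop with a runtime-only bound often does not.
fn sum_rgb_strided(data: &[u8], width: usize, height: usize, stride: usize) -> u64 {
    let mut sum = 0u64;
    for y in 0..height {
        // Slice out just the pixel bytes of this row, skipping row padding.
        let row = &data[y * stride..y * stride + width * 3];
        for px in row.chunks_exact(3) {
            sum += px[0] as u64 + px[1] as u64 + px[2] as u64;
        }
    }
    sum
}

fn main() {
    // Two rows of two RGB pixels each, padded to 8-byte rows (stride = 8).
    let data = [1, 2, 3, 4, 5, 6, 0, 0, 7, 8, 9, 10, 11, 12, 0, 0];
    assert_eq!(sum_rgb_strided(&data, 2, 2, 8), 78);
}
```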
> Yea, -33% is pretty bad for sure... -3% would still be not nice but within noise bounds.
It turns out it's the `assert_eq!()`: it adds a bunch of panic-handling code and prevents some optimizations. I switched it to `debug_assert_eq!()`. Now no performance regression is observable (with `perf stat`).
`assert_eq!`:
```
Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
              qoi.h      qoi-rust
--------------------------------------
decode Mp/s   279.8      381.8
       MB/s   876.6      1196.4
--------------------------------------
encode Mp/s   224.4      206.9
       MB/s   703.3      648.2
--------------------------------------
```
`debug_assert_eq!`:
```
Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
              qoi.h      qoi-rust
--------------------------------------
decode Mp/s   280.2      381.9
       MB/s   878.0      1196.6
--------------------------------------
encode Mp/s   225.6      208.9
       MB/s   706.9      654.7
--------------------------------------
```
Commit 81c14c4b637c326a801e40701460b654d46e8fbe:
```
Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
              qoi.h      qoi-rust
--------------------------------------
decode Mp/s   278.7      364.3
       MB/s   873.4      1141.6
--------------------------------------
encode Mp/s   218.3      254.8
       MB/s   684.2      798.3
--------------------------------------
```
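The mechanism behind the fix above, sketched with an illustrative function (not the actual encoder code): `assert_eq!` keeps the comparison and a panic path in release builds, while `debug_assert_eq!` compiles to nothing when `debug_assertions` is off, so the hot loop body stays branch-free.

```rust
// Sketch: why debug_assert_eq! is preferable inside a hot encoding loop.
// In release builds this assertion vanishes entirely, leaving no panic
// machinery to inhibit loop optimizations.
fn sum_pixels(data: &[u8], bytes_per_pixel: usize) -> u64 {
    let mut sum = 0u64;
    for px in data.chunks_exact(bytes_per_pixel) {
        // Checked only in debug builds; no panic path in release.
        debug_assert_eq!(px.len(), bytes_per_pixel);
        sum += px.iter().map(|&b| b as u64).sum::<u64>();
    }
    sum
}

fn main() {
    assert_eq!(sum_pixels(&[1, 2, 3, 4, 5, 6], 3), 21);
}
```

The trade-off is that invariant violations in release builds go undetected rather than panicking, which is usually acceptable for internal invariants like this one.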
For me it's still slower with `debug_assert_eq!`.
Did you compile with `--release`?
I ran `cargo run --release ../assets/` in the bench folder.
(I switched to stable Rust, as I get slightly different results with nightly, which can create confusion.)
@Joshix-1 try with the latest commit. You should not see a performance loss (I actually observe a slight improvement, something like 0.5%). Something is weird when enabling the "extra-source" feature: `encode_impl()` doesn't get optimized the same way, and we can observe a 5-10% regression. It seems to be related to the `Fn` / closure somehow. But each format or fn specialization should receive an independent analysis and optimization, no? It may be worth asking/reporting to the compiler team. In the meantime, perhaps the "extra-source" feature flag is acceptable?
ce61683e8682edb68f8f040f4cbbce1eae143fb7 has a similar effect. Seems to be some weird case where the compiler doesn't optimize.
@Joshix-1 I am really curious to know how you found out about adding this extra `E`/`Infallible`!
I created a flamegraph and saw `Try` taking a big chunk.
I tried to make a reduced test case to report to the compiler, without success atm.
@Joshix-1 what should we do next?
- use your workaround
- add the feature flag I proposed
- look for another solution (is the stream writer really useful?)
- wait for a compiler fix
thanks
I think my commit could also improve the performance of master, which would mean that this PR has to get even faster.
I have some ideas I want to test out; I'll do some more testing and benchmarking. The stream writer is IMHO useful, I don't see a reason to remove it.
I could not improve the performance of master. I would suggest using https://github.com/aldanor/qoi-rust/commit/50293f36d300511bbd5b16a6a1bb646ee63b98b8
With
```toml
[profile.release]
opt-level = 3
debug = true
```
in `bench/Cargo.toml` I get:
```
$ cargo run --release ../assets/
Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
              qoi.h      qoi-rust
--------------------------------------
decode Mp/s   274.2      383.2
       MB/s   859.3      1200.9
--------------------------------------
encode Mp/s   220.7      260.4
       MB/s   691.7      816.0
--------------------------------------
```
Without `debug = true` I get:
```
Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
              qoi.h      qoi-rust
--------------------------------------
decode Mp/s   280.9      370.7
       MB/s   880.2      1161.4
--------------------------------------
encode Mp/s   226.5      252.6
       MB/s   709.8      791.6
--------------------------------------
```
On master I get:
```
Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
              qoi.h      qoi-rust
--------------------------------------
decode Mp/s   284.8      368.8
       MB/s   892.4      1155.7
--------------------------------------
encode Mp/s   220.2      257.1
       MB/s   690.0      805.7
--------------------------------------
```
But I don't really know anymore; it's all really weird and unexpected.
Indeed. I am OK with https://github.com/aldanor/qoi-rust/commit/50293f36d300511bbd5b16a6a1bb646ee63b98b8; we can later revert it.
Why revert it later? The change makes sense: infallible methods shouldn't return results that could contain errors.
The compiler should be able to infer that the `BytesMut` writer is in fact `Infallible` when inlining; I think that's what it's doing when "extra-source" is off.
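For readers following along, the `Infallible` trick looks roughly like this (a sketch with a hypothetical trait; the crate's actual writer abstraction differs): by fixing the associated error type to `std::convert::Infallible` at the type level, the `Err` arm of every `?` becomes statically unreachable, so the compiler can drop the error-propagation machinery entirely instead of relying on inlining to discover it.

```rust
use std::convert::Infallible;

// Sketch: a hypothetical writer trait with an associated error type.
trait PixelWriter {
    type Error;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Self::Error>;
}

// A growable in-memory buffer can never fail to accept bytes, so its
// error type is Infallible: Result<(), Infallible> has no Err value.
struct VecWriter(Vec<u8>);

impl PixelWriter for VecWriter {
    type Error = Infallible;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Infallible> {
        self.0.extend_from_slice(bytes);
        Ok(())
    }
}

fn main() {
    let mut w = VecWriter(Vec::new());
    // unwrap() on Result<_, Infallible> can never panic.
    w.write(&[1, 2, 3]).unwrap();
    assert_eq!(w.0, vec![1, 2, 3]);
}
```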