qoi-rust icon indicating copy to clipboard operation
qoi-rust copied to clipboard

RFE: encoder: add stride & various raw input formats support

Open elmarco opened this issue 10 months ago • 1 comments

The input format is not necessarily RGB or RGBA, and doing a pre-pass conversion can be quite costly (adding about 15-20% of total time from empirical study)

For simplicity reasons, I made sure not to break the existing API.

Those changes don't seem to affect the encoder performance in a significant way.

elmarco avatar Feb 12 '25 14:02 elmarco

See also: https://github.com/Devolutions/IronRDP/pull/670

elmarco avatar Feb 13 '25 11:02 elmarco

@aldanor hi, wdyt?

elmarco avatar Mar 18 '25 15:03 elmarco

@aldanor ping :thanks:

elmarco avatar Apr 15 '25 19:04 elmarco

@aldanor are you still maintaining the crate? thanks

elmarco avatar May 09 '25 10:05 elmarco

The api feels kinda confusing too me. It would be cool to control Input format and output format independently. I have the use case that I have an image with RGBA data where I know that the image has no translucency, so I want to save it as RGB. I don't know how I would do this with this PR. Which is why I created #15

Would be cool if it was possible to control the output format independently like in my pr.

And I don't really understand the stride argument. That should maybe be documented better.

Joshix-1 avatar May 22 '25 20:05 Joshix-1

The api feels kinda confusing too me. It would be cool to control Input format and output format independently. I have the use case that I have an image with RGBA data where I know that the image has no translucency, so I want to save it as RGB. I don't know how I would do this with this PR. Which is why I created #15

You can use the RawChannels::Rgbx or Bgrx, or Xbgr etc.. from this PR for that.

Would be cool if it was possible to control the output format independently like in my pr.

Well, you can't really mix input and output formats freely. You need alpha channel input for alpha channel output. And QOI has only two output formats, rgb and rgba...

And I don't really understand the stride argument. That should maybe be documented better.

https://learn.microsoft.com/en-us/windows/win32/medfound/image-stride

elmarco avatar May 24 '25 17:05 elmarco

You can use the RawChannels::Rgbx or Bgrx, or Xbgr etc.. from this PR for that.

True, didn't think about that. But what if someone wants to add an alpha channel?

You need alpha channel input for alpha channel output.

Adding an alpha channel with default values of 255 would be imho fine and not unexpected.

Joshix-1 avatar May 25 '25 15:05 Joshix-1

You need alpha channel input for alpha channel output.

Adding an alpha channel with default values of 255 would be imho fine and not unexpected.

Well, this use case goes beyond the typical simple encoding imo.

elmarco avatar May 27 '25 17:05 elmarco

Why is RawChannels::bytes_per_pixel not public? There should either be another Encoder constructor without a stride argument or bytes_per_pixel should be public. Otherwise the Encoder::new_raw API is imho too clunky to use if you have no stride.

Joshix-1 avatar May 28 '25 14:05 Joshix-1

Why is RawChannels::bytes_per_pixel not public? There should either be another Encoder constructor without a stride argument or bytes_per_pixel should be public. Otherwise the Encoder::new_raw API is imho too clunky to use if you have no stride.

We could make stride an Option. Alternatively, we could have an EncoderBuilder, but this might be overkill.

elmarco avatar May 28 '25 14:05 elmarco

Stride could even be Option<NonZeroUsize>, but Option<usize> would be fine as well

Joshix-1 avatar May 28 '25 14:05 Joshix-1

@elmarco Added some comments, please lmk if you want to continue working on this pr or not?

Also, you'd need to rebase your branch after removing commits from this branch as lots of fixes are already on main:

  • bench cli fix (plus --stream mode addon)
  • fuzz package fix
  • large buffer fix
  • run-starting edgecase fix
  • there's also commits in this pr that seemingly belong to the fork like renaming the package

There's also a few questions:

  • can you run the benchmark before/after on the encoder and post the results, to make sure that double nested loop in the encoder isn't affecting things much?
  • should there be any tests with the streaming mode? (it should just work, but still)

aldanor avatar Jul 28 '25 00:07 aldanor

@aldanor thanks for the review

* can you run the benchmark before/after on the encoder and post the results, to make sure that double nested loop in the encoder isn't affecting things much?

This is the biggest problem, it seems I didn't benchmark properly. I get a -33% encoding perf. I am trying to understand where it comes from and how to fix it. thanks

elmarco avatar Jul 28 '25 12:07 elmarco

I am trying to understand where it comes from and how to fix it

It might come from a double loop with now-unknown number of iterations in each preventing some common loop optimizations? I guess you can godbolt something equivalent.

Yea, -33% is pretty bad for sure... -3% would still be not nice but within noise bounds.

aldanor avatar Jul 28 '25 13:07 aldanor

Yea, -33% is pretty bad for sure... -3% would still be not nice but within noise bounds.

it turns out it's the assert_eq!(), this adds a bunch of panic handling code and prevent some optimizations. I switched it to debug_assert!(). Now, no performance regression is observable (with perf stat).

elmarco avatar Jul 28 '25 17:07 elmarco

assert_eq

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        279.8       381.8
         MB/s        876.6      1196.4
--------------------------------------
encode   Mp/s        224.4       206.9
         MB/s        703.3       648.2
--------------------------------------

debug_assert_eq

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        280.2       381.9
         MB/s        878.0      1196.6
--------------------------------------
encode   Mp/s        225.6       208.9
         MB/s        706.9       654.7
--------------------------------------

81c14c4b637c326a801e40701460b654d46e8fbe

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        278.7       364.3
         MB/s        873.4      1141.6
--------------------------------------
encode   Mp/s        218.3       254.8
         MB/s        684.2       798.3
--------------------------------------

For me it's still slower with debug_assert_eq

Joshix-1 avatar Jul 28 '25 17:07 Joshix-1

did you compile with --release ?

elmarco avatar Jul 28 '25 17:07 elmarco

I ran cargo run --release ../assets/ in the bench folder

Joshix-1 avatar Jul 28 '25 17:07 Joshix-1

(I switched to stable rust, as I get slightly different results with nightly which can create confusion)

@Joshix-1 try with the latest commit. You should not see performance loss (I actually observe a slight improvement something like 0.5%). Something is weird when enabling the "extra-source" feature, the encode_impl() don't get optimized the same way and we can observe -5-10%. It seems to be related to the Fn / closure somehow. But each format or fn specialization should be receiving an independant analysis and optimization no? It may be worth to ask/report to the compiler team. In the mean time, perhaps the "extra-source" feature flag is acceptable?

elmarco avatar Jul 28 '25 18:07 elmarco

ce61683e8682edb68f8f040f4cbbce1eae143fb7 has a similar effect. Seems to be some weird case where the compiler doesn't optimize

Joshix-1 avatar Jul 28 '25 18:07 Joshix-1

@Joshix-1 I am really curious to know how you found about adding this extra E/Infallible !

elmarco avatar Jul 28 '25 19:07 elmarco

Created a flamegraph and saw Try take a big chunk. grafik

Joshix-1 avatar Jul 28 '25 19:07 Joshix-1

I tried to make a subset test case to report to the compiler without success atm.

@Joshix-1 what to do next?:

  • use your workaround
  • add the feature flag I proposed
  • look for other solution (is stream writer really useful?)
  • wait for a compiler fix

thanks

elmarco avatar Jul 29 '25 08:07 elmarco

I think my commit could also improve the performance of master. Which would mean that this pr has to get even faster.

I have some ideas I want to test out. I'll do some more testing an benchmarking. Stream writer is imho useful, I don't see a reason to remove it

Joshix-1 avatar Jul 29 '25 12:07 Joshix-1

Could not improve performance of master. I would suggest using https://github.com/aldanor/qoi-rust/commit/50293f36d300511bbd5b16a6a1bb646ee63b98b8

With

[profile.release]
opt-level = 3
debug = true

in bench/Cargo.toml I get

$ cargo run --release ../assets/
Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        274.2       383.2
         MB/s        859.3      1200.9
--------------------------------------
encode   Mp/s        220.7       260.4
         MB/s        691.7       816.0
--------------------------------------

without the debug = true I get

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        280.9       370.7
         MB/s        880.2      1161.4
--------------------------------------
encode   Mp/s        226.5       252.6
         MB/s        709.8       791.6
--------------------------------------

on master I get:

Overall results: (7 images, 7.78 MB raw, 2.48 MP):
--------------------------------------
                     qoi.h    qoi-rust
--------------------------------------
decode   Mp/s        284.8       368.8
         MB/s        892.4      1155.7
--------------------------------------
encode   Mp/s        220.2       257.1
         MB/s        690.0       805.7
--------------------------------------

But I don't really know anymore, it's all really weird and unexpected

Joshix-1 avatar Jul 29 '25 17:07 Joshix-1

indeed, I am ok with https://github.com/aldanor/qoi-rust/commit/50293f36d300511bbd5b16a6a1bb646ee63b98b8, we can later revert it.

elmarco avatar Jul 29 '25 18:07 elmarco

Why revert it later? The change makes sense. Infallible methods shouldn't return results that could contain errors

Joshix-1 avatar Jul 29 '25 19:07 Joshix-1

The compiler should be able to infer that BytesMut Writer is in fact Infallable when inlining, I think that's what it's doing when "extra-source" is off.

elmarco avatar Jul 29 '25 20:07 elmarco