tweaks to modular effort tradeoffs
Also related to https://github.com/libjxl/libjxl/pull/4232 and https://github.com/libjxl/libjxl/pull/4154
Chipping away at the Pareto front, these tweaks aim to (slightly) improve the effort/density trade-offs.
Changes:
- After the bugfix at https://github.com/libjxl/libjxl/pull/4154 that caused `max_property_values` to actually get respected, we can bump up the number of property-value quantization buckets at all effort settings (which improves density at the cost of some speed, though overall the speed impact is small).
- The `nb_repeats` parameter (which can be configured via the API but by default is just 0.5 at all efforts) is now modulated by effort too, i.e. lower efforts also use fewer samples for MA tree learning. This speeds up lower efforts and slows down higher efforts, remaining neutral at the default effort. (See the sketch after this list.)
- Simplified/improved the tree learning heuristics a little, since the logic was a bit wonky: `adds_wp` could be false even though the candidate split does use the WP (when the parent node already used the WP), which could lead to selecting a suboptimal split, because the `fast_decode_multiplier` prefers a slightly worse split with `adds_wp == false` over a better split with `adds_wp == true` (which doesn't make sense if the WP is used in both options anyway). The simpler logic is slightly faster and denser (though the difference is small).
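A minimal sketch of the kind of effort-dependent parameter selection described above. The function name, thresholds, and bucket counts here are illustrative assumptions of mine, not the actual values used in this PR:

```cpp
#include <algorithm>

// Hypothetical helper, not libjxl code: pick MA-tree learning parameters
// based on the cjxl effort level (1 = fastest, 10 = densest).
struct ModularTreeParams {
  int max_property_values;  // number of property-value quantization buckets
  float nb_repeats;         // fraction of pixels sampled for MA tree learning
};

ModularTreeParams TreeParamsForEffort(int effort, float user_nb_repeats) {
  ModularTreeParams p;
  // More quantization buckets at higher effort: denser trees, slower search.
  p.max_property_values = effort < 7 ? 16 : 64;  // illustrative numbers only
  // Scale the sampling fraction with effort instead of using a flat default,
  // staying neutral at the default effort (7), then clamp to [0, 1].
  const float scale = effort / 7.0f;
  p.nb_repeats = std::min(1.0f, std::max(0.0f, user_nb_repeats * scale));
  return p;
}
```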
Before: (jyrki31 corpus)
31 images
Encoding kPixels Bytes BPP E MP/s D MP/s Max norm SSIMULACRA2 PSNR pnorm BPP*pnorm QABPP Bugs
----------------------------------------------------------------------------------------------------------------------------------------
jxl:d0:4 13270 17162582 10.3463459 5.225 49.279 nan 100.00000000 99.99 0.00000000 0.000000000000 10.346 0
jxl:d0:5 13270 16971925 10.2314097 2.893 39.969 nan 100.00000000 99.99 0.00000000 0.000000000000 10.231 0
jxl:d0:6 13270 16860935 10.1645001 1.849 35.470 nan 100.00000000 99.99 0.00000000 0.000000000000 10.165 0
jxl:d0:7 13270 16638016 10.0301149 1.188 31.430 nan 100.00000000 99.99 0.00000000 0.000000000000 10.030 0
jxl:d0:8 13270 16534367 9.9676308 0.319 31.807 nan 100.00000000 99.99 0.00000000 0.000000000000 9.968 0
jxl:d0:9 13270 16458308 9.9217791 0.235 29.942 nan 100.00000000 99.99 0.00000000 0.000000000000 9.922 0
Aggregate: 13270 16769172 10.1091812 1.164 35.760 0.00000000 100.00000000 99.99 0.00000000 0.000000000000 10.109 0
After:
31 images
Encoding kPixels Bytes BPP E MP/s D MP/s Max norm SSIMULACRA2 PSNR pnorm BPP*pnorm QABPP Bugs
----------------------------------------------------------------------------------------------------------------------------------------
jxl:d0:4 13270 17117248 10.3190166 6.899 44.328 nan 100.00000000 99.99 0.00000000 0.000000000000 10.319 0
jxl:d0:5 13270 16934902 10.2090906 3.290 39.130 nan 100.00000000 99.99 0.00000000 0.000000000000 10.209 0
jxl:d0:6 13270 16856572 10.1618699 2.078 35.075 nan 100.00000000 99.99 0.00000000 0.000000000000 10.162 0
jxl:d0:7 13270 16635589 10.0286518 1.167 32.788 nan 100.00000000 99.99 0.00000000 0.000000000000 10.029 0
jxl:d0:8 13270 16532724 9.9666403 0.305 32.073 nan 100.00000000 99.99 0.00000000 0.000000000000 9.967 0
jxl:d0:9 13270 16444861 9.9136727 0.215 30.934 nan 100.00000000 99.99 0.00000000 0.000000000000 9.914 0
Aggregate: 13270 16751992 10.0988244 1.238 35.433 0.00000000 100.00000000 99.99 0.00000000 0.000000000000 10.099 0
TL;DR: e4-e6 become faster and slightly denser (so just better), e7 stays about the same (a tiny bit denser and slower, maybe), e8+ become slightly denser and slower.
This reminds me: I was going to try re-enabling P15 at effort 9. It was previously disabled because e9 was slower than e10, but that only applies to images under 2048 x 2048, where Local MA trees (and effectively multithreading) are disabled.
Instead of that, though, we might explore a wider predictor overhaul: adding new options like P14 that try a subset of the most commonly used predictors, and possibly even replacing P14, given how slow Weighted is for en/decoding with only marginal improvement over Gradient in most cases. That will need to be tested and discussed, though.
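For illustration, a sketch of what such a "subset" option could look like. The particular predictors listed are an assumption on my part, not a tested proposal, and the enum/header are assumed to be libjxl's `Predictor` in `lib/jxl/modular/options.h`:

```cpp
#include <vector>

#include "lib/jxl/modular/options.h"  // assumed location of jxl::Predictor

// Hypothetical candidate set for a cheaper "try a subset" mode: the common
// cheap predictors plus Gradient, deliberately skipping Weighted because of
// its en/decode cost.
std::vector<jxl::Predictor> CommonPredictorSubset() {
  return {jxl::Predictor::Zero, jxl::Predictor::Left, jxl::Predictor::Top,
          jxl::Predictor::Select, jxl::Predictor::Gradient};
}
```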
I did some testing recently, and I think `-E 1` could be enabled at effort 9, with faster decoding level 2 defaulting it back to 0. It has a small en/decode speed penalty, but the density improvement can be better than that of `-P 15`, which is enabled at effort 10.
It should pair well with the higher MA sampling percentage in this PR, and it uses another feature that is disabled by default.
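A rough sketch of how that decision could look, assuming `-E` corresponds to the number of previous channels used as extra MA properties; the function and parameter names here are mine, not libjxl's:

```cpp
// Hypothetical decision logic, not actual libjxl code: enable one extra
// previous-channel property (-E 1) at effort 9 and above, unless the user
// asked for faster decoding (level 2 or higher), in which case fall back
// to the default of -E 0.
int NbPrevChannelsFor(int effort, int faster_decoding_level) {
  if (effort >= 9 && faster_decoding_level < 2) return 1;  // -E 1
  return 0;                                                // -E 0 (default)
}
```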
> I did some testing recently, and I think `-E 1` could be enabled at effort 9, with faster decoding level 2 defaulting it back to 0. It has a small en/decode speed penalty, but the density improvement can be better than that of `-P 15`, which is enabled at effort 10. It should pair well with the higher MA sampling percentage in this PR, and it uses another feature that is disabled by default.
That could make sense, yes. Let's do it in another PR though.
Rebased this.
Now the performance impact is as follows:
Before:
31 images
Encoding kPixels Bytes BPP E MP/s D MP/s Max norm SSIMULACRA2 PSNR pnorm BPP*pnorm QABPP Bugs
----------------------------------------------------------------------------------------------------------------------------------------
jxl:d0:4 13270 17162620 10.3463688 5.390 58.479 nan 100.00000000 99.99 0.00000000 0.000000000000 10.346 0
jxl:d0:5 13270 16908996 10.1934733 3.187 46.673 nan 100.00000000 99.99 0.00000000 0.000000000000 10.193 0
jxl:d0:6 13270 16797889 10.1264932 1.897 40.528 nan 100.00000000 99.99 0.00000000 0.000000000000 10.126 0
jxl:d0:7 13270 16625029 10.0222858 1.181 34.947 nan 100.00000000 99.99 0.00000000 0.000000000000 10.022 0
jxl:d0:8 13270 16478362 9.9338686 0.380 35.581 nan 100.00000000 99.99 0.00000000 0.000000000000 9.934 0
jxl:d0:9 13270 16385839 9.8780917 0.263 33.514 nan 100.00000000 99.99 0.00000000 0.000000000000 9.878 0
Aggregate: 13270 16724387 10.0821830 1.251 40.795 0.00000000 100.00000000 99.99 0.00000000 0.000000000000 10.082 0
After:
31 images
Encoding kPixels Bytes BPP E MP/s D MP/s Max norm SSIMULACRA2 PSNR pnorm BPP*pnorm QABPP Bugs
----------------------------------------------------------------------------------------------------------------------------------------
jxl:d0:4 13270 17117175 10.3189726 7.764 54.181 nan 100.00000000 99.99 0.00000000 0.000000000000 10.319 0
jxl:d0:5 13270 16872864 10.1716914 3.956 46.188 nan 100.00000000 99.99 0.00000000 0.000000000000 10.172 0
jxl:d0:6 13270 16793526 10.1238630 2.337 40.800 nan 100.00000000 99.99 0.00000000 0.000000000000 10.124 0
jxl:d0:7 13270 16622723 10.0208956 1.249 36.068 nan 100.00000000 99.99 0.00000000 0.000000000000 10.021 0
jxl:d0:8 13270 16474269 9.9314011 0.380 35.807 nan 100.00000000 99.99 0.00000000 0.000000000000 9.931 0
jxl:d0:9 13270 16371339 9.8693505 0.249 34.870 nan 100.00000000 99.99 0.00000000 0.000000000000 9.869 0
Aggregate: 13270 16706772 10.0715641 1.429 40.778 0.00000000 100.00000000 99.99 0.00000000 0.000000000000 10.072 0
The 'before' here is now better than the previous 'after' (other improvements have landed in the meantime), but this still looks like a Pareto improvement: at every effort setting, compression improves slightly, and encode speed either improves or stays about the same.
Is there any intent/reason behind `kSquirrel` not having a value defined for `nb_repeats`, or was this just an oversight?
`nb_repeats` is capped at 1 in this PR, so I'm not sure why it is also set to 1.1 for Kitten and 1.3 for Glacier.
I know higher values should increase the quantization percentage, but in that case `cparams_.options.nb_repeats = std::min(1.0f, cparams_.options.nb_repeats);` should cap at 10, not 1.
https://github.com/libjxl/libjxl/blob/7cac2ac860e41f7f4199b73508490016a8af204c/lib/jxl/modular/encoding/enc_ma.cc#L979
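To illustrate the concern, a minimal standalone snippet (not libjxl code) showing what that clamp does to the preset values in question:

```cpp
#include <algorithm>
#include <cstdio>
#include <initializer_list>

// With std::min(1.0f, nb_repeats), any preset above 1 (e.g. the 1.1 for
// Kitten or 1.3 for Glacier mentioned above) collapses to an effective 1.0.
int main() {
  for (float preset : {0.5f, 1.0f, 1.1f, 1.3f}) {
    std::printf("preset %.1f -> effective %.1f\n", preset,
                std::min(1.0f, preset));
  }
  return 0;
}
```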