libjxl more speedups for fjxl decoding

Same as https://github.com/libjxl/libjxl/pull/1149, but now without the change to ClampedGradient. That change wasn't helping much anyway, and it apparently wasn't safe when the values use the full int32_t range, as in the lossless pfm conformance test case.
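To illustrate why full-range int32_t inputs are a hazard for a clamped gradient predictor, here is a minimal sketch. It is not the code from this PR or from libjxl: the function names `ClampedGradientWide` and `GradientFitsInt32` are hypothetical, and the sketch assumes the usual definition of the predictor, N + W - NW clamped to the range spanned by N and W. The point is only that the intermediate sum can fall outside int32_t, so any shortcut that keeps it in 32 bits has to account for overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical reference version (not libjxl's implementation): the
// gradient N + W - NW is computed in 64 bits, so it cannot overflow
// even when all three inputs span the full int32_t range.
inline int32_t ClampedGradientWide(int32_t n, int32_t w, int32_t nw) {
  const int64_t lo = std::min(n, w);
  const int64_t hi = std::max(n, w);
  assert(lo <= hi);
  const int64_t grad = static_cast<int64_t>(n) + w - nw;
  // Clamp to [lo, hi]; the result always fits in int32_t because both
  // bounds are themselves int32_t values.
  return static_cast<int32_t>(std::min(std::max(grad, lo), hi));
}

// Hypothetical helper: reports whether the unclamped gradient would fit
// in int32_t, i.e. whether a purely 32-bit computation would be safe.
inline bool GradientFitsInt32(int32_t n, int32_t w, int32_t nw) {
  const int64_t grad = static_cast<int64_t>(n) + w - nw;
  return grad >= std::numeric_limits<int32_t>::min() &&
         grad <= std::numeric_limits<int32_t>::max();
}
```

For typical 8-bit or 16-bit samples `GradientFitsInt32` always returns true, which is presumably why the problem only showed up on a test case with full-range values.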

Mar 09 '22 10:03 jonsneyers

Can you add speed numbers with the new version?

Mar 09 '22 11:03 veluca93

Hm, strange: compared to the current git main branch, this doesn't seem to give much of a speedup at all anymore:

Before:
3072 x 2048, geomean: 39.96 MP/s [29.81, 41.69], 30 reps, 0 threads.
3072 x 2048, geomean: 68.28 MP/s [45.65, 72.18], 30 reps, 2 threads.
3072 x 2048, geomean: 102.01 MP/s [58.48, 109.59], 30 reps, 4 threads.

After:
3072 x 2048, geomean: 40.17 MP/s [27.99, 41.99], 30 reps, 0 threads.
3072 x 2048, geomean: 69.07 MP/s [48.04, 73.16], 30 reps, 2 threads.
3072 x 2048, geomean: 102.10 MP/s [64.97, 109.45], 30 reps, 4 threads.
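As a rough way to quantify "not much of a speedup", the relative change in the geomean throughput can be computed per thread count. The sketch below uses only the geomean figures quoted above; the `Run` struct, `kRuns` table and `SpeedupPercent` helper are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// One before/after pair of geomean throughputs, in MP/s.
struct Run {
  int threads;
  double before_mps;
  double after_mps;
};

// Geomean figures copied from the benchmark output above.
constexpr Run kRuns[] = {
    {0, 39.96, 40.17},
    {2, 68.28, 69.07},
    {4, 102.01, 102.10},
};

// Relative throughput change in percent; positive means "after" is faster.
inline double SpeedupPercent(const Run& r) {
  assert(r.before_mps > 0.0);
  return (r.after_mps / r.before_mps - 1.0) * 100.0;
}
```

With these inputs every configuration improves by less than 1.2%, which is small compared to the spread between the bracketed values reported for each run.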

Mar 09 '22 12:03 jonsneyers

I'd consider not doing it then :)

Mar 09 '22 13:03 veluca93

Agreed, doesn't make much sense to merge this if there is no real speedup.

I wonder why I was seeing more substantial speed improvements before though (see the numbers in the previous PR), so maybe leave this PR open for a while to remind me to investigate what happened there.

Mar 09 '22 17:03 jonsneyers