jxl-oxide
jxl-oxide copied to clipboard
`jxl-oxide` spends 62% of the time in `copy_nonoverlapping`
I have converted a large image I had lying around to a slightly lossy JPEG XL with cjxl -d 1:
gast1verandering2_015_d1.jxl.gz
Decoding it with djxl --disable_output takes 50ms. With jxl-oxide CLI, ignoring the time to encode and write the PNG, the decoding process takes 300ms, or 6x as long.
I've profiled the decoding process with samply, you can browse the results here: https://share.firefox.dev/48WOaNU
If you look at the inverted call stacks, 62% of the total time is spent in core::intrinsics::copy_nonoverlapping, literally just copying memory around.
Looking at the flame graph, the jxl_render::filter::epf::apply_epf function takes up 38% of the time, with its runtime almost entirely consists of copying and freeing memory. jxl_render::region::ImageWithRegion::try_clone also takes 30% of the time and consists purely of copying memory.
Eliminating these copies would boost performance by nearly 3x.
Optimizing edge-preserving filter is something I've tried before (#80), but it surely seems insufficient. Maybe I'll need a different approach/strategy to apply those filters...
The actual filtering in EPF only accounts for 5% of the total decoding runtime. It's the memory copying that it performs that takes up the vast majority of the time.
Could it be changed to accept &mut input and operate in-place instead of performing multiple copies along the way?
If you cannot perform the operation entirely in-place, it's fine to use a small scratch buffer (row-sized?) and copy to it and back to &mut input, since the small scratch buffer will always be in the CPU cache.
I mean, it needs more optimization including those copies and memory accesses. It should be more cache friendly indeed, I'll consider keeping small scratch buffer.
Now that I've profiled the execution with perf so I could see into the kernel, I see that copy_nonoverlapping itself took very little time and it was only slow because it called into the kernel that had to allocate more memory. The time spent inside copy_nonoverlapping itself is negligible. Profile showing it: https://share.firefox.dev/4dpr9Gw
This is just a symptom of #302, so I'm closing this in favor of that issue.
Sorry for the accidental misdirection!