Optimize NEON encoding for high dynamic range blocks
We can improve encoding performance on Apple silicon configurations with the following enhancements:
- Replace the switch with a for loop
- Invert `qlp_factors` and remove the `else` conditions, since every taken branch returns out of the function, making the `else` redundant (a simplified sketch of both changes follows the notes below)
13%-17% improvement, depending on compiler, configuration, and input .wav
Clang 16 or GCC 13 is required
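For illustration, here is a minimal sketch of both changes using hypothetical helper names (`predict_switch`, `predict_loop`); the real routine, `FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual`, is considerably more involved:

```c
#include <stdint.h>

/* Before: one hand-unrolled case per predictor order (shown only up to order 4). */
static int64_t predict_switch(const int32_t *data, const int32_t *qlp, unsigned order)
{
    int64_t sum = 0;
    switch (order) {
        case 4: sum += (int64_t)qlp[3] * data[-4]; /* fall through */
        case 3: sum += (int64_t)qlp[2] * data[-3]; /* fall through */
        case 2: sum += (int64_t)qlp[1] * data[-2]; /* fall through */
        case 1: sum += (int64_t)qlp[0] * data[-1]; break;
    }
    return sum;
}

/* After: a plain loop over the taps. With the coefficients stored in reverse
 * ("inverted") order, both arrays are walked forward. Any `else` following a
 * branch that returns is simply dropped, since control never reaches it. */
static int64_t predict_loop(const int32_t *data, const int32_t *qlp_reversed, unsigned order)
{
    int64_t sum = 0;
    for (int j = 0; j < (int)order; j++)
        sum += (int64_t)qlp_reversed[j] * data[j - (int)order];
    return sum;
}
```

With the switch gone and the access pattern made uniform, a loop of this shape is typically much easier for the compiler to auto-vectorize, e.g. onto NEON multiply-accumulate instructions.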
| | MBP18,2 (GCC 13) | MBP18,2 (Clang 16) | Mac14,12 (GCC 13) | Mac14,12 (Clang 16) | AMD Ryzen 9 7900X x 24 (GNU 13.2.1) |
|---|---|---|---|---|---|
| Original (s) | 6.554 | 6.176 | 5.725 | 5.377 | 5.018 |
| Optimized (s) | 5.642 | 5.131 | 4.803 | 4.672 | 5 |
| Overall improvement | 14% | 17% | 16% | 13% | 0% |
What kind of audio did you use to get these numbers?
Perhaps somewhat rude, but I am currently not inclined to accept this PR. This change is not covered by fuzzing and provides a modest improvement for a very niche use. So, I feel like the maintenance burden of this addition is not proportional to the gains.
Hi @ktmf01
The optimization is designed to improve encoding of high-resolution audio. The test WAVs are detailed below; they are all stereo, 24-bit-per-sample music files across many genres.
These results were measured on a 16-inch M2 MacBook Pro running macOS 14.4.
| | Duration (mm:ss) | Sample rate (kHz) | Base time (s) | Optimized time (s) | Improvement (×) |
|---|---|---|---|---|---|
| Sample 1 | 08:47 | 44.1 | 1.843 | 1.685 | 1.09 |
| Sample 2 | 22:35 | 48 | 4.856 | 4.376 | 1.11 |
| Sample 3 | 15:36 | 88.2 | 8.28 | 6.537 | 1.27 |
> This change is not covered by fuzzing and provides a modest improvement for a very niche use.
We'd be happy to review the PR further, including adding fuzzing tests if you'd like. A performance improvement for high-resolution audio seems in line with FLAC use cases, e.g. https://github.com/xiph/flac/issues/669
~~Please provide some more information on the tests, because I cannot explain the results you're getting. The problem I have with this is that the function you are optimizing should only be rarely used for 24-bit material, if at all. So, what kind of preset are you using, and what times are you measuring? Is it possible to provide gprof reports with and without this optimization on your hardware?~~
Okay, nevermind, this function is indeed used for 24-bit material. But why have you made this "NEON-exclusive"? It seems to me this code would work just fine for AMD64 as well and provide a performance benefit.
Regarding fuzzing, the problem is not in the code but in the hardware. Currently, fuzzing is not run on ARMv8/v9 hardware, meaning this code is not fuzzed because it is exclusive to that hardware. If it were not exclusive, fuzzing would not be a problem.
Appreciate the dialog, thank you for taking the time to review and consider these changes!
We made it exclusive because the benefit was much more pronounced on that architecture and flat on others. You are correct that it could be enabled for other architectures and that would also resolve the fuzzing issue.
We'd be happy to modify the PR so it is not constrained to NEON only. If there are specific WAVs and presets the project would like to see it tested against, let us know! We will update with the results using those inputs and the expanded architecture support.
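For context, a minimal sketch of what dropping the gate could look like, assuming the optimized variant currently sits behind an ARM-specific preprocessor check; `HAVE_ARM64_NEON` and `predict_loop` are placeholder names for illustration, not FLAC's actual build macros or functions:

```c
#include <stdint.h>

/* Before: the loop-based variant is compiled only on ARM64 builds, so the
 * x86-64 fuzzers never exercise this code path. */
#if defined(HAVE_ARM64_NEON)
int64_t predict_loop(const int32_t *data, const int32_t *qlp_reversed, unsigned order);
#endif

/* After: compiled unconditionally. The loop is plain C, so each compiler
 * vectorizes it for its own target (NEON, SSE/AVX, or not at all), and
 * fuzzing on any architecture now covers the same path. */
int64_t predict_loop(const int32_t *data, const int32_t *qlp_reversed, unsigned order);
```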
> Okay, nevermind, this function is indeed used for 24-bit material.
While this was not the intent of this PR, thank you for (indirectly) pointing me to this. I've merged two PRs, #700 and #702, which should make flac use the function you optimized, `FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual`, way less often.
Could you perhaps rebase this PR against the current state of the repository, and re-evaluate this PR?
Of course, stand by.
After testing with the changes merged in #700 and #702, the optimization opportunities identified by this PR are now negligible.
Thanks for the review and consideration!