Optimize NEON encoding for high dynamic range blocks
We can improve encoding performance on Apple silicon configurations with the following enhancements:
- Replace the switch with a for loop
- Invert `qlp_factors` and remove the `else` conditions, since every taken branch returns out of the function, making the `else` redundant (a simplified sketch of both changes follows the notes below)
13%-17% improvement, depending on compiler, configuration, and input .wav
Clang 16 or GCC 13 is required
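For illustration, here is a minimal sketch of both changes using hypothetical helper names (`predict_switch`, `predict_loop`); the real routine, `FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual`, is considerably more involved:

```c
#include <stdint.h>

/* Before: one hand-unrolled case per predictor order (shown only up to order 4). */
static int64_t predict_switch(const int32_t *data, const int32_t *qlp, unsigned order)
{
    int64_t sum = 0;
    switch (order) {
        case 4: sum += (int64_t)qlp[3] * data[-4]; /* fall through */
        case 3: sum += (int64_t)qlp[2] * data[-3]; /* fall through */
        case 2: sum += (int64_t)qlp[1] * data[-2]; /* fall through */
        case 1: sum += (int64_t)qlp[0] * data[-1]; break;
    }
    return sum;
}

/* After: a plain loop over the taps. With the coefficients stored in reverse
 * ("inverted") order, both arrays are walked forward. Any `else` following a
 * branch that returns is simply dropped, since control never reaches it. */
static int64_t predict_loop(const int32_t *data, const int32_t *qlp_reversed, unsigned order)
{
    int64_t sum = 0;
    for (int j = 0; j < (int)order; j++)
        sum += (int64_t)qlp_reversed[j] * data[j - (int)order];
    return sum;
}
```

With the switch gone and the access pattern made uniform, a loop of this shape is typically much easier for the compiler to auto-vectorize, e.g. onto NEON multiply-accumulate instructions.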
| | MBP18,2 (GCC 13) | MBP18,2 (Clang 16) | Mac14,12 (GCC 13) | Mac14,12 (Clang 16) | AMD Ryzen 9 7900X x 24 (GNU 13.2.1) |
|---|---|---|---|---|---|
| Original (s) | 6.554 | 6.176 | 5.725 | 5.377 | 5.018 |
| Optimized (s) | 5.642 | 5.131 | 4.803 | 4.672 | 5 |
| Overall improvement | 14% | 17% | 16% | 13% | 0% |
What kind of audio did you use to get these numbers?
Perhaps somewhat rude, but I am currently not inclined to accept this PR. This change is not covered by fuzzing and provides a modest improvement for a very niche use. So, I feel like the maintenance burden of this addition is not proportional to the gains.
Hi @ktmf01
The optimization is designed to improve encoding of high-resolution audio. The test WAVs are detailed below; they are all stereo, 24-bit-per-sample music files across many genres.
These results were measured on a 16-inch M2 MacBook Pro running macOS 14.4.
| | Duration (mm:ss) | Sample rate (kHz) | Base time (s) | Optimized time (s) | Improvement (×) |
|---|---|---|---|---|---|
| Sample 1 | 08:47 | 44.1 | 1.843 | 1.685 | 1.09 |
| Sample 2 | 22:35 | 48 | 4.856 | 4.376 | 1.11 |
| Sample 3 | 15:36 | 88.2 | 8.28 | 6.537 | 1.27 |
> This change is not covered by fuzzing and provides a modest improvement for a very niche use.
We'd be happy to review the PR further, including adding fuzzing tests if you'd like. A performance improvement for high-resolution audio seems in line with FLAC use cases, e.g. https://github.com/xiph/flac/issues/669
~~Please provide some more information on the tests, because I cannot explain the results you're getting. The problem I have with this is that the function you are optimizing should only be rarely used for 24-bit material, if at all. So, what kind of preset are you using, and what times are you measuring? Is it possible to provide gprof reports with and without this optimization on your hardware?~~
Okay, nevermind, this function is indeed used for 24-bit material. But why have you made this "NEON-exclusive"? It seems to me this code would work just fine for AMD64 as well and provide a performance benefit.
Regarding fuzzing, the problem is not in the code but in the hardware. Currently, fuzzing is not run on ARMv8/v9 hardware, meaning this code is not fuzzed because it is exclusive to that hardware. If it were not exclusive, fuzzing would not be a problem.
Appreciate the dialog, thank you for taking the time to review and consider these changes!
We made it exclusive because the benefit was much more pronounced on that architecture and flat on others. You are correct that it could be enabled for other architectures and that would also resolve the fuzzing issue.
We'd be happy to modify the PR so it is not constrained to NEON only. If there are specific WAVs and presets the project would like to see it tested against, let us know! We will update with the results using those inputs and the expanded architecture support.
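For context, a minimal sketch of what dropping the gate could look like, assuming the optimized variant currently sits behind an ARM-specific preprocessor check; `HAVE_ARM64_NEON` and `predict_loop` are placeholder names for illustration, not FLAC's actual build macros or functions:

```c
#include <stdint.h>

/* Before: the loop-based variant is compiled only on ARM64 builds, so the
 * x86-64 fuzzers never exercise this code path. */
#if defined(HAVE_ARM64_NEON)
int64_t predict_loop(const int32_t *data, const int32_t *qlp_reversed, unsigned order);
#endif

/* After: compiled unconditionally. The loop is plain C, so each compiler
 * vectorizes it for its own target (NEON, SSE/AVX, or not at all), and
 * fuzzing on any architecture now covers the same path. */
int64_t predict_loop(const int32_t *data, const int32_t *qlp_reversed, unsigned order);
```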
> Okay, nevermind, this function is indeed used for 24-bit material.
While this was not the intent of this PR, thank you for (indirectly) pointing me to this. I've merged two PRs, #700 and #702, which should make flac use the function you optimized, `FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual`, way less often.
Could you perhaps rebase this PR against the current state of the repository, and re-evaluate this PR?
Of course, stand by.
After testing with the changes merged in #700 and #702, the optimization opportunities identified by this PR are now negligible.
Thanks for the review and consideration!