
Decoding/Encoding with avx512

Open LGFae opened this issue 4 months ago • 3 comments

Have there been any attempts at this? I couldn't find any.

AVX-512 is now present in many commercially available CPUs, with the last two generations of AMD CPUs supporting it.

I would like to implement it myself, but I would like to know if there is any interest from the maintainers in having this kind of thing in the codebase.

From what I can tell, the code seems very neatly organized, so it would be a simple matter of re-implementing some files using the newer instructions. I believe it wouldn't be that difficult to maintain either.

LGFae avatar Aug 24 '25 13:08 LGFae

Hi,

Thanks for showing interest in helping to develop this. I know there have been people busy with this at hydrogenaudio.org, but not with intrinsics, just enabling the compiler to generate AVX512 and looking at the result.

I have mixed feelings about AVX-512, and intrinsics in general. First, I have no hardware to test any submission on. Intel has dropped AVX-512 support for consumer CPUs, and reintroduction will probably take a few years, so support is limited to specific Intel CPUs or quite recent AMD ones, neither of which I currently have access to. Also, I cannot rely on fuzzing or CI for testing this. So, if you were to write AVX-512 intrinsics, I feel I cannot merge them until I've had the chance to get my hands on such hardware to test.

In fact, I'd rather get rid of intrinsics code where possible. I've already removed quite a bit where I found that both GCC and Clang can autovectorize better (or nearly as well).

I do have a suggestion though. It might not be as challenging as writing AVX-512 code with intrinsics, and it is more about comparing results than writing new code, but it would be useful for machines without AVX-512 as well. Perhaps you can take a look at how the C code can be optimized such that GCC and Clang are better able to autovectorize it. In the past I've found that refactoring the code can give a dramatic improvement in autovectorization performance. That would, at least in theory, improve not just AVX-512 performance, but AVX2 and NEON as well.
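To illustrate the kind of refactoring meant here, below is a minimal sketch, not taken from the FLAC source. The function names and the `MAX_LAG` constant are hypothetical. The first version accumulates each lag in a scalar floating-point sum, which compilers generally will not vectorize without permission to reorder floating-point additions. The second version swaps the loops so that the inner loop has a compile-time trip count and updates independent accumulators, a shape that GCC and Clang may be able to vectorize for whatever SIMD width the target offers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LAG 8 /* hypothetical compile-time number of lags */

/* Straightforward form: for each lag, one scalar sum over all samples.
 * The inner loop is a floating-point reduction with a variable trip count. */
void autocorr_naive(const float *data, size_t n, double autoc[MAX_LAG])
{
    for (size_t lag = 0; lag < MAX_LAG; lag++) {
        double sum = 0.0;
        for (size_t i = lag; i < n; i++)
            sum += (double)data[i] * (double)data[i - lag];
        autoc[lag] = sum;
    }
}

/* Refactored form: samples in the outer loop, lags in the inner loop.
 * Each autoc[lag] is an independent accumulator, and the steady-state
 * inner loop always runs exactly MAX_LAG times. */
void autocorr_refactored(const float *restrict data, size_t n,
                         double *restrict autoc)
{
    assert(n >= MAX_LAG); /* assumption of this sketch */

    for (size_t lag = 0; lag < MAX_LAG; lag++)
        autoc[lag] = 0.0;

    /* Warm-up: the first MAX_LAG samples only have lags 0..i available. */
    for (size_t i = 0; i < MAX_LAG; i++)
        for (size_t lag = 0; lag <= i; lag++)
            autoc[lag] += (double)data[i] * (double)data[i - lag];

    /* Steady state: every sample contributes to all MAX_LAG lags. */
    for (size_t i = MAX_LAG; i < n; i++)
        for (size_t lag = 0; lag < MAX_LAG; lag++)
            autoc[lag] += (double)data[i] * (double)data[i - lag];
}
```

Both functions add the same products in the same order for each lag, so they should give identical results. Whether the second one actually vectorizes depends on the compiler and flags, so the generated assembly is worth checking.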

ktmf01 avatar Aug 24 '25 19:08 ktmf01

I understand. I've also had similar reservations about intrinsics since they tend to multiply the maintenance burden.

However, I believe that by just refactoring the code, we would lose runtime detection of CPU features, no?

LGFae avatar Aug 24 '25 21:08 LGFae

Not necessarily. Please take a look at

https://github.com/xiph/flac/blob/master/src/libFLAC/lpc_intrin_fma.c

combined with

https://github.com/xiph/flac/blob/master/src/libFLAC/deduplication/lpc_compute_autocorrelation_intrin.c

Similarly, BMI2 acceleration has been accomplished without intrinsics:

https://github.com/xiph/flac/blob/9547dbc2ddfca06a70ea937dbb605bbe78ea5f90/src/libFLAC/bitreader.c#L832-L842

These refer to:

https://github.com/xiph/flac/blob/master/src/libFLAC/deduplication/bitreader_read_rice_signed_block.c
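The general idea can be sketched as follows. This is a simplified illustration and not the FLAC code itself: all names here are hypothetical, and it assumes GCC or Clang on x86, where the `target` function attribute and `__builtin_cpu_supports` are available. The same plain C loop body is compiled twice, once with default settings and once with AVX2 enabled for that function only, and the variant is picked at runtime. The body is a macro here to keep the sketch in one block; it could just as well live in its own file that gets included inside each function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shared loop body, written once in plain C without intrinsics. */
#define SUM_SQUARES_BODY                             \
    uint64_t acc = 0;                                \
    for (size_t i = 0; i < n; i++)                   \
        acc += (uint64_t)((int64_t)x[i] * x[i]);     \
    return acc;

/* Baseline build of the body, using the default compiler flags. */
static uint64_t sum_squares_generic(const int32_t *x, size_t n)
{
    SUM_SQUARES_BODY
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_TARGET_DISPATCH 1
/* Same body, but the compiler may use AVX2 inside this function only. */
__attribute__((target("avx2")))
static uint64_t sum_squares_avx2(const int32_t *x, size_t n)
{
    SUM_SQUARES_BODY
}
#endif

typedef uint64_t (*sum_squares_fn)(const int32_t *, size_t);

/* Pick the best available variant for the CPU we are running on. */
static sum_squares_fn select_sum_squares(void)
{
#ifdef HAVE_TARGET_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_squares_avx2;
#endif
    return sum_squares_generic;
}

/* Public entry point. The lazy selection is not thread-safe; a real
 * library would choose the function pointer once during setup. */
uint64_t sum_squares(const int32_t *x, size_t n)
{
    static sum_squares_fn impl = NULL;
    assert(x != NULL || n == 0);
    if (impl == NULL)
        impl = select_sum_squares();
    return impl(x, n);
}
```

Calling `sum_squares(v, len)` then runs the AVX2 build on CPUs that report AVX2 and the generic build everywhere else, so runtime detection is kept even though no intrinsics are written by hand.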

ktmf01 avatar Aug 25 '25 12:08 ktmf01