
Remove all assembler

Open ktmf01 opened this issue 3 years ago • 2 comments

libFLAC still contains quite a bit of assembler. However, I'm unable to properly maintain this, and I'm not sure it is covered by fuzzing. Also, it only benefits 32-bit machines that lack SSE4.1, which was introduced in 2006. The last Intel processors lacking SSE4.1 were the Atom CPUs codenamed Penwell, which were succeeded by Merrifield in 2014. The last AMD processors lacking SSE4.1 were the Bobcat series, which was succeeded by Jaguar in 2013.

I revived an old computer I had in storage with an AMD Sempron 3000+ CPU, which has ISA extensions up to SSE3, thus lacking SSSE3 and SSE4.1.
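For anyone who wants to check which SSE-family extensions their own CPU reports, something along these lines works (a Linux-only sketch, not part of the original discussion; note that Linux reports SSE3 under the flag name `pni`):

```python
# Linux-only sketch: list the SSE-family extensions the kernel reports
# for the local CPU. On the Sempron discussed here this would show sse,
# sse2 and pni (= SSE3), but not ssse3 or sse4_1.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()  # no "flags" line (e.g. non-x86 kernels)

print(sorted(f for f in cpu_flags() if f.startswith(("sse", "ssse", "pni"))))
```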

There was no measurable difference for 16-bit inputs.

For 24-bit inputs, preset 2 (which only uses fixed subframes) gave a speed increase of about 3%. Preset 8 showed a speed decrease of about 8%, and -8p a decrease of 12%. To me, these decreases seem an acceptable trade-off. Note that this only applies to encoding on 32-bit machines lacking SSE4.1, and does not apply to 16-bit inputs.

I probably won't merge this for a while. Feedback is welcome.

ktmf01 avatar Sep 15 '22 19:09 ktmf01

I just did some more checking with an Intel Atom N450, which has ISA extensions up to SSSE3.

As with the AMD Sempron, and as expected, this made no difference for 16-bit inputs.

For 24-bit inputs encoded with compression level 2 no measurable difference was seen. For -8 the slowdown was less than 1%. For -8p the slowdown was 2%.

So, differences with this CPU are even smaller than with the Sempron.

ktmf01 avatar Sep 16 '22 19:09 ktmf01

More testing, this time with a Pentium 4 2.6GHz, Family 15, Model 2. As far as I can find out, that means it is codenamed Northwood, was built on 130nm lithography, and supports MMX, SSE and SSE2.

No changes for 16-bit inputs.

For 24-bit inputs encoded with compression level 2 no measurable difference was seen. For -8 the slowdown was 3%. For -8p the slowdown was 5%.

ktmf01 avatar Sep 22 '22 18:09 ktmf01

Are your performance numbers ASM vs -march=i686 (or anything similarly generic) or ASM vs -march=pentium4/bonnell/athlon-xp for the particular CPU? Given that such old CPUs are nowadays kind of a special case, generating optimized code for exactly the CPU one wants to target may be somewhat more manageable than it was when they were current.

Note that even though I am someone who likes to keep old systems around and still tries to properly support them, and regardless of the numbers you provided, I actually do agree with the decision to remove this assembler code from FLAC. I personally do not have any use case that depends on FLAC encode performance on such systems, though, so my opinion probably does not weigh much.

manxorist avatar Sep 23 '22 12:09 manxorist

Actually, this has nothing to do with march tunings. FLAC is by default configured to add -msse2 to the command line (and similarly in the Visual Studio project files). I made two static builds without any -march or -mtune, one with and one without this patch. So, these performance numbers are HEAD vs HEAD^, nothing else really.

Regarding support, this does not drop support for anything. You probably know this, but I just wanted to write it down for anyone reading along. FLAC can be compiled for pretty much anything; I did some testing on a 486 a while ago with no problems. It is just that FLAC is by default configured on x86 to add -msse2 to the command line (which is easy to undo), and this PR drops hardware-specific improvements. FLAC can still be compiled for a 486 (and probably even older), it will just not run as fast. I think that is fair, given how old these CPUs are.

ktmf01 avatar Sep 23 '22 13:09 ktmf01

Actually, this has nothing to do with march tunings. FLAC is by default configured to add -msse2 to the command line (and similar to Visual Studio project files).

I had originally missed that you have an SSE3 Sempron, so probably K8-based instead of Athlon-XP-based. -msse2 would not work on an Athlon-XP era Sempron, as they only have SSE1. Note that there were a lot of CPUs called "Sempron 3000+" with very different CPU cores and features sets (https://en.wikipedia.org/wiki/List_of_AMD_Sempron_processors).

I made two static builds without any -march or -mtune, one with and one without this patch. So, these performance numbers are HEAD vs HEAD^, nothing else really.

My point/question was only to gauge how much (if anything) of the performance loss of removing the x86 assembler implementation could potentially be re-gained by compiling explicitly for the specific CPU at hand.

In any case, as said, I personally am fine with removing the x86 assembler implementation.

manxorist avatar Sep 23 '22 14:09 manxorist

I've just tested on an AMD Phenom II X2 550, which is of a different generation than the Sempron.

Here's an overview of all tested systems:

| CPU | SSE | 24-bit audio -2 | 24-bit audio -8 | 24-bit audio -8p |
|---|---|---|---|---|
| AMD Sempron 3000+ | SSE3 | +3% | -8% | -12% |
| AMD Phenom II X2 550 | SSSE3 + SSE4a | 0% | -1% | -2% |
| Intel Pentium 4 2.6GHz, Family 15, Model 2 | SSE2 | 0% | -3% | -5% |
| Intel Atom N450 | SSSE3 | 0% | -1% | -2% |

On none of these systems was a measurable difference found for 16-bit audio.

ktmf01 avatar Oct 06 '22 19:10 ktmf01

The numbers provided by @ktmf01 do not reflect the cost on CPUs where the impact is expected to be most severe: those which do not support SSE2, and thus could not already use SSE2 for most algorithms anyway:

CPU: AMD Duron 1800 (AthlonXP core, MMX, MMXEXT, 3DNOW, 3DNOWEXT, SSE, no SSE2) Compiler: GCC 12.2.0

build settings:

  • 1.4.0: ./configure --disable-sse --enable-static --disable-shared
  • remove-asm: ./configure --disable-sse --enable-static --disable-shared
  • remove-asm-opt: CFLAGS="-march=native -mtune=native -O3 -funroll-loops" CXXFLAGS="-march=native -mtune=native -O3 -funroll-loops" ./configure --disable-sse --enable-static --disable-shared

test commands:

  • 16bit -2: time ./src/flac/flac -c -2 /tmp/16.wav > /dev/null
  • 16bit -8: time ./src/flac/flac -c -8 /tmp/16.wav > /dev/null
  • 16bit -8p: time ./src/flac/flac -c -8p /tmp/16.wav > /dev/null
  • 24bit -2: time ./src/flac/flac -c -2 /tmp/24.wav > /dev/null
  • 24bit -8: time ./src/flac/flac -c -8 /tmp/24.wav > /dev/null

Reported time is user time, best of 3 runs.
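The best-of-3 user-time methodology above could be automated along these lines (a sketch, not part of the original discussion; the original numbers were taken with the shell's `time` builtin, and the example command path is the one from the list above):

```python
import resource
import subprocess

def best_user_time(cmd, runs=3):
    """Run cmd several times; return the lowest child user time,
    mirroring the best-of-3 user-time methodology described above."""
    results = []
    for _ in range(runs):
        # RUSAGE_CHILDREN accumulates over waited-for children,
        # so measure the delta around each run.
        before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
        results.append(after - before)
    return min(results)

# Example, using the paths from the test commands above:
# best_user_time(["./src/flac/flac", "-c", "-8", "/tmp/16.wav"])
```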

| version | 16bit -2 | 16bit -8 | 16bit -8p | 24bit -2 | 24bit -8 |
|---|---|---|---|---|---|
| 1.4.0 | 1.252s | 10.483s | 47.617s | 2.000s | 15.302s |
| remove-asm | 1.395s (-11%) | 12.424s (-19%) | 66.886s (-40%) | 1.964s (+2%) | 16.742s (-9%) |
| remove-asm-opt | 1.447s (-16%) | 12.536s (-20%) | 67.768s (-42%) | 2.001s (-0%) | 16.911s (-11%) |
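The percentage figures appear to be the change in user time relative to the 1.4.0 baseline, with the sign flipped so that slower runs read as negative (my interpretation of the table, not stated explicitly above). A sketch:

```python
# Reproduce the speed-change column: negative means slower than the
# 1.4.0 baseline, positive means faster.
def speed_change(baseline, measured):
    return -(measured - baseline) / baseline * 100.0

# 16bit -8p, remove-asm vs 1.4.0 (timings from the table above):
print(round(speed_change(47.617, 66.886)))   # -40
```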

So, on systems without SSE2, the impact on 16bit input is actually way worse than the impact on 24bit input.

My hunch that some of the loss could be mitigated by tuning for the particular CPU turned out not to be true. I guess modern GCC no longer tunes properly for old CPUs.

I still agree with removing the x86 asm implementation because of maintenance reasons.

manxorist avatar Oct 17 '22 08:10 manxorist

Thanks for testing! I didn't have access to a pre-SSE2 CPU that I could run these tests on without too much trouble. I have a Pentium Pro around, but that has Windows 98 installed, on which FLAC doesn't work because it requires Unicode support.

I think the performance hit for -8 is fair, 20% isn't too bad. Frankly I expected way worse. -8p is IMO slow on modern CPUs already, so I wouldn't use it on CPUs that old anyway.

ktmf01 avatar Oct 17 '22 08:10 ktmf01

The OBJ_FORMAT stuff in configury can be removed too after this: https://github.com/xiph/flac/blob/master/configure.ac#L225-L234

sezero avatar Oct 17 '22 08:10 sezero

The OBJ_FORMAT stuff in configury can be removed too after this: https://github.com/xiph/flac/blob/master/configure.ac#L225-L234

Well, I was wrong about this one: the visibility attribute checks later on rely on OBJ_FORMAT for simplicity. That AC_SUBST(OBJ_FORMAT) can certainly be removed, though.

sezero avatar Oct 17 '22 18:10 sezero