llama.cpp
Enable Fused-Multiply-Add (FMA) and F16C/CVT16 vector extensions on MSVC
`__FMA__` and `__F16C__` are defined in GCC and Clang. They are not defined in MSVC; however, they are implied with AVX2/AVX512 (https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-160).

Thus, enable FMA and F16C in MSVC if either AVX2 or AVX512 is enabled.
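For context, a minimal sketch of the kind of preprocessor change this implies, assuming the macros are defined near the top of ggml.c before the SIMD code paths (the exact guards and placement in the PR may differ):

```c
/* Sketch only: MSVC does not define __FMA__ / __F16C__, but /arch:AVX2 and
 * /arch:AVX512 imply both extensions, so define the macros manually so the
 * existing GCC/Clang code paths are compiled. */
#if defined(_MSC_VER) && (defined(__AVX2__) || defined(__AVX512F__))
#ifndef __FMA__
#define __FMA__ 1
#endif
#ifndef __F16C__
#define __F16C__ 1
#endif
#endif
```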
It seems I'm too tired to find the button for converting this to a draft, but anyway: the `_cvtss_sh` and `_cvtsh_ss` intrinsics are still missing and not implemented yet, so don't merge yet.
I haven't checked the compiled output at the disassembly level yet, so especially in the case of F16C there is the question of to what extent the compiler had already optimized the generic ggml_compute_fp16_to_fp32 and fp32_to_fp16 functions to use the cvt/F16C instructions. The answer to that question also answers whether this change can bring a significant performance increase or do pretty much nothing at all.
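For reference, one possible way to express the missing scalar conversions using the vector F16C intrinsics that MSVC does provide; this is a sketch under that assumption, not necessarily the macros this PR will end up with:

```c
#include <immintrin.h>

/* Sketch: MSVC lacks the scalar _cvtsh_ss / _cvtss_sh intrinsics, but the
 * vector _mm_cvtph_ps / _mm_cvtps_ph forms are available, so the scalar
 * fp16<->fp32 conversions can be built from them. */
#if defined(_MSC_VER)
#define GGML_COMPUTE_FP16_TO_FP32(x) _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(x)))
#define GGML_COMPUTE_FP32_TO_FP16(x) _mm_extract_epi16(_mm_cvtps_ph(_mm_set_ss(x), 0), 0)
#else
#define GGML_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)
#endif
```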
AVX2/AVX512 also implies all the SIMD instructions being enabled, like SSE3.
Yeah, but `__SSE3__` isn't currently used, as `__AVX__` takes precedence over it, so I didn't add it (`#elif defined(__SSE3__)` comes after `#elif defined(__AVX__)`).

Edit: I guess it's a good addition anyway if it's possibly used in the future; it won't hurt.
does that macro even exist? https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160
It doesn't, that is the entire point.
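For illustration, the kind of #elif ladder referred to above (simplified, with a made-up function name; the real dispatch in ggml.c contains the actual SIMD implementations):

```c
/* When AVX/AVX2 is enabled, the __SSE3__ branch is never reached, so defining
 * __SSE3__ on MSVC would currently have no effect on this code path. */
const char * simd_path_example(void) {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#else
    return "scalar";
#endif
}
```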
Do you observe improved performance with this change?
This got FMA enabled while building from VS on Windows, on an 8th-gen i7. However, time per token seems to be the same (under 1% diff).
I'll have to take an in-depth look later, analysing the binary code and timing the performance; until then, no idea. In the case of FMA the difference between `_mm256_fmadd_ps(b, c, a)` and `_mm256_add_ps(_mm256_mul_ps(b, c), a)` is probably marginal. The impact from the F16C intrinsics could be greater. Obviously the results can also vary between different processor lines, but generally I'd expect the functions baked into the processor to do exactly that computation to perform better than general computation.

However, we've also seen before that this isn't always the case. For example, with AVX-512, at least a few years ago in the earliest Intel/AMD consumer SKUs to include this functionality, the implementation was less than stellar and in many cases made performance worse when used, but AFAIK that is still more an exception than the norm. That case was also an example of how these things can be hard to measure, since IIRC at least on the Intel models the problem wasn't that the AVX-512 calculations themselves weren't faster, but that using AVX-512 slowed down other calculations, which made the total impact negative.
That didn't really answer your question. :smile:
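For illustration, the two code shapes being compared, as a hedged sketch with a made-up helper name rather than verbatim ggml code:

```c
#include <immintrin.h>

/* With __FMA__ defined, the multiply and add fuse into one instruction with a
 * single rounding step; otherwise they are issued as two separate instructions. */
static inline __m256 madd_example(__m256 a, __m256 b, __m256 c) {
#if defined(__FMA__)
    return _mm256_fmadd_ps(b, c, a);
#else
    return _mm256_add_ps(_mm256_mul_ps(b, c), a);
#endif
}
```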
Thanks @lofcz for providing some initial testing. If anyone else wants to chip in with their results including the processor and model parameters of the test, that'd be greatly appreciated.
I'd expect this to increase performance in the range of +0% to +X%, but it is especially important to make sure that it does not decrease performance in any case.
These are my runtimes on my Ryzen 4500U (Zen 2)
The "without FMA" build is based on e4412b45e395981068d2850d3fa04cc16c77d70d, while the "with FMA" build just adds the commits in this pull request on top.
Without FMA runs faster?
Values are tokens / ms here:
EDIT: Maybe I'm wrong and that value is runtime? In that case the FMA patch improves performance by 6.24%?
I've tried to do the same as @nicknitewolf
I have an Intel Xeon W-2295
So I guess on my system there is little to no influence on the eval time; the sample time, however, seems to be a bit better, although the sample time has little effect on the total time.
The original system information and loading timings for the 7B and 65B models are:
The 7B timings:
The 65B timings:
After the modifications:
The 7B timings:
The 65B timings:
Huge thanks to @nicknitewolf and @KASR for providing some statistics. :+1: :partying_face:
I've concluded that, unfortunately, as my CPU is a dog and only has 4 threads total, I can't provide useful statistics myself, since even -t 4 would mean no free threads left for the OS itself, and thus external factors have too much impact to produce reliable results.
It seems that while @KASR's results are inconclusive, being inside the margin of error, @nicknitewolf did produce a significant 6.24% increase in performance.
@KASR, could you run the tests with -t 4 and see if there is a difference then? Your beast of a processor running 18 threads might give a different result from something less powerful. You could also simulate a lesser processor by locking the thread affinity to two cores, so the threads would stay locked to the same cores and the run couldn't benefit from the high core count.
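For illustration, one way such affinity locking could be done on Windows; this is a hedged sketch with a made-up helper, and the benchmarks in this thread were not necessarily run this way:

```c
#include <windows.h>
#include <stdio.h>

/* Pin the current process to logical processors 0 and 1 (mask 0x3) so its
 * threads cannot migrate to other cores, roughly simulating a 2-core CPU. */
static int pin_to_two_cores(void) {
    return SetProcessAffinityMask(GetCurrentProcess(), 0x3) != 0;
}

int main(void) {
    if (!pin_to_two_cores()) {
        fprintf(stderr, "SetProcessAffinityMask failed\n");
        return 1;
    }
    /* ... run or spawn the benchmark from here ... */
    return 0;
}
```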
Rebased the branch to master for easier testing.
Ah, I should have done that, setting a specified thread count; maybe that's why my results vary so much.
Even based on your tests alone, I would think that merging this is a good idea. All the other platforms already use the extensions, and MSVC is also supposed to use them, as they are implied with /arch:AVX2; the only problem here is really the #define not being set, so MSVC takes a slower code path than all the other compilers.
The reason for the variance is probably just Windows, unfortunately, and there is not much to be done about it except running more tests to decrease its significance. I'm actually still maining Windows 7 for exactly this reason, since a fresh Win7 installation with the crap cut down runs under 100 threads total when idle: ~5 W and ~0.5% CPU usage. Windows 10, on the other hand, has 800+ at any given time and can go into the thousands when some updates or whatever Store/Xbox/Cortana nonsense is running in the background. Open your task manager and check your thread count when you are supposedly doing nothing, and see all the bloat eating up your CPU cycles. It's really hard to properly perf-test anything on modern Windows, and most of the crap is baked so heavily into the system that it's nigh-impossible to remove it all without crippling the OS.

That being said, I do not recommend maining Win7 anymore, since hardware and software support is on its very last legs. Unfortunate, since the OS itself is pretty much perfect and 100% stable; I haven't had a system crash or even a malfunction in years.
@slaren And looking at the current alternative paints a pretty clear picture :smile:
And as seen in the "before" part, VEX-encoded instructions like vcvtsi2ss are still used, since AVX2 implies their use; it's just that the most optimized version wasn't used because of the missing flag.
After PR:

```asm
GGML_COMPUTE_FP16_TO_FP32 PROC ; COMDAT
movzx eax, cx
vmovd xmm0, eax
vcvtph2ps xmm0, xmm0
ret 0
GGML_COMPUTE_FP16_TO_FP32 ENDP
GGML_COMPUTE_FP32_TO_FP16 PROC ; COMDAT
vmovaps xmm1, xmm0
vxorps xmm0, xmm0, xmm0
vmovss3 xmm1, xmm0, xmm1
vcvtps2ph xmm2, xmm1, 0
vpextrw eax, xmm2, 0
ret 0
GGML_COMPUTE_FP32_TO_FP16 ENDP
```
Before PR:

```asm
GGML_COMPUTE_FP16_TO_FP32 PROC ; COMDAT
movzx eax, cx
shl eax, 16
mov edx, eax
and edx, -2147483648 ; 80000000H
lea ecx, DWORD PTR [rax+rax]
mov eax, ecx
shr eax, 4
add eax, 1879048192 ; 70000000H
mov DWORD PTR fp32$3[rsp], eax
mov eax, ecx
shr eax, 17
or eax, 1056964608 ; 3f000000H
mov DWORD PTR fp32$2[rsp], eax
cmp ecx, 134217728 ; 08000000H
jae SHORT $LN5@GGML_COMPU
vmovss xmm0, DWORD PTR fp32$2[rsp]
vsubss xmm1, xmm0, DWORD PTR __real@3f000000
vmovss DWORD PTR tv87[rsp], xmm1
mov eax, DWORD PTR tv87[rsp]
or eax, edx
mov DWORD PTR fp32$1[rsp], eax
vmovss xmm0, DWORD PTR fp32$1[rsp]
ret 0
$LN5@GGML_COMPU:
vmovss xmm0, DWORD PTR fp32$3[rsp]
vmulss xmm1, xmm0, DWORD PTR __real@07800000
vmovss DWORD PTR tv87[rsp], xmm1
mov eax, DWORD PTR tv87[rsp]
or eax, edx
mov DWORD PTR fp32$1[rsp], eax
vmovss xmm0, DWORD PTR fp32$1[rsp]
ret 0
GGML_COMPUTE_FP16_TO_FP32 ENDP
GGML_COMPUTE_FP32_TO_FP16 PROC ; COMDAT
$LN17:
sub rsp, 56 ; 00000038H
vmovaps XMMWORD PTR [rsp+32], xmm6
vmovaps xmm6, xmm0
vcvtss2sd xmm0, xmm6, xmm0
vmovq rcx, xmm0
call fabsf
vmovd r8d, xmm6
vmovaps xmm6, XMMWORD PTR [rsp+32]
mov ecx, 1895825408 ; 71000000H
vxorps xmm1, xmm1, xmm1
vcvtsi2ss xmm1, xmm1, eax
vmulss xmm2, xmm1, DWORD PTR __real@77800000
vmulss xmm3, xmm2, DWORD PTR __real@08800000
lea edx, DWORD PTR [r8+r8]
and r8d, -2147483648 ; 80000000H
mov eax, edx
and eax, -16777216 ; ff000000H
cmp eax, ecx
cmovb eax, ecx
shr eax, 1
add eax, 125829120 ; 07800000H
mov DWORD PTR fp32$1[rsp], eax
vaddss xmm1, xmm3, DWORD PTR fp32$1[rsp]
vmovd ecx, xmm1
mov eax, ecx
and ecx, 4095 ; 00000fffH
shr eax, 13
and eax, 31744 ; 00007c00H
add eax, ecx
mov ecx, 32256 ; 00007e00H
cmp edx, -16777216 ; ff000000H
cmova ax, cx
shr r8d, 16
or ax, r8w
add rsp, 56 ; 00000038H
ret 0
GGML_COMPUTE_FP32_TO_FP16 ENDP
```
Hold merging this until #546 is merged.
Yes, sure. I've updated to the newest commit (at the time of writing) and enabled AVX512. I only tested with the 7B model; let me know if it's interesting to also see the results for the 65B.
I've used the command: ./main -m ./models/7B/ggml-model-q4_0.bin -s 1679164839 -n 128 -t 4
Original settings:
After adjustments:
I've also added the results using -t 20 (I list the xx ms/token as the value):
So using -t 4 --> 6.55% speedup (which is very close to the value @nicknitewolf had); using -t 20 --> 5.15% speedup.
Much appreciated ! :+1:
Thanks to everyone taking part in this, with special thanks to @nicknitewolf and @KASR for taking the time to do benchmarks. While a 5% speed increase might not be very noticeable on its own, all the performance increases add up and are important as parts of the big picture.
After such well-made research we can definitely be confident that this PR is a go.
I'll merge this right after the CI is fixed (#546).