llama.cpp
Enable Fused-Multiply-Add (FMA) and F16C/CVT16 vector extensions on MSVC
`__FMA__` and `__F16C__` are defined in GCC and Clang. They are not defined in MSVC; however, they are implied with AVX2/AVX512 (https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-160).

Thus, enable FMA and F16C in MSVC if either AVX2 or AVX512 is enabled.
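For context, a minimal sketch of the kind of preprocessor change this implies, assuming the macros are defined near the top of ggml.c before the SIMD code paths (the exact guards and placement in the PR may differ):

```c
/* Sketch only: MSVC does not define __FMA__ / __F16C__, but /arch:AVX2 and
 * /arch:AVX512 imply both extensions, so define the macros manually so the
 * existing GCC/Clang code paths are compiled. */
#if defined(_MSC_VER) && (defined(__AVX2__) || defined(__AVX512F__))
#ifndef __FMA__
#define __FMA__ 1
#endif
#ifndef __F16C__
#define __F16C__ 1
#endif
#endif
```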
It seems I'm too tired to find the button for converting this to a draft, but anyway: the `_cvtss_sh` and `_cvtsh_ss` intrinsics are still missing and not implemented yet, so don't merge yet.
I haven't checked the compiled output at the disassembly level yet, so especially in the case of F16C there is the question of to what extent the compiler had already optimized the generic ggml_compute_fp16_to_fp32 and fp32_to_fp16 functions to use the cvt/F16C instructions. The answer to that question also answers whether this change can bring a significant performance increase or do pretty much nothing at all.
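For reference, one possible way to express the missing scalar conversions using the vector F16C intrinsics that MSVC does provide; this is a sketch under that assumption, not necessarily the macros this PR will end up with:

```c
#include <immintrin.h>

/* Sketch: MSVC lacks the scalar _cvtsh_ss / _cvtss_sh intrinsics, but the
 * vector _mm_cvtph_ps / _mm_cvtps_ph forms are available, so the scalar
 * fp16<->fp32 conversions can be built from them. */
#if defined(_MSC_VER)
#define GGML_COMPUTE_FP16_TO_FP32(x) _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(x)))
#define GGML_COMPUTE_FP32_TO_FP16(x) _mm_extract_epi16(_mm_cvtps_ph(_mm_set_ss(x), 0), 0)
#else
#define GGML_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
#define GGML_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)
#endif
```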
AVX2/AVX512 also implies all the SIMD instructions being enabled, like SSE3.
Yeah, but `__SSE3__` isn't currently used, as `__AVX__` takes precedence over it, so I didn't add it (`#elif defined(__SSE3__)` comes after `#elif defined(__AVX__)`).

Edit: I guess it's a good addition anyway if it's possibly used in the future; it won't hurt.
does that macro even exist? https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-160
It doesn't, that is the entire point.
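For illustration, the kind of #elif ladder referred to above (simplified, with a made-up function name; the real dispatch in ggml.c contains the actual SIMD implementations):

```c
/* When AVX/AVX2 is enabled, the __SSE3__ branch is never reached, so defining
 * __SSE3__ on MSVC would currently have no effect on this code path. */
const char * simd_path_example(void) {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE3__)
    return "SSE3";
#else
    return "scalar";
#endif
}
```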
Do you observe improved performance with this change?
This got FMA enabled while building from VS on Windows, on an 8th-gen i7. However, time per token seems to be the same (under 1% diff).
I'll have to take an in-depth look later, analysing the binary code and timing the performance; until then, no idea. In the case of FMA the difference between `_mm256_fmadd_ps(b, c, a)` and `_mm256_add_ps(_mm256_mul_ps(b, c), a)` is probably marginal. The impact from the F16C intrinsics could be greater. Obviously the results can also vary between different processor lines, but generally I'd expect the functions baked into the processor to do exactly that computation to perform better than general computation.

However, we've also seen before that this isn't always the case. For example, with AVX-512, at least a few years ago in the earliest Intel/AMD consumer SKUs to include this functionality, the implementation was less than stellar and in many cases made performance worse when used, but AFAIK that is still more an exception than the norm. That case was also an example of how these things can be hard to measure, since IIRC at least on the Intel models the problem wasn't that the AVX-512 calculations themselves weren't faster, but that using AVX-512 slowed down other calculations, which made the total impact negative.
That didn't really answer your question. :smile:
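For illustration, the two code shapes being compared, as a hedged sketch with a made-up helper name rather than verbatim ggml code:

```c
#include <immintrin.h>

/* With __FMA__ defined, the multiply and add fuse into one instruction with a
 * single rounding step; otherwise they are issued as two separate instructions. */
static inline __m256 madd_example(__m256 a, __m256 b, __m256 c) {
#if defined(__FMA__)
    return _mm256_fmadd_ps(b, c, a);
#else
    return _mm256_add_ps(_mm256_mul_ps(b, c), a);
#endif
}
```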
Thanks @lofcz for providing some initial testing. If anyone else wants to chip in with their results including the processor and model parameters of the test, that'd be greatly appreciated.
I'd expect this to increase performance in the range of +0% to +X%, but it is especially important to make sure that it does not decrease performance in any case.
These are my runtimes on my Ryzen 4500U (Zen 2)
The "without FMA" build is based on e4412b45e395981068d2850d3fa04cc16c77d70d, while the "with FMA" build just adds the commits in this pull request on top.
Without FMA runs faster?
Values are tokens / ms here:
EDIT: Maybe I'm wrong and that value is runtime? In that case the FMA patch improves performance by 6.24%?
I've tried to do the same as @nicknitewolf
I have an Intel Xeon W-2295
So I guess on my system there is little to no influence on the eval time; the sample time, however, seems to be a bit better, although the sample time has little effect on the total time.
The original system information and loading timings for the 7B and 65B models are:
The 7B timings:
The 65B timings:
After the modifications:
The 7B timings:
The 65B timings:
Huge thanks to @nicknitewolf and @KASR for providing some statistics. :+1: :partying_face:
I've concluded that, unfortunately, as my CPU is a dog and only has 4 threads total, I can't provide useful statistics myself, since even -t 4 would mean no free threads left for the OS itself, and thus external factors have too much impact to produce reliable results.
It seems that while @KASR's results are inconclusive, being inside the margin of error, @nicknitewolf did produce a significant 6.24% increase in performance.
@KASR, could you run the tests with -t 4 and see if there is a difference then? Your beast of a processor running 18 threads might give a different result from something less powerful. You could also simulate a lesser processor by locking the thread affinity to two cores, so the threads would stay locked to the same cores and the run couldn't benefit from the high core count.
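For illustration, one way such affinity locking could be done on Windows; this is a hedged sketch with a made-up helper, and the benchmarks in this thread were not necessarily run this way:

```c
#include <windows.h>
#include <stdio.h>

/* Pin the current process to logical processors 0 and 1 (mask 0x3) so its
 * threads cannot migrate to other cores, roughly simulating a 2-core CPU. */
static int pin_to_two_cores(void) {
    return SetProcessAffinityMask(GetCurrentProcess(), 0x3) != 0;
}

int main(void) {
    if (!pin_to_two_cores()) {
        fprintf(stderr, "SetProcessAffinityMask failed\n");
        return 1;
    }
    /* ... run or spawn the benchmark from here ... */
    return 0;
}
```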
Rebased the branch to master for easier testing.
Ah, I should have done that, setting a specified thread count; maybe that's why my results vary so much.
Even based on your tests alone, I would think that merging this is a good idea. All the other platforms already use the extensions, and MSVC is also supposed to use them, as they are implied with /arch:AVX2; the only problem here is really the #define not being set, so MSVC takes a slower code path than all the other compilers.
The reason for the variance is probably just Windows, unfortunately, and there is not much to be done about it except running more tests to decrease its significance. I'm actually still maining Windows 7 for exactly this reason, since a fresh Win7 installation with the crap cut down runs under 100 threads total when idle: ~5 W and ~0.5% CPU usage. Windows 10, on the other hand, has 800+ at any given time and can go into the thousands when some updates or whatever Store/Xbox/Cortana nonsense is running in the background. Open your task manager and check your thread count when you are supposedly doing nothing, and see all the bloat eating up your CPU cycles. It's really hard to properly perf-test anything on modern Windows, and most of the crap is baked so heavily into the system that it's nigh-impossible to remove it all without crippling the OS.

That being said, I do not recommend maining Win7 anymore, since hardware and software support is on its very last legs. Unfortunate, since the OS itself is pretty much perfect and 100% stable; I haven't had a system crash or even a malfunction in years.
@slaren And looking at the current alternative paints a pretty clear picture :smile:
And as seen in the "before" part, VEX-encoded instructions like vcvtsi2ss are still used, since AVX2 implies their use; it's just that the most optimized version wasn't used because of the missing flag.
After PR:

```asm
GGML_COMPUTE_FP16_TO_FP32 PROC ; COMDAT
movzx eax, cx
vmovd xmm0, eax
vcvtph2ps xmm0, xmm0
ret 0
GGML_COMPUTE_FP16_TO_FP32 ENDP
GGML_COMPUTE_FP32_TO_FP16 PROC ; COMDAT
vmovaps xmm1, xmm0
vxorps xmm0, xmm0, xmm0
vmovss3 xmm1, xmm0, xmm1
vcvtps2ph xmm2, xmm1, 0
vpextrw eax, xmm2, 0
ret 0
GGML_COMPUTE_FP32_TO_FP16 ENDP
```
Before PR:

```asm
GGML_COMPUTE_FP16_TO_FP32 PROC ; COMDAT
movzx eax, cx
shl eax, 16
mov edx, eax
and edx, -2147483648 ; 80000000H
lea ecx, DWORD PTR [rax+rax]
mov eax, ecx
shr eax, 4
add eax, 1879048192 ; 70000000H
mov DWORD PTR fp32$3[rsp], eax
mov eax, ecx
shr eax, 17
or eax, 1056964608 ; 3f000000H
mov DWORD PTR fp32$2[rsp], eax
cmp ecx, 134217728 ; 08000000H
jae SHORT $LN5@GGML_COMPU
vmovss xmm0, DWORD PTR fp32$2[rsp]
vsubss xmm1, xmm0, DWORD PTR __real@3f000000
vmovss DWORD PTR tv87[rsp], xmm1
mov eax, DWORD PTR tv87[rsp]
or eax, edx
mov DWORD PTR fp32$1[rsp], eax
vmovss xmm0, DWORD PTR fp32$1[rsp]
ret 0
$LN5@GGML_COMPU:
vmovss xmm0, DWORD PTR fp32$3[rsp]
vmulss xmm1, xmm0, DWORD PTR __real@07800000
vmovss DWORD PTR tv87[rsp], xmm1
mov eax, DWORD PTR tv87[rsp]
or eax, edx
mov DWORD PTR fp32$1[rsp], eax
vmovss xmm0, DWORD PTR fp32$1[rsp]
ret 0
GGML_COMPUTE_FP16_TO_FP32 ENDP
GGML_COMPUTE_FP32_TO_FP16 PROC ; COMDAT
$LN17:
sub rsp, 56 ; 00000038H
vmovaps XMMWORD PTR [rsp+32], xmm6
vmovaps xmm6, xmm0
vcvtss2sd xmm0, xmm6, xmm0
vmovq rcx, xmm0
call fabsf
vmovd r8d, xmm6
vmovaps xmm6, XMMWORD PTR [rsp+32]
mov ecx, 1895825408 ; 71000000H
vxorps xmm1, xmm1, xmm1
vcvtsi2ss xmm1, xmm1, eax
vmulss xmm2, xmm1, DWORD PTR __real@77800000
vmulss xmm3, xmm2, DWORD PTR __real@08800000
lea edx, DWORD PTR [r8+r8]
and r8d, -2147483648 ; 80000000H
mov eax, edx
and eax, -16777216 ; ff000000H
cmp eax, ecx
cmovb eax, ecx
shr eax, 1
add eax, 125829120 ; 07800000H
mov DWORD PTR fp32$1[rsp], eax
vaddss xmm1, xmm3, DWORD PTR fp32$1[rsp]
vmovd ecx, xmm1
mov eax, ecx
and ecx, 4095 ; 00000fffH
shr eax, 13
and eax, 31744 ; 00007c00H
add eax, ecx
mov ecx, 32256 ; 00007e00H
cmp edx, -16777216 ; ff000000H
cmova ax, cx
shr r8d, 16
or ax, r8w
add rsp, 56 ; 00000038H
ret 0
GGML_COMPUTE_FP32_TO_FP16 ENDP
```
Hold merging this until #546 is merged.
Yes, sure. I've updated to the newest commit (at the time of writing) and enabled AVX512. I only tested with the 7B model; let me know if it's interesting to also see the results for the 65B.
I've used the command: ./main -m ./models/7B/ggml-model-q4_0.bin -s 1679164839 -n 128 -t 4
Original settings:
After adjustments:
I've also added the results using -t 20 (I list the xx ms/token as the value):
So using -t 4 --> 6.55% speedup (which is very close to the value @nicknitewolf had); using -t 20 --> 5.15% speedup.
Much appreciated ! :+1:
Thanks to everyone taking part in this, with special thanks to @nicknitewolf and @KASR for taking the time to do benchmarks. While a 5% speed increase might not be very noticeable on its own, all the performance increases add up and are important as parts of the big picture.
After such well-made research we can definitely be confident that this PR is a go.
I'll merge this right after the CI is fixed (#546).