llama.cpp
AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring
Apart from adding the AVX2 optimization for Q4_3, this refactors some commonly used intrinsic sequences into inline
functions.
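As an illustration of the kind of helper such a refactoring introduces, here is a sketch (the name `bytes_from_nibbles_32` and the exact instruction sequence are assumptions, not necessarily the code in this PR): it unpacks 32 packed 4-bit nibbles into 32 bytes, with the low nibbles in the lower 128-bit lane and the high nibbles in the upper lane.

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical inline helper (name assumed): unpack 32 4-bit nibbles,
// stored two per byte in 16 input bytes, into 32 separate bytes.
// Low nibbles land in the lower 128-bit lane, high nibbles in the upper one.
__attribute__((target("avx2")))
static inline __m256i bytes_from_nibbles_32(const uint8_t * rsi) {
    const __m128i tmp      = _mm_loadu_si128((const __m128i *) rsi);
    const __m256i bytes    = _mm256_set_m128i(_mm_srli_epi16(tmp, 4), tmp);
    const __m256i low_mask = _mm256_set1_epi8(0x0F);
    return _mm256_and_si256(low_mask, bytes);
}

// Plain-C wrapper for testing: write the 32 unpacked bytes to out.
__attribute__((target("avx2")))
static void unpack_nibbles_32(const uint8_t * in, uint8_t * out) {
    _mm256_storeu_si256((__m256i *) out, bytes_from_nibbles_32(in));
}
```

Tagging the functions with `target("avx2")` lets the file compile without `-mavx2`; the caller is still responsible for only reaching this path on AVX2 hardware.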
q4_3:
```
42.94 seconds per pass - ETA 7.81 hours
prompt eval time = 54411.09 ms / 631 tokens (86.23 ms per token)  bs=512
prompt eval time = 59126.51 ms / 631 tokens (93.70 ms per token)  bs=8
       eval time = 41742.69 ms / 255 runs  (163.70 ms per run)
```

q4_1:
```
35.42 seconds per pass - ETA 6.45 hours
prompt eval time = 52762.72 ms / 631 tokens (83.62 ms per token)  bs=512
prompt eval time = 56287.43 ms / 631 tokens (89.20 ms per token)  bs=8
       eval time = 41024.48 ms / 255 runs  (160.88 ms per run)
```
Except for the perplexity run, the performance looks good compared to q4_1; I'm not sure why there is a discrepancy there.
Before merging this: the current Q4_3 format / implementation is not very efficient with ARM NEON.

Time per token on M1 Pro:

- Q4_0: 48 ms
- Q4_1: 55 ms
- Q4_2: 48 ms
- Q4_3: 87 ms
I want to make it close to ~50-60 ms / token, but I think we might have to change the format if the optimization from #1083 does not work out. Will try to optimize this with highest priority, so we can decide on the final Q4_3 format.
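For reference, the Q4_3 block layout under discussion looks like this (a sketch using the ggml naming convention; field names assumed): a 16-weight block with an fp16 scale and an fp16 min, so a weight dequantizes as `x = d * q + m`.

```c
#include <stdint.h>

#define QK4_3 16

// Sketch of the current Q4_3 block layout (field names per ggml convention):
// fp16 scale d, fp16 min m, then 16 4-bit quants packed two per byte.
// Each weight dequantizes as x = d * q + m.
typedef struct {
    uint16_t d;              // scale, fp16 bits
    uint16_t m;              // min, fp16 bits
    uint8_t  qs[QK4_3 / 2];  // 16 nibbles, two per byte
} block_q4_3;

// 12 bytes per 16 weights, i.e. 6 bits per weight effective size.
```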
Well #1083 was a bit rushed IMO, but I tried to address the loose ends.
For the horizontal sum of ints, I could not see a speed difference between @ikawrakow's original code and @pubby's suggestion, which ended up as commented-out code. The latter is AVX2-only, while the original should also work on plain AVX.
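For context, the AVX-compatible variant looks roughly like the following sketch (not the exact PR code): `vextractf128` works on integer vectors too, so only the initial 256-bit extract needs AVX and the rest of the reduction is plain SSE2. The AVX2-only alternative would use `_mm256_extracti128_si256` instead.

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch of a horizontal sum of eight 32-bit ints (not the exact PR code).
// Only the 256-bit extract needs AVX; the shuffles and adds are SSE2.
__attribute__((target("avx")))
static int hsum_i32_8(__m256i v) {
    const __m128i lo = _mm256_castsi256_si128(v);
    const __m128i hi = _mm256_extractf128_si256(v, 1);
    __m128i s = _mm_add_epi32(lo, hi);                                   // 4 partial sums
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2))); // 2 partial sums
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1))); // final sum in lane 0
    return _mm_cvtsi128_si32(s);
}

// Plain-C wrapper for testing: sum 8 ints loaded from memory.
__attribute__((target("avx")))
static int hsum8(const int32_t * a) {
    return hsum_i32_8(_mm256_loadu_si256((const __m256i *) a));
}
```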
Finally, I don't think there is a speed difference between the horizontal-sum variants. I have now finished the AVX optimization for quantize_row_q8_0, but I can't rely on the compiler to catch it if I accidentally use an AVX2 intrinsic. It would be great if someone with an AVX-only machine could test this.