llama.cpp
AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring
Apart from adding the AVX2 optimization for Q4_3, this refactors some commonly used intrinsic sequences into inline
functions.
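As an illustration of the kind of helper such a refactoring introduces, here is a sketch (the name `bytes_from_nibbles_32` and the exact instruction sequence are assumptions, not necessarily the code in this PR): it unpacks 32 packed 4-bit nibbles into 32 bytes, with the low nibbles in the lower 128-bit lane and the high nibbles in the upper lane.

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical inline helper (name assumed): unpack 32 4-bit nibbles,
// stored two per byte in 16 input bytes, into 32 separate bytes.
// Low nibbles land in the lower 128-bit lane, high nibbles in the upper one.
__attribute__((target("avx2")))
static inline __m256i bytes_from_nibbles_32(const uint8_t * rsi) {
    const __m128i tmp      = _mm_loadu_si128((const __m128i *) rsi);
    const __m256i bytes    = _mm256_set_m128i(_mm_srli_epi16(tmp, 4), tmp);
    const __m256i low_mask = _mm256_set1_epi8(0x0F);
    return _mm256_and_si256(low_mask, bytes);
}

// Plain-C wrapper for testing: write the 32 unpacked bytes to out.
__attribute__((target("avx2")))
static void unpack_nibbles_32(const uint8_t * in, uint8_t * out) {
    _mm256_storeu_si256((__m256i *) out, bytes_from_nibbles_32(in));
}
```

Tagging the functions with `target("avx2")` lets the file compile without `-mavx2`; the caller is still responsible for only reaching this path on AVX2 hardware.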
q4_3:
```
42.94 seconds per pass - ETA 7.81 hours
prompt eval time = 54411.09 ms / 631 tokens (86.23 ms per token)  bs=512
prompt eval time = 59126.51 ms / 631 tokens (93.70 ms per token)  bs=8
       eval time = 41742.69 ms / 255 runs  (163.70 ms per run)
```

q4_1:
```
35.42 seconds per pass - ETA 6.45 hours
prompt eval time = 52762.72 ms / 631 tokens (83.62 ms per token)  bs=512
prompt eval time = 56287.43 ms / 631 tokens (89.20 ms per token)  bs=8
       eval time = 41024.48 ms / 255 runs  (160.88 ms per run)
```
Except for the perplexity run, the performance looks good compared to q4_1; I'm not sure why there is a discrepancy there.
Before merging this: the current Q4_3 format / implementation is not very efficient with ARM NEON.

Time per token on M1 Pro:

- Q4_0: 48 ms
- Q4_1: 55 ms
- Q4_2: 48 ms
- Q4_3: 87 ms
I want to make it close to ~50-60 ms / token, but I think we might have to change the format if the optimization from #1083 does not work out. Will try to optimize this with highest priority, so we can decide on the final Q4_3 format.
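For reference, the Q4_3 block layout under discussion looks like this (a sketch using the ggml naming convention; field names assumed): a 16-weight block with an fp16 scale and an fp16 min, so a weight dequantizes as `x = d * q + m`.

```c
#include <stdint.h>

#define QK4_3 16

// Sketch of the current Q4_3 block layout (field names per ggml convention):
// fp16 scale d, fp16 min m, then 16 4-bit quants packed two per byte.
// Each weight dequantizes as x = d * q + m.
typedef struct {
    uint16_t d;              // scale, fp16 bits
    uint16_t m;              // min, fp16 bits
    uint8_t  qs[QK4_3 / 2];  // 16 nibbles, two per byte
} block_q4_3;

// 12 bytes per 16 weights, i.e. 6 bits per weight effective size.
```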
Well #1083 was a bit rushed IMO, but I tried to address the loose ends.
For the horizontal sum of ints, I could not see a speed difference between @ikawrakow's original code and @pubby's suggestion, which ended up as commented-out code. The latter is AVX2-only, while the original should also work on plain AVX.
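For context, the AVX-compatible variant looks roughly like the following sketch (not the exact PR code): `vextractf128` works on integer vectors too, so only the initial 256-bit extract needs AVX and the rest of the reduction is plain SSE2. The AVX2-only alternative would use `_mm256_extracti128_si256` instead.

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch of a horizontal sum of eight 32-bit ints (not the exact PR code).
// Only the 256-bit extract needs AVX; the shuffles and adds are SSE2.
__attribute__((target("avx")))
static int hsum_i32_8(__m256i v) {
    const __m128i lo = _mm256_castsi256_si128(v);
    const __m128i hi = _mm256_extractf128_si256(v, 1);
    __m128i s = _mm_add_epi32(lo, hi);                                   // 4 partial sums
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2))); // 2 partial sums
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1))); // final sum in lane 0
    return _mm_cvtsi128_si32(s);
}

// Plain-C wrapper for testing: sum 8 ints loaded from memory.
__attribute__((target("avx")))
static int hsum8(const int32_t * a) {
    return hsum_i32_8(_mm256_loadu_si256((const __m256i *) a));
}
```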
Finally, I don't think there is a speed difference between the horizontal-sum variants. I have now finished the AVX optimization for quantize_row_q8_0, but I can't rely on the compiler to catch it if I accidentally use an AVX2 intrinsic. It would be great if someone with an AVX-only machine could test this.