llama.cpp
Adding a simple program to measure speed of dot products
I was surprised by the belief that, for the dot product x * y, where x holds quantized model weights and y contains floating point values, it is faster to quantize y and perform the dot product using the quantized y values (accepting the associated loss in precision) than to just compute x * y directly. So, I had to try it myself. This PR adds a simple program that measures the time it takes to perform the dot product between vectors holding 1 << 18 values. I picked a relatively large vector size to avoid getting involved with the science of accurately measuring elapsed time for short-lived operations.
Basically, we fill two vectors x and y with random values and quantize x into q. We then measure the time for
1. Computing d = q * y directly
2. Computing y' = quantize(y); d = q * y'
For 2. we use the vectorized (SIMD-ified) functions from ggml (or, if requested by a command line argument, the corresponding scalar functions from ggml).
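The two paths above can be sketched in scalar C++ as follows. This is only an illustration under stated assumptions: the block layout (BlockQ4, 32 values per block, one float scale, nibbles stored as q + 8) and the function names quantize_q4, dot_direct and dot_quantized are hypothetical and are not taken from ggml or from this PR. In particular, the sketch assumes that y is quantized with the same 4-bit scheme as x, which may differ from what ggml actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int QK = 32;  // values per block (assumed for this sketch)

// Hypothetical simplified 4-bit block: one float scale plus 32 nibbles.
struct BlockQ4 {
    float   d;           // per-block scale
    uint8_t qs[QK / 2];  // two 4-bit values per byte, each stored as q + 8
};

// Quantize n floats (n must be a multiple of QK) into 4-bit blocks.
std::vector<BlockQ4> quantize_q4(const float * x, int n) {
    assert(n % QK == 0);
    std::vector<BlockQ4> out(n / QK);
    for (int b = 0; b < n / QK; ++b) {
        const float * xb = x + b * QK;
        float amax = 0.0f;
        for (int i = 0; i < QK; ++i) amax = std::max(amax, std::fabs(xb[i]));
        const float d  = amax / 7.0f;                 // map [-amax, amax] to [-7, 7]
        const float id = d > 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < QK; i += 2) {
            const int q0 = (int) std::lround(xb[i]     * id) + 8;  // in [1, 15]
            const int q1 = (int) std::lround(xb[i + 1] * id) + 8;
            out[b].qs[i / 2] = (uint8_t) (q0 | (q1 << 4));
        }
    }
    return out;
}

// Path 1: multiply the dequantized integers directly with the float values of y.
float dot_direct(const std::vector<BlockQ4> & q, const float * y) {
    float sum = 0.0f;
    for (std::size_t b = 0; b < q.size(); ++b) {
        const float * yb = y + b * QK;
        float s = 0.0f;
        for (int i = 0; i < QK; i += 2) {
            const uint8_t v = q[b].qs[i / 2];
            s += (float) ((v & 0x0F) - 8) * yb[i] + (float) ((v >> 4) - 8) * yb[i + 1];
        }
        sum += q[b].d * s;  // apply the block scale once per block
    }
    return sum;
}

// Path 2: quantize y first, then do an integer dot product per block.
float dot_quantized(const std::vector<BlockQ4> & q, const float * y, int n) {
    const std::vector<BlockQ4> qy = quantize_q4(y, n);
    assert(qy.size() == q.size());
    float sum = 0.0f;
    for (std::size_t b = 0; b < q.size(); ++b) {
        int isum = 0;
        for (int i = 0; i < QK / 2; ++i) {
            const uint8_t vx = q[b].qs[i];
            const uint8_t vy = qy[b].qs[i];
            isum += ((vx & 0x0F) - 8) * ((vy & 0x0F) - 8)
                  + ((vx >> 4)   - 8) * ((vy >> 4)   - 8);
        }
        sum += q[b].d * qy[b].d * (float) isum;  // both scales applied once per block
    }
    return sum;
}
```

Path 2 pays for an extra pass over y but keeps the inner loop in integer arithmetic, while path 1 skips the extra pass and avoids the second rounding error at the cost of float multiplies in the inner loop.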
On my Mac, 1. is faster than 2. (~55 us vs ~75 us). On the x86_64 CPU that I have available (Ryzen 7950X), 1. is somewhat slower than the AVX2 implementation of 2. (~50 us vs ~35 us).
On both CPUs, the direct product 1., as implemented in the dot() function in this POC, is much faster than the scalar version of 2. from ggml (~15X faster on the Ryzen 7950X and ~6X faster on the Mac).
I think that with some ARM_NEON or AVX2 magic one should be able to further speed up 1.
To use it, run make -j and then e.g. ./vdot 100 to measure 100 dot products with the SIMD-ified ggml functions, or ./vdot 100 1 to measure the scalar ggml functions instead.
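For readers who want to reproduce this kind of measurement, here is a minimal sketch of a repeat-and-summarize timing loop. It is not the code from this PR: TimingStats, time_repeated and summarize are hypothetical names, and the spread is assumed to be the population standard deviation of the per-iteration times, which may not be how vdot computes the spread it reports.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Summary of a set of per-iteration timings, all in microseconds.
struct TimingStats {
    double mean_us;
    double sigma_us;  // population standard deviation (assumption of this sketch)
    double max_us;
};

// Run work() iters times and return the elapsed time of each run in microseconds.
template <typename F>
std::vector<double> time_repeated(int iters, F && work) {
    assert(iters > 0);
    std::vector<double> t(iters);
    for (int i = 0; i < iters; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        t[i] = std::chrono::duration<double, std::micro>(t1 - t0).count();
    }
    return t;
}

// Reduce the samples to mean, spread and maximum.
TimingStats summarize(const std::vector<double> & t) {
    assert(!t.empty());
    double sum = 0.0, sum2 = 0.0, mx = 0.0;
    for (double v : t) {
        sum  += v;
        sum2 += v * v;
        mx    = std::max(mx, v);
    }
    const double n    = (double) t.size();
    const double mean = sum / n;
    const double var  = std::max(0.0, sum2 / n - mean * mean);  // guard against rounding below zero
    return {mean, std::sqrt(var), mx};
}
```

A caller would wrap the dot product in a lambda, for example summarize(time_repeated(100, [&] { result = dot(n, q, y); })), where dot, q and y stand in for whatever is being measured. The result should be stored somewhere observable so that the compiler cannot remove the work.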
Added a comparison for Q4_1 quantization. Here, the direct product 1. is faster than 2. for both ARM_NEON and AVX2. On my Mac I get ~69 us for 1. and ~121 us for 2. On the Ryzen 7950X I measured ~60 us for 1. and ~62 us for 2. In any case, implemented as in this POC, the dot product of Q4_1 quantized values is only marginally slower (~20%) than Q4_0.
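One plausible reason the Q4_1 direct product costs only a little more: if each value in a block is reconstructed as d * q + m, then the block's contribution to the dot product is d * sum(q_i * y_i) + m * sum(y_i), so the only extra work is a running sum of y. The sketch below illustrates this under assumptions: BlockQ4M, quantize_q4_min and dot_direct_q4_min are hypothetical names, and the layout (32 values per block, a float scale d and a float minimum m) is a simplification that is not copied from ggml.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int QK = 32;  // values per block (assumed for this sketch)

// Hypothetical 4-bit block with scale and minimum: x ~= d * q + m, q in [0, 15].
struct BlockQ4M {
    float   d;           // per-block scale
    float   m;           // per-block minimum
    uint8_t qs[QK / 2];  // two 4-bit values per byte
};

// Quantize n floats (n must be a multiple of QK) into scale+minimum blocks.
std::vector<BlockQ4M> quantize_q4_min(const float * x, int n) {
    assert(n % QK == 0);
    std::vector<BlockQ4M> out(n / QK);
    for (int b = 0; b < n / QK; ++b) {
        const float * xb = x + b * QK;
        float lo = xb[0], hi = xb[0];
        for (int i = 1; i < QK; ++i) {
            lo = std::min(lo, xb[i]);
            hi = std::max(hi, xb[i]);
        }
        const float d  = (hi - lo) / 15.0f;  // map [lo, hi] to [0, 15]
        const float id = d > 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        out[b].m = lo;
        for (int i = 0; i < QK; i += 2) {
            const int q0 = std::min(15, (int) std::lround((xb[i]     - lo) * id));
            const int q1 = std::min(15, (int) std::lround((xb[i + 1] - lo) * id));
            out[b].qs[i / 2] = (uint8_t) (q0 | (q1 << 4));
        }
    }
    return out;
}

// Direct dot product: sum over blocks of d * sum(q_i * y_i) + m * sum(y_i).
float dot_direct_q4_min(const std::vector<BlockQ4M> & q, const float * y) {
    float sum = 0.0f;
    for (std::size_t b = 0; b < q.size(); ++b) {
        const float * yb = y + b * QK;
        float sqy = 0.0f;  // sum of q_i * y_i
        float sy  = 0.0f;  // sum of y_i, the only extra work compared to a scale-only format
        for (int i = 0; i < QK; i += 2) {
            const uint8_t v = q[b].qs[i / 2];
            sqy += (float) (v & 0x0F) * yb[i] + (float) (v >> 4) * yb[i + 1];
            sy  += yb[i] + yb[i + 1];
        }
        sum += q[b].d * sqy + q[b].m * sy;
    }
    return sum;
}
```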
On my Core i3-8100 (AVX2):
$ ./vdot 100
<dot> = -74.2272, -73.9193
time = 128.407 +/- 4.38483 us. maxt = 150.281 us
timeq = 106.679 +/- 4.16175 us. maxt = 126.484 us
Please consider putting it in examples/benchmark instead of creating a new folder.
@sw Thank you for the measurement. Yes, of course, I can move it to examples. My thinking was that this is a POC, so it is better to have a dedicated folder for this (and possibly future) POCs before some of them become "examples".
Yes, examples is maybe not a great name, but it already contains various bits and pieces like your program.