
Adding a simple program to measure speed of dot products

Open ikawrakow opened this issue 1 year ago • 3 comments

I was surprised by the belief that for the dot product x * y, where x holds quantized model weights and y contains floating point values, it is faster to quantize y and perform the dot product using the quantized y values (accepting the associated loss in precision) than to compute x * y directly. So I had to try it myself. This PR adds a simple program that measures the time it takes to perform the dot product between vectors holding 1 << 18 values. I picked a relatively large vector size so as not to get involved with the science of accurately measuring elapsed time for short-lasting operations.

Basically, we fill two vectors x and y with random values and quantize x into q. We then measure the time for

  1. Computing d = q * y directly
  2. Computing y' = quantize(y); d = q * y'

For 2. we use the vectorized (SIMD-ified) functions from ggml (or, if requested by a command line argument, the corresponding scalar functions from ggml).

On my Mac, 1. is faster than 2. (~55 us vs ~75 us). On the x86_64 CPU I have available (a Ryzen 7950X), 1. is somewhat slower than the AVX2 implementation of 2. (~50 us vs ~35 us).

On both CPUs, the direct product 1., as implemented in the dot() function in this POC, is much faster than the scalar version of 2. from ggml (~15X faster on the Ryzen 7950X and ~6X faster on the Mac).

I think that with some ARM_NEON or AVX2 magic one should be able to further speed up 1.

To use it, run make -j and then e.g. ./vdot 100 to measure 100 dot products with the SIMD-ified ggml functions, or ./vdot 100 1 to measure the scalar ggml functions instead.

Added a comparison for Q4_1 quantization. Here, the direct product 1. is faster than 2. for both ARM_NEON and AVX2. On my Mac I get ~69 us for 1. and ~121 us for 2. On the Ryzen 7950X I measured ~60 us for 1. and ~62 us for 2. In any case, implemented as in this POC, the dot product of Q4_1-quantized values is only marginally slower (~20%) than Q4_0.

ikawrakow avatar Apr 18 '23 14:04 ikawrakow

On my Core i3-8100 (AVX2):

$ ./vdot 100
<dot> = -74.2272, -73.9193
time = 128.407 +/- 4.38483 us. maxt = 150.281 us
timeq = 106.679 +/- 4.16175 us. maxt = 126.484 us

Please consider putting it in examples/benchmark instead of creating a new folder.

sw avatar Apr 18 '23 15:04 sw

@sw Thank you for the measurement. Yes, of course, I can move it to examples. My thinking was that since this is a POC, it would be better to have a folder for this (and possibly future) POCs before some of them become "examples".

ikawrakow avatar Apr 18 '23 15:04 ikawrakow

Yes, examples is maybe not a great name, but it already contains various bits and pieces like your program.

sw avatar Apr 18 '23 15:04 sw