llama.cpp
Xeon Phi (Knights Corner) Support.
Most of the gains come from an assembly implementation of the Q5_K · Q8_K dot-product code, written in IMCI assembly.
Goes from 0.18 tokens per second on Mistral 7B Instruct (Q5_K) to 1.2 tokens per second.
> goes from a token every 0.18 seconds on mistral 7B instruct to a token every 0.82 seconds.

Are you showing a performance regression? Or are the figures flipped?
It's a four-time performance increase; my apologies, that's a confusing way to say things.
I meant: llama_print_timings: eval time = 29558.98 ms / 24 runs (1231.62 ms per token, 0.81 tokens per second)
up from 0.18 tokens per second.
There's more room for improvement, but I thought I'd submit early, to get feedback.
Hi, what hardware is used? I'm not familiar with this series.
https://www.techpowerup.com/gpu-specs/xeon-phi-5110p.c2480
https://en.m.wikipedia.org/wiki/Xeon_Phi
That's exactly it, I'm using a 5110P at the moment.
That said, that article is really full of errors; this is a 60-core, quad-threaded beast (240 cores in /proc/cpuinfo) that runs IMCI, the predecessor of the AVX-512 instruction set. It carries 8 GB of GDDR5.
> goes from 0.18 tokens per second on mistral 7B instruct (Q5_K) to 0.82 tokens per second.

How many threads is that with? Since Xeon Phi has 4 threads per core, it could be interesting to experiment with thread counts and see what that changes.
I'll play with that a bit. :)
Now runs at 1.2 tokens per second.