
Xeon Phi (Knights Corner) Support.

Open · julialongtin opened this issue 1 year ago • 8 comments

Most of the gains come from an implementation of the Q5_K × Q8_K dot product written in IMCI assembly.
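
For anyone curious what's being optimized: below is a deliberately simplified scalar sketch of the kind of block-quantized dot product this kernel replaces. The struct layouts and names are hypothetical stand-ins (the real `block_q5_K`/`block_q8_K` in ggml pack 5-bit quants and sub-block scales far more compactly); it's only meant to show the shape of the hot loop that the IMCI version vectorizes in 512-bit registers.

```c
#include <stdint.h>
#include <stdio.h>

#define QK 256  /* illustrative super-block size (ggml's QK_K is also 256) */

/* Hypothetical, simplified block layouts -- the real ggml structs are
 * packed far more tightly and carry per-sub-block scales. */
typedef struct { float d; int8_t qs[QK]; } blk_x;
typedef struct { float d; int8_t qs[QK]; } blk_y;

/* Scalar reference for a block-quantized dot product: multiply the int8
 * quants, accumulate in int32, then rescale by the two block scales.
 * A SIMD kernel processes this inner loop 16 lanes at a time. */
static float vec_dot_sketch(int n, const blk_x *x, const blk_y *y) {
    float sum = 0.0f;
    for (int i = 0; i < n / QK; ++i) {
        int32_t isum = 0;
        for (int j = 0; j < QK; ++j) {
            isum += (int32_t) x[i].qs[j] * (int32_t) y[i].qs[j];
        }
        sum += x[i].d * y[i].d * (float) isum;
    }
    return sum;
}

int main(void) {
    blk_x x = { 0.5f,  {0} };
    blk_y y = { 0.25f, {0} };
    x.qs[0] = 10;
    y.qs[0] = 20;  /* one nonzero pair: 10 * 20 = 200 */
    printf("%f\n", vec_dot_sketch(QK, &x, &y));  /* 0.5 * 0.25 * 200 = 25 */
    return 0;
}
```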

Token generation goes from 0.18 tokens per second on Mistral 7B Instruct (Q5_K) to 1.2 tokens per second.

julialongtin · Apr 02 '24 17:04

> goes from a token every 0.18 seconds on mistral 7B instruct to a token every 0.82 seconds.

Are you showing a performance regression? Or are the figures flipped?

Titaniumtown · Apr 02 '24 17:04

> goes from a token every 0.18 seconds on mistral 7B instruct to a token every 0.82 seconds.

> Are you showing a performance regression? Or are the figures flipped?

It's a fourfold performance increase. My apologies, that's a confusing way to say things. I meant:

`llama_print_timings: eval time = 29558.98 ms / 24 runs ( 1231.62 ms per token, 0.81 tokens per second)`

up from 0.18 tokens per second.
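
(To spell the conversion out: 29558.98 ms / 24 tokens ≈ 1231.62 ms per token, and 1000 / 1231.62 ≈ 0.81 tokens per second. The earlier 0.18 tokens per second corresponds to roughly 5.5 seconds per token.)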

There's more room for improvement, but I thought I'd submit early to get feedback.

julialongtin · Apr 02 '24 17:04

Hi, what hardware are you using? I'm not familiar with this series.

https://www.techpowerup.com/gpu-specs/xeon-phi-5110p.c2480

https://en.m.wikipedia.org/wiki/Xeon_Phi

BarfingLemurs · Apr 02 '24 17:04

> goes from a token every 0.18 seconds on mistral 7B instruct to a token every 0.82 seconds.
>
> Are you showing a performance regression? Or are the figures flipped?
>
> No? It's a fourfold performance increase. There's more room for improvement, but I thought I'd submit early to get feedback.

> Hi, what hardware are you using? I'm not familiar with this series.
> https://www.techpowerup.com/gpu-specs/xeon-phi-5110p.c2480
> https://en.m.wikipedia.org/wiki/Xeon_Phi

That's exactly it; I'm using a 5110P at the moment.

julialongtin · Apr 02 '24 17:04

> goes from a token every 0.18 seconds on mistral 7B instruct to a token every 0.82 seconds.
>
> Are you showing a performance regression? Or are the figures flipped?
>
> No? It's a fourfold performance increase. There's more room for improvement, but I thought I'd submit early to get feedback.

> Hi, what hardware are you using? I'm not familiar with this series. https://www.techpowerup.com/gpu-specs/xeon-phi-5110p.c2480 https://en.m.wikipedia.org/wiki/Xeon_Phi

> That's exactly it; I'm using a 5110P at the moment.

That said, that article is really full of errors: this is a 60-core, quad-threaded beast (240 logical cores in /proc/cpuinfo) that runs IMCI, the predecessor of the AVX-512 instruction set. It carries 8 GB of GDDR5.
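
(60 cores × 4 hardware threads = 240 logical CPUs; on the card's embedded Linux, something like `grep -c ^processor /proc/cpuinfo` is what reports that 240.)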

julialongtin · Apr 02 '24 17:04

> goes from 0.18 tokens per second on mistral 7B instruct (Q5_K) to 0.82 tokens per second.

How many threads is that with? Since Xeon Phi has 4 threads per core, it could be interesting to experiment with thread counts and see what that changes.
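
For instance (assuming the stock `main` example binary, whose `-t`/`--threads` flag controls this), running the same prompt with `-t 60`, `-t 120`, `-t 180`, and `-t 240` would compare one, two, three, and four threads per core on the 60-core card.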

AutonomicPerfectionist · Apr 02 '24 21:04

> goes from 0.18 tokens per second on mistral 7B instruct (Q5_K) to 0.82 tokens per second.
>
> How many threads is that with? Since Xeon Phi has 4 threads per core, it could be interesting to experiment with thread counts and see what that changes.

I'll play with that a bit. :)

julialongtin · Apr 02 '24 22:04

Now runs at 1.2 tokens per second.

julialongtin · May 13 '24 19:05