llama3.java

Improve matrix multiplication using the Java Vector API on Apple silicon.

Open mukel opened this issue 7 months ago • 1 comment

llama.cpp runs incredibly fast on Apple silicon. I ran a CPU-only build, and it gets close to the memory-bandwidth limit, e.g. 28 tokens/s on an M3 Pro. llama3.java seems rather slow on Apple silicon: Q8_0 runs only as fast as Q4_0, at about 4 tokens/s, so something is off. On PC it is within ~10% of llama.cpp.
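
As a rough sanity check on why equal Q8_0 and Q4_0 speeds look suspicious, here is a minimal sketch of a bandwidth-bound estimate. It assumes each generated token streams all weights from memory once. The parameter count (8e9) and the bandwidth figure (150 GB/s) are illustrative assumptions, not measurements from this issue, and `BandwidthEstimate` is a hypothetical helper, not part of llama3.java. The block sizes follow the ggml layouts: 32 weights per block plus a 2-byte fp16 scale.

```java
public class BandwidthEstimate {
    // ggml quantization blocks hold 32 weights plus one 2-byte fp16 scale.
    static final int BLOCK_WEIGHTS = 32;
    static final int Q4_0_BLOCK_BYTES = 18; // 2-byte scale + 16 bytes of 4-bit weights
    static final int Q8_0_BLOCK_BYTES = 34; // 2-byte scale + 32 bytes of 8-bit weights

    // Total bytes occupied by the quantized weights.
    static double modelBytes(double paramCount, int blockBytes) {
        return paramCount * blockBytes / BLOCK_WEIGHTS;
    }

    // Upper bound on tokens/s if every token reads all weights once.
    static double maxTokensPerSecond(double bandwidthBytesPerSec, double modelBytes) {
        return bandwidthBytesPerSec / modelBytes;
    }

    public static void main(String[] args) {
        double params = 8e9;      // ASSUMPTION: illustrative parameter count
        double bandwidth = 150e9; // ASSUMPTION: illustrative memory bandwidth in bytes/s

        double q4 = maxTokensPerSecond(bandwidth, modelBytes(params, Q4_0_BLOCK_BYTES));
        double q8 = maxTokensPerSecond(bandwidth, modelBytes(params, Q8_0_BLOCK_BYTES));

        System.out.printf("Q4_0 bound: %.1f tokens/s%n", q4);
        System.out.printf("Q8_0 bound: %.1f tokens/s%n", q8);
        System.out.printf("Q4_0 / Q8_0 ratio: %.2f%n", q4 / q8);
    }
}
```

Under these assumptions the ratio depends only on the block sizes (34/18, roughly 1.9), so a bandwidth-bound implementation should run Q4_0 close to twice as fast as Q8_0. Both formats landing at the same speed suggests the matrix multiplication is compute-bound rather than memory-bound on Apple silicon.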

mukel, Jul 21 '24 17:07