llama3.java
Improve matrix multiplication using the Java Vector API on Apple silicon.
llama.cpp runs incredibly fast on Apple silicon: a pure-CPU build is close to memory-bandwidth bound, e.g. 28 tokens/s on an M3 Pro. llama3.java, by contrast, seems rather slow on Apple silicon: Q8_0 runs only as fast as Q4_0, at about 4 tokens/s, so something is off. On PC, llama3.java is within ~10% of llama.cpp.
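As a starting point for the Vector API work, here is a minimal sketch of the inner kernel of a matmul (a vectorized dot product with FMA and a scalar tail). The class and method names are hypothetical, not from llama3.java, and it must be compiled and run with `--add-modules jdk.incubator.vector`:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example class, not part of llama3.java.
public class VectorDot {
    // SPECIES_PREFERRED picks the widest vector shape the hardware supports
    // (128-bit NEON on Apple silicon, i.e. 4 float lanes).
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Dot product of two equal-length float arrays.
    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // fused multiply-add, accumulated per lane
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) { // scalar tail for the remainder
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = new float[100];
        float[] b = new float[100];
        for (int i = 0; i < a.length; i++) { a[i] = 1f; b[i] = 2f; }
        System.out.println(dot(a, b)); // prints 200.0
    }
}
```

For the quantized paths (Q8_0/Q4_0) the expensive part is dequantization inside this loop, so a real fix would likely need to vectorize the unpacking of the quantized blocks as well, not just the float accumulation.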