llm
Performance of inference with 65B model on high-end CPU?
How well does this model perform on a CPU? Are there benchmarks for running some of the bigger models (like LLaMA-65B) on a CPU?
I don't have enough RAM to test, but I'd suggest looking at the performance numbers for llama.cpp - we should be about on par (barring any improvements that we haven't kept up with).
I don't have hard numbers, but I get somewhere under one token/sec on a 7950x with alpaca-lora-65b-ggml-q4_0. You probably won't find models larger than 30B to be practical.
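
For a rough sense of why 65B is mostly a RAM question, here's a back-of-the-envelope sketch (not from this thread) assuming GGML's q4_0 layout of roughly 18 bytes per block of 32 weights; actual file sizes and runtime overhead (context/KV cache, scratch buffers) will add to this:

```rust
// Rough RAM estimate for q4_0-quantized weights.
// Assumption: q4_0 packs 32 weights per block as 16 bytes of 4-bit
// values plus a 2-byte f16 scale, i.e. ~18 bytes per 32 weights.
fn q4_0_bytes(n_params: u64) -> u64 {
    (n_params / 32) * 18
}

fn main() {
    let models = [
        ("7B", 7_000_000_000u64),
        ("30B", 30_000_000_000),
        ("65B", 65_000_000_000),
    ];
    for (name, params) in models {
        let gib = q4_0_bytes(params) as f64 / (1024.0 * 1024.0 * 1024.0);
        // Weights only; context and scratch memory come on top of this.
        println!("{name}: ~{gib:.1} GiB of weights");
    }
}
```

That works out to roughly 34 GiB of weights for 65B at q4_0 (versus ~16 GiB for 30B), which lines up with why 30B tends to be the practical ceiling on typical desktop RAM.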