llama.cpp
Not an issue, but what depends on the number of threads?
I've been testing your code with 1 to 8 threads, and the output is different every time. The speed does not scale with the number of threads. 4 threads can perform much better than 1, whereas 8 threads should supposedly give an even better result. However, the same prompt can give equally excellent output about three times faster with 4 threads than with 8. When I use 8 threads (my maximum on an M1), all of my CPU resources are in use, but it doesn't improve speed at all (it seems to run slower) and apparently doesn't improve quality either. Am I wrong? Please correct me if I'm mistaken. Maybe there is some best speed/quality option and I simply failed to figure out how to use it?
The code becomes memory-bound somewhere between 8 and 16 threads on my 16-core system. I suspect your system has 4 cores / 8 hyperthreads, and hyperthreading isn't helping your performance.
The output may subtly change with different numbers of threads due to the multithreading architecture of the code, but the average quality shouldn't.
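One plausible mechanism for this, offered as an illustration rather than a description of what llama.cpp actually does: floating-point addition is not associative, so if a sum is split into per-thread partial sums, the rounding changes with the number of chunks. The function names below are hypothetical, and the "threads" are simulated with plain contiguous chunks.

```python
def sequential_sum(xs):
    """Add floats left to right with a plain accumulator."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def chunked_sum(xs, n_threads):
    """Simulate n_threads workers: each sums one contiguous chunk,
    then the partial sums are combined in order.
    (Hypothetical splitting scheme, chosen only for illustration.)"""
    chunk = -(-len(xs) // n_threads)  # ceiling division
    partials = [sequential_sum(xs[i:i + chunk])
                for i in range(0, len(xs), chunk)]
    return sequential_sum(partials)


# The exact sum is 2.0, but 1.0 is below the rounding step of 1e16,
# so the result depends on how the additions are grouped.
values = [1e16, 1.0, -1e16, 1.0]
for n in (1, 2, 4):
    print(n, chunked_sum(values, n))  # 1 -> 1.0, 2 -> 0.0, 4 -> 1.0
```

Differences in a real model are far smaller than in this contrived input, but once sampling picks a different token the rest of the text diverges, which would be consistent with output that changes while average quality stays the same.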
The M1 definitely has 8 physical cores (and I believe it has fairly high memory bandwidth, but I may be wrong). It could have something to do with 4 of those cores being lower-performance efficiency cores, but spreading the workload across more cores should still improve performance.
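A toy model of how the efficiency cores mentioned above could hurt, under two loud assumptions: the work is split evenly across threads with no rebalancing, and an efficiency core runs at a quarter of the speed of a performance core. Both are guesses made for illustration, not measurements or confirmed llama.cpp behaviour; `wall_time` and `M1_LIKE` are hypothetical names.

```python
def wall_time(n_threads, core_speeds, total_work=1.0):
    """Time for an evenly split job: every thread gets the same share,
    and the job finishes when the slowest core in use is done.
    Threads are placed on the fastest cores first."""
    used = sorted(core_speeds, reverse=True)[:n_threads]
    share = total_work / n_threads
    return max(share / speed for speed in used)


# Hypothetical 4 performance + 4 efficiency cores; the 0.25 ratio is assumed.
M1_LIKE = [1.0] * 4 + [0.25] * 4

for n in (1, 4, 8):
    print(n, wall_time(n, M1_LIKE))  # 1 -> 1.0, 4 -> 0.25, 8 -> 0.5
```

Under these assumptions the fast cores sit idle waiting for the slow ones, so 8 threads finish later than 4. With a smaller speed gap, or with work that is rebalanced dynamically, the extra cores would help instead.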
Going from 4 to 7-8 threads helps, but only marginally. Maybe it would help more if the threads were pinned to cores.