runomp on Mac M1 Max is slower than runfast

Open · tairov opened this issue 8 months ago · 10 comments

I recently ran extensive benchmarks of llama2.c ports and found that the C version built with runfast (single-threaded) runs faster than the one built with runomp (multi-threaded):

make runomp CC=/opt/homebrew/opt/llvm/bin/clang; OMP_NUM_THREADS=5 ./run ../models/stories15M.bin -t 0.0 -n 256
...
achieved tok/s: 529.976019

VS

make runfast; ./run ../models/stories15M.bin -t 0.0 -n 256
...
achieved tok/s: 657.738095
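
To check whether the gap depends on the thread count, the runomp build could be rerun across several values of OMP_NUM_THREADS. This is only a rough sketch: the `sweep_threads` helper and the list of thread counts are hypothetical and not part of llama2.c. It assumes the binary prints an `achieved tok/s` line as in the runs above.

```shell
# Hypothetical helper (not part of llama2.c): run the same benchmark with
# several OpenMP thread counts and print the reported tok/s for each.
# Usage: sweep_threads <binary> <model>
sweep_threads() {
    bin="$1"
    model="$2"
    # The thread counts below are an arbitrary assumption; adjust to the machine.
    for n in 1 2 4 5 8; do
        # Merge stderr into stdout so the tok/s line is captured either way.
        result=$(OMP_NUM_THREADS="$n" "$bin" "$model" -t 0.0 -n 256 2>&1 \
            | grep 'achieved tok/s' || true)
        echo "threads=$n ${result:-no tok/s line found}"
    done
}
```

For example, `sweep_threads ./run ../models/stories15M.bin` would print one line per thread count, which makes it easier to see whether any setting beats the runfast build.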

Does anyone have insights into why this might be happening?

full benchmark report

tairov, Oct 21 '23 06:10