llama2.c
runomp on Mac M1 Max is slower than runfast
I recently ran extensive benchmarks of llama2.c ports and found that the C version built with runfast (single-threaded) runs faster than the one built with runomp (multi-threaded):
make runomp CC=/opt/homebrew/opt/llvm/bin/clang; OMP_NUM_THREADS=5 ./run ../models/stories15M.bin -t 0.0 -n 256
...
achieved tok/s: 529.976019
VS
make runfast; ./run ../models/stories15M.bin -t 0.0 -n 256
...
achieved tok/s: 657.738095
Does anyone have insights into why this might be happening?
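To help narrow this down, here is a minimal sketch for sweeping the thread count with the same command as above. The helper names `extract_tok_s` and `sweep_threads` are hypothetical and not part of llama2.c. The sketch assumes `./run` was built with `make runomp` and that the "achieved tok/s" line may be printed to stderr, hence the `2>&1`.

```shell
#!/bin/sh
# Hypothetical helpers for comparing thread counts; not part of llama2.c.

# Extract the number from the "achieved tok/s: N" line printed by ./run.
extract_tok_s() {
    sed -n 's/^achieved tok\/s: //p'
}

# Run the binary once per thread count given as an argument and
# print "threads tok/s" for each run.
# Assumes ./run comes from `make runomp` and the model path used above.
sweep_threads() {
    for n in "$@"; do
        # Merge stderr into stdout in case the timing line is written there.
        tok_s=$(OMP_NUM_THREADS="$n" ./run ../models/stories15M.bin -t 0.0 -n 256 2>&1 | extract_tok_s)
        printf '%s %s\n' "$n" "$tok_s"
    done
}
```

For example, `sweep_threads 1 2 4 5 8` would show whether throughput changes with the thread count at all, and how the 1-thread runomp build compares with runfast.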