Clemens Akens
When using setup-zig on macOS ARM runners, it currently downloads the x64 (Intel) version of Zig by default. I believe the changes proposed in this PR should resolve the issue...
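To illustrate the kind of mapping involved, here is a minimal sketch in Python (not the action's actual code) that translates a host machine string into the architecture name Zig uses in its target triples. The function name `zig_arch` and the alias table are hypothetical.

```python
import platform

# Hypothetical alias table: host machine strings on the left,
# Zig architecture names on the right.
_ARCH_ALIASES = {
    "arm64": "aarch64",    # reported by macOS on Apple Silicon
    "aarch64": "aarch64",
    "x86_64": "x86_64",
    "amd64": "x86_64",
}


def zig_arch(machine=None):
    """Return the Zig architecture name for a machine string.

    Falls back to the current host when no string is given.
    """
    if machine is None:
        machine = platform.machine()
    key = machine.lower()
    if key not in _ARCH_ALIASES:
        raise ValueError(f"unsupported architecture: {machine}")
    return _ARCH_ALIASES[key]
```

With a lookup like this, an ARM runner would resolve to `aarch64` instead of silently defaulting to `x86_64`.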
I haven't tried it myself, but reducing the group size might improve accuracy. It is currently fixed at 64: https://github.com/karpathy/llama2.c/blob/d9862069e7ef665fe6309e3c17398ded2f121bf5/export.py#L182 A group size of...
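The intuition can be sketched with a toy example. It assumes symmetric int8 quantization with one scale per group (scale = largest absolute value in the group divided by 127). The data is made up, and the helper names are hypothetical rather than taken from `export.py`.

```python
def quantize_dequantize(weights, group_size):
    """Quantize each group to int8 with its own scale, then dequantize."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # One scale per group; guard against an all-zero group.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        for w in group:
            q = max(-127, min(127, round(w / scale)))
            restored.append(q * scale)
    return restored


def mean_abs_error(weights, group_size):
    """Average absolute round-trip error for a given group size."""
    restored = quantize_dequantize(weights, group_size)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


# Made-up data: 63 small weights plus a single large outlier.
weights = [0.001 * (i + 1) for i in range(63)] + [10.0]

error_64 = mean_abs_error(weights, 64)  # the outlier sets the scale for all 64
error_8 = mean_abs_error(weights, 8)    # the outlier only affects its own group
```

In this contrived case the smaller group size gives a lower mean error, because the outlier no longer dictates the scale for every other weight. Real model weights may behave differently, and smaller groups also mean more scales to store.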
I recently incorporated multithreading into my [Zig port](https://github.com/clebert/llama2.zig) of this project and made some relevant findings. Essentially, the overhead associated with initializing and terminating multiple threads per matrix-vector multiplication can...
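One common way to avoid paying thread startup and teardown costs on every multiplication is to create the workers once and reuse them. The sketch below shows only that structure in Python. It is not the Zig port's code, the class name `PooledMatVec` is hypothetical, and CPython threads will not actually speed up pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(matrix, vector, start, end):
    """Compute the dot products for rows start..end-1."""
    return [
        sum(a * b for a, b in zip(matrix[i], vector))
        for i in range(start, end)
    ]


class PooledMatVec:
    """Matrix-vector multiplication on a pool of long-lived workers."""

    def __init__(self, worker_count):
        self.worker_count = worker_count
        # Workers are created once here, not once per multiplication.
        self.pool = ThreadPoolExecutor(max_workers=worker_count)

    def multiply(self, matrix, vector):
        rows = len(matrix)
        # Split the rows into roughly equal contiguous chunks.
        chunk = max(1, (rows + self.worker_count - 1) // self.worker_count)
        futures = [
            self.pool.submit(matvec_rows, matrix, vector,
                             start, min(start + chunk, rows))
            for start in range(0, rows, chunk)
        ]
        result = []
        for future in futures:  # collect in submission order
            result.extend(future.result())
        return result

    def close(self):
        self.pool.shutdown()


matvec = PooledMatVec(worker_count=4)
result = matvec.multiply([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1.0, 1.0])
matvec.close()
```

The same pool can serve every `multiply` call during inference, so the setup cost is paid once rather than per matrix.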
Yes, in single-threaded mode. But this was my best run ever measured. Normally, it fluctuates between 680 and 700 tokens per second. Why there is this big variance, I don't...
The use of `@Vector` (SIMD) had the biggest effect. Without SIMD, you couldn't get anywhere near these results. Aligning the vectors to the cache line, on the other hand, did...
I forgot to mention one important optimization: `@setFloatMode(.Optimized)`. It has about the same effect as setting `-ffast-math` in the C version.
@tairov I have conducted extensive benchmarks with my improved Zig implementation, using an Apple M2 Pro equipped with 12 cores and an Apple M1 Pro equipped with 10 cores. [**Benchmark...
> Hey @clebert, I really appreciate you taking the time to improve `llama2.zig`, I think the ziglang community & maintainers might get valuable insights from it. Thank you 👍🏻...
Thanks for the suggestion, I will try it out.