Kirat Pandya
See #59. Working example there.
Working sample: https://gist.github.com/kiratp/dfcbcf0aa713a277d5d53b06d9db9308

First off: thanks to everyone who's built and shared working snippets over time. The gist above works 100% reliably against a LetsEncrypt TLS backend. Hopefully this helps...
Here is the scaling on an M1 Max (7B int4, maxed-out GPU, 64 GB RAM):
> Oddly, the only thing that ended up working for me was explicitly setting the number of threads to a _substantially_ lower number than what is available on my system....
M1 Max, maxed-out GPU, 64 GB. Note that M1 Pro vs. Max matters beyond core count here, since memory bandwidth doubles (200 GB/s -> 400 GB/s). 10 or so...
Threadripper 3990x with 256 GB. You can see where the memory bandwidth/contention becomes the bottleneck:

```
Running with 32 threads...
32 threads | run 1/3 | current token time 21.28...
```
The relevant bit:

> Adding a third thread there’s a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s...
It's been a while since I've written any C, so here is a Rust program instead: https://github.com/kiratp/memory-bandwidth Results below are from an M1 Max (64 GB) with a bunch of other stuff running. I would...
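For anyone curious what such a probe looks like, here is a minimal sketch of the idea in Rust (NOT the linked repo's actual code; the buffer size, pass count, and default thread count are arbitrary assumptions). Each thread streams over its own buffer, sized well past any cache, and the program reports aggregate read bandwidth:

```rust
use std::thread;
use std::time::Instant;

const BUF_BYTES: usize = 64 * 1024 * 1024; // 64 MiB per thread (assumption)
const PASSES: usize = 4;

fn stream_sum(buf: &[u64]) -> u64 {
    // Sequential read of the whole buffer, PASSES times; the running sum
    // keeps the compiler from eliminating the loads.
    let mut acc = 0u64;
    for _ in 0..PASSES {
        for &x in buf {
            acc = acc.wrapping_add(x);
        }
    }
    acc
}

fn main() {
    let threads: usize = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(4);
    // Allocate (and fault in) the buffers before starting the clock.
    let bufs: Vec<Vec<u64>> = (0..threads)
        .map(|_| vec![1u64; BUF_BYTES / 8])
        .collect();
    let start = Instant::now();
    let handles: Vec<_> = bufs
        .into_iter()
        .map(|buf| thread::spawn(move || stream_sum(&buf)))
        .collect();
    let mut sink = 0u64;
    for h in handles {
        sink = sink.wrapping_add(h.join().unwrap());
    }
    let secs = start.elapsed().as_secs_f64();
    let gbs = (threads * PASSES * BUF_BYTES) as f64 / secs / 1e9;
    println!("{threads} threads: {gbs:.1} GB/s (checksum {sink})");
}
```

Run it with increasing thread counts and the reported GB/s should flatten out roughly where the comments above see token times stop improving.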
Alright, so I got GPT-4 to write me a C equivalent. I'm not sure about its quality, but a cursory analysis suggests it's correct, but...
Try setting threads to the physical core count, not the logical thread count: `-t 8`. ggml/llama.cpp is memory-bandwidth bound; there is a lot of open discussion about this...
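A rough way to derive that `-t` value programmatically (a sketch under a stated assumption, not llama.cpp code): the standard library only reports logical CPUs, and on SMT x86 that is typically 2x the physical core count, so halving it approximates the right number; Apple Silicon has no SMT, so there logical already equals physical.

```rust
use std::thread;

// Heuristic (assumption): SMT x86 exposes 2 logical threads per physical
// core, so halve the logical count; clamp to at least 1.
fn physical_guess(logical: usize) -> usize {
    (logical / 2).max(1)
}

fn main() {
    // Logical CPUs visible to this process.
    let logical = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("suggested llama.cpp flag: -t {}", physical_guess(logical));
}
```

On a 16-thread/8-core x86 box this suggests `-t 8`; treat it as a starting point and tune from there, since the scaling numbers above show returns diminishing well before the logical thread count.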