Kirat Pandya
See #59. Working example there.
Working sample: https://gist.github.com/kiratp/dfcbcf0aa713a277d5d53b06d9db9308

First off: thanks to everyone who's built and shared working snippets over time. The gist above works 100% reliably against a LetsEncrypt TLS backend. Hopefully this helps...
Here is the scaling on an M1 Max (7B int4, maxed-out GPU, 64 GB RAM):
> Oddly, the only thing that ended up working for me was explicitly setting the number of threads to a _substantially_ lower number than what is available on my system....
M1 Max, maxed-out GPU, 64 GB. Note that M1 Pro vs. Max matters beyond core count here, since memory bandwidth doubles (200 GB/s -> 400 GB/s). 10 or so...
Threadripper 3990x with 256 GB. You can see where the memory bandwidth/contention becomes the bottleneck:

```
Running with 32 threads...
32 threads | run 1/3 | current token time 21.28...
```
The relevant bit:

> Adding a third thread there’s a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s...
It's been a while since I've written any C, so here is a Rust program instead: https://github.com/kiratp/memory-bandwidth Results below are from an M1 Max (64 GB) with a bunch of other stuff running. I would...
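For anyone curious what such a probe looks like, here is a minimal sketch of the idea in Rust (NOT the linked repo's actual code; the buffer size, pass count, and default thread count are arbitrary assumptions). Each thread streams over its own buffer, sized well past any cache, and the program reports aggregate read bandwidth:

```rust
use std::thread;
use std::time::Instant;

const BUF_BYTES: usize = 64 * 1024 * 1024; // 64 MiB per thread (assumption)
const PASSES: usize = 4;

fn stream_sum(buf: &[u64]) -> u64 {
    // Sequential read of the whole buffer, PASSES times; the running sum
    // keeps the compiler from eliminating the loads.
    let mut acc = 0u64;
    for _ in 0..PASSES {
        for &x in buf {
            acc = acc.wrapping_add(x);
        }
    }
    acc
}

fn main() {
    let threads: usize = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(4);
    // Allocate (and fault in) the buffers before starting the clock.
    let bufs: Vec<Vec<u64>> = (0..threads)
        .map(|_| vec![1u64; BUF_BYTES / 8])
        .collect();
    let start = Instant::now();
    let handles: Vec<_> = bufs
        .into_iter()
        .map(|buf| thread::spawn(move || stream_sum(&buf)))
        .collect();
    let mut sink = 0u64;
    for h in handles {
        sink = sink.wrapping_add(h.join().unwrap());
    }
    let secs = start.elapsed().as_secs_f64();
    let gbs = (threads * PASSES * BUF_BYTES) as f64 / secs / 1e9;
    println!("{threads} threads: {gbs:.1} GB/s (checksum {sink})");
}
```

Run it with increasing thread counts and the reported GB/s should flatten out roughly where the comments above see token times stop improving.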
Alright, so I got GPT-4 to write me a C equivalent. I'm not sure about its quality, but a cursory analysis suggests it's correct, but...
Try setting threads to the physical core count, not the logical thread count: `-t 8`. ggml/llama.cpp is memory-bandwidth bound; there is a lot of open discussion about this...
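A rough way to derive that `-t` value programmatically (a sketch under a stated assumption, not llama.cpp code): the standard library only reports logical CPUs, and on SMT x86 that is typically 2x the physical core count, so halving it approximates the right number; Apple Silicon has no SMT, so there logical already equals physical.

```rust
use std::thread;

// Heuristic (assumption): SMT x86 exposes 2 logical threads per physical
// core, so halve the logical count; clamp to at least 1.
fn physical_guess(logical: usize) -> usize {
    (logical / 2).max(1)
}

fn main() {
    // Logical CPUs visible to this process.
    let logical = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("suggested llama.cpp flag: -t {}", physical_guess(logical));
}
```

On a 16-thread/8-core x86 box this suggests `-t 8`; treat it as a starting point and tune from there, since the scaling numbers above show returns diminishing well before the logical thread count.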