Example setups with benchmarks
Users want to know exactly which setups work, how to set them up, and what the benchmarks are.
A simple benchmark we can run uses Mac minis. We have 4 of them, so we can progressively add Mac minis to the cluster and measure tok/sec at each step.
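A minimal sketch of the timing side of that benchmark, independent of how you talk to exo: wrap whatever token stream your client yields in a helper that reports time-to-first-token and tokens/sec after the first token (the two numbers requested below). The function name and interface here are my own suggestion, not part of exo.

```python
import time
from typing import Iterable, Tuple

def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Consume a token stream; return (ttft_s, tok_per_sec, n_tokens).

    tok/sec is computed over tokens *after* the first one, so it is not
    skewed by prompt processing / time-to-first-token.
    """
    start = time.perf_counter()
    first = None
    last = start
    n = 0
    for _ in tokens:
        now = time.perf_counter()
        if first is None:
            first = now  # first token arrived
        last = now
        n += 1
    if n == 0 or first is None:
        return (0.0, 0.0, 0)
    ttft = first - start
    gen_time = last - first
    tps = (n - 1) / gen_time if gen_time > 0 else float("inf")
    return (ttft, tps, n)
```

Feed it the streaming response from whatever API endpoint your exo version exposes; repeat per cluster size (1, 2, 3, 4 Mac minis) and record the numbers.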
Ref:
Request: Raspberry Pi. We could bundle it with a Coral USB TPU (https://coral.ai/products/), which could make for super cost-effective home AI inference.
List of supported hardware:
2x4090 on exo vs bunch of 4090s in one pc
Coral is a bit hard to get hold of nowadays, though.
There is also this partnership: https://www.raspberrypi.com/news/raspberry-pi-ai-kit-available-now-at-70/
The people want to know how fast it is.
Yes, ideally, I'd like to know:
- Cluster details
- Model used, including quantization bit width
- Settings used, like context window size, caching settings
- Cold start or 2nd+ run
- Prompt used
- Time to first token
- Tokens/sec after first token
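The checklist above could be captured in a small report structure so every submitted benchmark has the same fields and can be pasted into a results table. This is a sketch; the field names and the markdown-row format are my own suggestion.

```python
from dataclasses import dataclass, asdict

@dataclass
class BenchmarkReport:
    cluster: str        # e.g. "4x Mac mini M4 base, Thunderbolt"
    model: str          # model name
    bits: int           # quantization bit width
    context_window: int
    caching: str        # KV-cache / prompt-cache settings
    cold_start: bool    # True = cold start, False = 2nd+ run
    prompt: str
    ttft_s: float       # time to first token, seconds
    tok_per_sec: float  # tokens/sec after the first token

    def markdown_row(self) -> str:
        # One row for a shared results table, fields in declaration order.
        return "| " + " | ".join(str(v) for v in asdict(self).values()) + " |"
```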
Related to the benchmarking question: my basic understanding is that the cluster cannot exceed the speed of the fastest single machine at processing a layer, correct?
E.g., if I can run the model on a single base-model M4 mini, adding a second M4 mini will increase the cluster's total throughput (e.g., it can serve 2 simultaneous requests), but a single request will run at the same speed as before. Is that correct?
In my experience, adding a node halves the total tokens/sec throughput :(
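The intuition in the two comments above can be made concrete with a toy latency model, assuming layer-wise pipeline parallelism (the model's layers split across nodes): every generated token still passes through all layers in order, so compute time per token doesn't shrink, and each node boundary adds a network hop. The numbers below are hypothetical, purely for illustration.

```python
def per_token_latency_ms(total_compute_ms: float, nodes: int, hop_ms: float) -> float:
    """Toy model of layer-wise pipeline parallelism for ONE request.

    A single request's token visits every layer sequentially, so the full
    compute time is always paid; each of the (nodes - 1) node boundaries
    adds one network hop per token.
    """
    return total_compute_ms + (nodes - 1) * hop_ms

# Hypothetical: 50 ms of total layer compute, 5 ms per hop.
one_node = per_token_latency_ms(50.0, 1, 5.0)   # 50 ms/token -> 20 tok/s
two_nodes = per_token_latency_ms(50.0, 2, 5.0)  # 55 ms/token -> ~18 tok/s
```

With slow links (e.g. Wi-Fi, where hop_ms can dwarf compute), the hop term dominates and single-request throughput can halve or worse, which matches the experience reported above. The wins from adding nodes are fitting models too large for any one machine and serving concurrent requests on different pipeline stages, not speeding up a single request.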
Wen benchmarks?