Example setups with benchmarks

Open AlexCheema opened this issue 1 year ago • 11 comments

Users want to know exactly which setups work, how to set them up, and what the benchmarks are.

A simple benchmark we can run is on Mac Minis. We have 4 of them, so we can progressively add Mac Minis and measure the tok/sec at each cluster size.
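One way to summarize a "progressively add nodes" run is a small scaling table. This is only a sketch: `scaling_summary` is a hypothetical helper, not part of exo, and it assumes you have already measured one tok/sec figure per cluster size yourself.

```python
from typing import Dict, List, Tuple


def scaling_summary(
    tok_per_sec_by_nodes: Dict[int, float],
) -> List[Tuple[int, float, float, float]]:
    """Return (nodes, tok/sec, speedup, efficiency) rows, sorted by node count.

    Speedup is relative to the smallest cluster measured. Efficiency is the
    speedup divided by the growth in node count, so 1.0 means linear scaling.
    """
    if not tok_per_sec_by_nodes:
        return []
    base_nodes = min(tok_per_sec_by_nodes)
    base_rate = tok_per_sec_by_nodes[base_nodes]
    rows = []
    for nodes in sorted(tok_per_sec_by_nodes):
        rate = tok_per_sec_by_nodes[nodes]
        speedup = rate / base_rate
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, rate, speedup, efficiency))
    return rows
```

Pass it a dict mapping node count to measured tok/sec. An efficiency below 1.0 means the extra nodes are not paying for themselves at that cluster size.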

Ref:

Screenshot 2024-07-16 at 09 11 37

AlexCheema avatar Jul 16 '24 16:07 AlexCheema

Request: Raspberry Pi. We could bundle it with a Coral USB TPU (https://coral.ai/products/), which could make for super cost-effective home AI inference.

Screenshot 2024-07-16 at 10 24 06

AlexCheema avatar Jul 16 '24 17:07 AlexCheema

List of supported hardware:

Screenshot 2024-07-16 at 13 46 40

AlexCheema avatar Jul 16 '24 20:07 AlexCheema

2x 4090 on exo vs. a bunch of 4090s in one PC

IMG_0076

AlexCheema avatar Jul 17 '24 17:07 AlexCheema

> We could bundle it with Coral USB TPU

Coral is a little bit hard to get nowadays.

There is also this partnership: https://www.raspberrypi.com/news/raspberry-pi-ai-kit-available-now-at-70/

audkar avatar Jul 17 '24 20:07 audkar

The people want to know how fast it is.

IMG_7523

AlexCheema avatar Jul 18 '24 18:07 AlexCheema

Screenshot 2024-07-19 at 02 22 29

AlexCheema avatar Jul 19 '24 09:07 AlexCheema

IMG_0093

AlexCheema avatar Jul 19 '24 11:07 AlexCheema

Yes, ideally, I'd like to know:

  • Cluster details
  • Model used, including bit size
  • Settings used, like context window size, caching settings
  • Cold start or 2nd+ run
  • Prompt used
  • Time to first token
  • Tokens/sec after first token
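The two timing items above (time to first token, and tokens/sec after the first token) can be measured from any token stream. A minimal sketch, assuming the client exposes generated tokens as a Python iterable; `measure_stream` and `StreamStats` are hypothetical names, not an exo API:

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamStats:
    tokens: int                        # total tokens received
    ttft_s: Optional[float]            # time to first token; None if no tokens
    decode_tok_per_s: Optional[float]  # tokens/sec after the first token


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamStats:
    """Consume a token stream and time it.

    The clock starts before the first token is requested, so the time to
    first token includes prompt processing and any cold-start cost.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        return StreamStats(0, None, None)
    decode_time = last - first
    # The first token is excluded so the rate reflects steady-state decoding.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return StreamStats(count, first - start, rate)
```

The other items in the list (cluster details, model and bit size, settings, cold start or 2nd+ run, prompt) are metadata to record next to these numbers. Running the same prompt twice and reporting both runs separates cold-start cost from warm performance.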

jhgoodwin avatar Dec 09 '24 14:12 jhgoodwin

Related to this question of benchmarking: it is my basic understanding that a cluster cannot exceed the speed of its fastest single machine's ability to process a layer. Is that correct?

E.g., if I can run the model on a single base-model M4 mini, then 2x base-model M4 minis will increase the total throughput of the cluster (e.g., it can serve 2 simultaneous requests), but a single request will run at the same speed as before. Is that correct?
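The reasoning behind this question can be written down as a toy model. This is a sketch under stated assumptions, not a description of how exo actually schedules work: it assumes identical nodes, layers split evenly into pipeline stages, a fixed per-hop network latency, and (for the throughput bound) that enough concurrent requests are in flight to keep every stage busy. All function and parameter names are hypothetical.

```python
def single_request_tok_per_s(
    full_model_time_s: float, nodes: int, hop_latency_s: float
) -> float:
    """Tokens/sec for one request in the toy pipeline model.

    Each token still passes through every layer once, so compute time per
    token is unchanged. Extra nodes only add network hops between stages.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    per_token_s = full_model_time_s + (nodes - 1) * hop_latency_s
    return 1.0 / per_token_s


def max_cluster_tok_per_s(full_model_time_s: float, nodes: int) -> float:
    """Upper bound on total tokens/sec with every stage kept busy.

    Each stage holds 1/nodes of the layers, so in steady state one token
    can complete every (full_model_time_s / nodes) seconds. Assumes network
    transfers overlap with compute.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    return nodes / full_model_time_s
```

In this model a second node never speeds up a single request and slows it slightly by one hop, while the aggregate bound doubles. Whether a real cluster gets near that bound depends on concurrent-request support and network overhead, which only measurements can settle.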

jhgoodwin avatar Dec 09 '24 14:12 jhgoodwin

In my experience, adding a node halves the total tokens/sec throughput :(

NotReallyADeveloper avatar Dec 10 '24 20:12 NotReallyADeveloper

Wen benchmarks?

slavakurilyak avatar Dec 22 '24 21:12 slavakurilyak