
Question about Exo

Open yuqiao9 opened this issue 1 year ago • 3 comments

I have successfully deployed exo. Could you explain why, for the same Llama 3.1 model, exo's model is about three times larger than Ollama's, and why exo's execution speed is far slower than Ollama's?

yuqiao9 avatar Sep 05 '24 08:09 yuqiao9

(screenshot) Is it caused by a configuration error on my side?

yuqiao9 avatar Sep 05 '24 08:09 yuqiao9

Hey, thanks for reporting this.

The likely reason: Ollama uses a 4-bit quantized model, while exo uses the unquantized fp16 model.

I have created an issue here (with a $300 bounty) to add quantized model support to the tinygrad inference engine: https://github.com/exo-explore/exo/issues/148. Once this is fixed, exo should use less memory and be as fast as or faster than Ollama. You can also try running with BEAM=2, e.g. BEAM=2 python3 main.py, which should be quite a bit faster (I just tried it myself on one MacBook and it seems ~20% faster). Note that running with BEAM=2 may be a bit slower at the start, then faster once the kernel search has finished.
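Some rough arithmetic illustrates why the on-disk sizes differ by roughly 3x. This is a sketch, not exact: it assumes Llama 3.1 8B (~8.03B parameters), and the ~15% quantization overhead for scales/zero-points is an assumed ballpark figure, not a measured one.

```python
# Back-of-envelope comparison of fp16 vs 4-bit quantized model size.
# Assumptions: ~8.03e9 parameters (Llama 3.1 8B), ~15% overhead in the
# quantized format for per-group scales and metadata.

params = 8.03e9

fp16_bytes = params * 2            # 2 bytes per weight
q4_bytes = params * 0.5 * 1.15     # ~4 bits per weight plus ~15% overhead

print(f"fp16:  {fp16_bytes / 1e9:.1f} GB")
print(f"4-bit: {q4_bytes / 1e9:.1f} GB")
print(f"ratio: {fp16_bytes / q4_bytes:.1f}x")
```

With these assumptions the fp16 model comes out around 16 GB versus roughly 4.6 GB quantized, i.e. about a 3.5x difference, which lines up with what you observed.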

AlexCheema avatar Sep 05 '24 12:09 AlexCheema


Thank you, I will try again later; it's not convenient right now. I have another question. During inference it feels like it is running on the CPU: words appear one by one and there is a noticeable lag. I set the environment variable CUDA=1, and the main.py process does show up in GPU memory. Why is inference so slow? Is this also explained by the reason you mentioned above?

yuqiao9 avatar Sep 06 '24 01:09 yuqiao9

@yuqiao9 the benefits of quantisation are not just that a "big" LLM can fit into a smaller amount of memory; it also reduces the amount of data that has to be moved into the GPU for each calculation, which makes it much faster. An fp16 (i.e. 16-bit) model has far more data to move per token than a 4-bit quantised one, so every token simply takes a lot longer.
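The point above can be sketched with a memory-bandwidth-bound estimate: during decoding, each generated token requires streaming roughly all of the weights through the accelerator once, so the upper bound on tokens/second scales inversely with bytes per weight. The parameter count and the 100 GB/s effective bandwidth below are illustrative assumptions, not measurements of any particular machine.

```python
# Illustrative estimate of per-token throughput when decoding is
# memory-bandwidth bound. Assumptions: 8e9 parameters, 100 GB/s
# effective memory bandwidth; real systems add further overheads.

params = 8e9
bandwidth = 100e9  # bytes per second, assumed

for name, bytes_per_weight in [("fp16", 2.0), ("4-bit", 0.5)]:
    weight_bytes = params * bytes_per_weight   # bytes streamed per token
    tok_per_s = bandwidth / weight_bytes       # bandwidth-limited ceiling
    print(f"{name}: ~{tok_per_s:.0f} tokens/s upper bound")
```

Under these assumptions the 4-bit model's ceiling is 4x higher than fp16's (here ~25 vs ~6 tokens/s), which is why quantisation speeds up generation and not just loading.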

You can probably close this issue if your questions are answered now.

Rjvs avatar Oct 21 '24 04:10 Rjvs