Question about Exo
I have successfully deployed Exo. Could you explain why, for the same Llama 3.1 model, Exo's model is about three times larger than Ollama's, and why Exo's inference is far slower than Ollama's?
Is this caused by a configuration error on my side?
Hey, thanks for reporting this.
The likely reason: ollama is using a 4-bit quantized model, whereas exo is using the unquantized fp16 model.
I have created an issue here (with $300 bounty) to add quantized model support to the tinygrad inference engine: https://github.com/exo-explore/exo/issues/148. Once this is fixed, it should use less memory and be as fast or faster than ollama. If you want you can also try running with BEAM=2 e.g. BEAM=2 python3 main.py which should be quite a bit faster (just tried it myself on one MacBook and seems to be ~20% faster). Note that running with BEAM=2 might be a bit slower at the start but then faster.
Thank you, I will try again later; it's not convenient to try right now. I have another question. During inference it feels like it is running on the CPU: words appear one by one, and there is noticeable lag. I set the environment variable CUDA=1, and a main.py process does show up in GPU memory, yet inference is still slow. What could the reason be? Does it come down to the same cause you mentioned above?
@yuqiao9 the benefit of quantisation is not just that a "big" LLM can fit into a smaller amount of memory; it also reduces the amount of data that has to be moved into the GPU for each calculation, which makes it much faster. An fp16 (i.e. 16-bit) model has far more data to move and more math to do per token than a 4-bit quantised one, so every token simply takes longer.
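As a back-of-envelope sketch of the size difference (the parameter count is approximate, and real 4-bit formats carry some extra overhead for quantisation scales, which is why the observed ratio is closer to 3x than 4x):

```python
# Rough weight-storage comparison for an ~8B-parameter model
# such as Llama 3.1 8B (parameter count is approximate).
params = 8_000_000_000

fp16_bytes = params * 2    # 16 bits = 2 bytes per weight
q4_bytes = params * 0.5    # 4 bits = 0.5 bytes per weight
                           # (ignores scale/zero-point overhead)

print(f"fp16 : {fp16_bytes / 1e9:.0f} GB")   # → 16 GB
print(f"4-bit: {q4_bytes / 1e9:.0f} GB")     # → 4 GB
print(f"ratio: {fp16_bytes / q4_bytes:.0f}x")  # → 4x
```

The same ratio applies to memory bandwidth: every token requires streaming the weights through the GPU, so a 4x smaller model means roughly 4x less data moved per token.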
You can probably close this issue if your questions are answered now.