Add Llama.cpp Support
This PR adds support for Llama.cpp and closes #167.
Hey @bayedieng, just checking in. Anything I can help with to move this along?
Hey @AlexCheema, I was indeed initially having trouble understanding the codebase; however, it's clearer now (inheritance can be confusing). I wrote a basic sharded inference engine class and will proceed with the implementation.
My plan is to largely follow the PyTorch and tinygrad inference engine implementations, with the one exception of skipping the tokenizer. The llama.cpp API's tokenizer is tied to an instantiated Llama class. Also, the tokenizers defined in the other implementations don't seem to tokenize inputs; rather, they apply a chat template in the handle_chat_completions function of the ChatGPT API. I will implement tokenization manually later in the call stack.
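For reference, the basic sharded inference engine class mentioned above could look something like the following sketch. This is a hypothetical outline, not the actual exo interface: the `Shard` fields and the `infer_prompt`/`infer_tensor`/`ensure_shard` method names are assumptions modeled loosely on how the PyTorch and tinygrad engines split a model by layer range.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Hypothetical shard metadata: which contiguous slice of a model's
# layers this node is responsible for.
@dataclass
class Shard:
    model_id: str
    start_layer: int
    end_layer: int
    n_layers: int

    def is_first_layer(self) -> bool:
        return self.start_layer == 0

    def is_last_layer(self) -> bool:
        return self.end_layer == self.n_layers - 1

# Minimal shape of a sharded inference engine: load only this shard's
# layers, run them, and hand the activations to the next node.
class ShardedInferenceEngine(ABC):
    @abstractmethod
    def ensure_shard(self, shard: Shard) -> None:
        """Load only the layers in [start_layer, end_layer]."""

    @abstractmethod
    def infer_prompt(self, shard: Shard, prompt: str):
        """Entry point on the first shard: tokenize and run the prompt."""

    @abstractmethod
    def infer_tensor(self, shard: Shard, hidden_state):
        """Continue inference from another node's output activations."""
```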
I will let you know if I have any further questions, and I'm looking to have at least one working model running inference later today.
How is this going @bayedieng? Anything I can help with?
I started a branch using GGML, given that the llama.cpp API doesn't expose the model weights and thus they can't be sharded. What, then, would be the requirements for this PR to be considered done? Each model type (vision or pure-text LLM) would have to be implemented separately, as GGML doesn't have automatic model generation similar to PyTorch's.
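To make the sharding requirement concrete: the reason weight access matters is that each node should load only a contiguous slice of the transformer layers. A minimal sketch of how those slices might be computed (a hypothetical helper, not exo's actual partitioning code):

```python
# Illustration of why sharding needs access to the weights: each node
# loads only a contiguous range of the model's transformer layers.
def partition_layers(n_layers: int, n_nodes: int):
    """Split layer indices into contiguous (start, end) ranges, one per node."""
    base, extra = divmod(n_layers, n_nodes)
    ranges, start = [], 0
    for i in range(n_nodes):
        # Spread any remainder layers over the first `extra` nodes.
        count = base + (1 if i < extra else 0)
        ranges.append((start, start + count - 1))
        start += count
    return ranges

# e.g. a 32-layer Llama across 3 nodes:
# partition_layers(32, 3) → [(0, 10), (11, 21), (22, 31)]
```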
Let's start with Llama, which would support Llama, Phi and Mistral model weights since they're all based on Llama. That would be sufficient for this bounty. We can add the other models in a follow-up bounty.
How's this going @bayedieng? Anything I can help with?
I was waiting on confirmation of the bounty requirements. I should have a working Llama implementation within the working week.
Checking in again.
You can use the PR for TorchInferenceEngine as a reference: https://github.com/exo-explore/exo/pull/139
Thanks. I'm quite used to the exo codebase at this point; however, I've been struggling quite a bit with the GGML API, as it's quite low-level. It is largely undocumented, and essentially any error leads to a crash with very little information as to why.
This will likely take much more time than I initially anticipated. That said, I think a simpler way to support llama.cpp models would be to use Candle: it also supports the GGUF weight format that llama.cpp uses, and it has Python bindings with an implementation of quantized Llama, which would be essentially the same thing I would have implemented in GGML but with a much simpler API.
I fully understand if you'd still prefer to go forward with llama.cpp, as Candle may not fill your needs, but using Candle would be much simpler.
I made a few simpler changes in another branch that conflicts with this one, so I'll be closing this PR and opening a new one.