Add Llama.cpp Support
This PR adds support for Llama.cpp and closes #167.
Hey @bayedieng, just checking in. Anything I can help with to move this along?
Hey @AlexCheema, I was indeed initially having trouble understanding the codebase; however, it's clearer now (inheritance can be confusing). I wrote a basic sharded inference engine class and will proceed with the implementation.
My plan is to largely follow the PyTorch and tinygrad inference engine implementations, with the one exception of skipping the tokenizer. The llama.cpp API's tokenizer is tied to an instantiated Llama class. Also, the tokenizers defined in the other implementations don't seem to tokenize inputs; rather, they apply a chat template in the handle_chat_completions function of the ChatGPT API. I will implement tokenization manually later in the call stack.
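For reference, the basic sharded inference engine class mentioned above could look something like the following sketch. This is a hypothetical outline, not the actual exo interface: the `Shard` fields and the `infer_prompt`/`infer_tensor`/`ensure_shard` method names are assumptions modeled loosely on how the PyTorch and tinygrad engines split a model by layer range.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Hypothetical shard metadata: which contiguous slice of a model's
# layers this node is responsible for.
@dataclass
class Shard:
    model_id: str
    start_layer: int
    end_layer: int
    n_layers: int

    def is_first_layer(self) -> bool:
        return self.start_layer == 0

    def is_last_layer(self) -> bool:
        return self.end_layer == self.n_layers - 1

# Minimal shape of a sharded inference engine: load only this shard's
# layers, run them, and hand the activations to the next node.
class ShardedInferenceEngine(ABC):
    @abstractmethod
    def ensure_shard(self, shard: Shard) -> None:
        """Load only the layers in [start_layer, end_layer]."""

    @abstractmethod
    def infer_prompt(self, shard: Shard, prompt: str):
        """Entry point on the first shard: tokenize and run the prompt."""

    @abstractmethod
    def infer_tensor(self, shard: Shard, hidden_state):
        """Continue inference from another node's output activations."""
```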
I will let you know if I have any further questions, and I'm looking to have at least one working model running inference later today.
How is this going @bayedieng? Anything I can help with?
I started a branch using GGML, given that the llama.cpp API doesn't expose the model weights and thus they can't be sharded. What, then, would be the requirements for this PR to be considered done? Each model type (vision or pure-text LLM) would have to be implemented separately, as GGML doesn't have automatic model generation similar to PyTorch's.
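To make the sharding requirement concrete: the reason weight access matters is that each node should load only a contiguous slice of the transformer layers. A minimal sketch of how those slices might be computed (a hypothetical helper, not exo's actual partitioning code):

```python
# Illustration of why sharding needs access to the weights: each node
# loads only a contiguous range of the model's transformer layers.
def partition_layers(n_layers: int, n_nodes: int):
    """Split layer indices into contiguous (start, end) ranges, one per node."""
    base, extra = divmod(n_layers, n_nodes)
    ranges, start = [], 0
    for i in range(n_nodes):
        # Spread any remainder layers over the first `extra` nodes.
        count = base + (1 if i < extra else 0)
        ranges.append((start, start + count - 1))
        start += count
    return ranges

# e.g. a 32-layer Llama across 3 nodes:
# partition_layers(32, 3) → [(0, 10), (11, 21), (22, 31)]
```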
Let's start with Llama, which would support Llama, Phi and Mistral model weights since they're all based on Llama. That would be sufficient for this bounty. We can add the other models in a follow-up bounty.
How's this going @bayedieng? Anything I can help with?
I was waiting on confirmation of the bounty requirements. I should have a working Llama implementation within the working week.
Checking in again.
You can use the PR for TorchInferenceEngine as a reference: https://github.com/exo-explore/exo/pull/139
Thanks. I'm quite used to the exo codebase at this point; however, I've been struggling quite a bit with the GGML API, as it's quite low-level. It is largely undocumented, and essentially any error leads to a crash with very little information as to why.
This will likely take much more time than I initially anticipated. That said, I think a simpler way to support llama.cpp models would be to use Candle: it also supports the GGUF weight format that llama.cpp uses, and it has Python bindings with an implementation of quantized Llama, which would be essentially the same thing I would have implemented in GGML but with a much simpler API.
I fully understand if you'd still prefer to go forward with llama.cpp, as Candle may not fill your needs, but using Candle would be much simpler.
I made a few simpler changes in another branch that conflicts with this one, so I'll be closing this PR and opening a new one.