llama-cpp-python
Multimodal Llama3 Support
I came across a model on Huggingface that supports multimodal Llama 3, Bunny-Llama-3-8B-V: bunny-llama, and I'd like to be able to deploy it using llama-cpp-python!
However, the existing chat_format: llama-3 doesn't seem to support running it.
I converted the model to GGUF format via llama.cpp and ran it with the following configuration:
python llama.cpp/convert.py \
Bunny-Llama-3-8B-V --outtype f16 \
--outfile converted.bin \
--vocab-type bpe
{
  "host": "0.0.0.0",
  "port": 8080,
  "api_key": "xx",
  "models": [
    {
      "model": "bunny-llama.gguf",
      "model_alias": "bunny-llama",
      "chat_format": "llama-3",
      "n_gpu_layers": -1,
      "offload_kqv": true,
      "n_threads": 12,
      "n_batch": 512,
      "n_ctx": 2048
    }
  ]
}
python3 -m llama_cpp.server \
--config_file bunny-llama.json
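Once the server is up it exposes an OpenAI-compatible API, so you can test the deployment with a plain HTTP request. A minimal sketch using only the standard library; the host, port, API key, and model alias below are taken from the config above, and the prompt is an arbitrary example:

```python
import json
import urllib.request

SERVER = "http://0.0.0.0:8080"  # "host"/"port" from the config above
API_KEY = "xx"                  # "api_key" from the config above


def build_chat_request(prompt: str, model: str = "bunny-llama") -> dict:
    """Build an OpenAI-style chat completion payload.

    `model` must match "model_alias" in the server config.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }


def send_chat_request(payload: dict) -> dict:
    """POST the payload to the server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    # Requires the server started above to be running.
    reply = send_chat_request(build_chat_request("Introduce yourself."))
    print(reply["choices"][0]["message"]["content"])
```

The same endpoint also accepts the official OpenAI Python client pointed at `http://0.0.0.0:8080/v1`.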
Check out #1147, it should be merged soon. The only caveat here is that you'll also need to use the llava example in llama.cpp to extract the image encoder when you quantize the model.
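For reference, the multimodal path in llama-cpp-python goes through a llava-style chat handler that takes the extracted image encoder separately from the language model. A minimal sketch, assuming the encoder was extracted to `mmproj.gguf` (both file paths here are placeholders, and whether `Llava15ChatHandler` is the right handler for Bunny depends on what #1147 merges):

```python
def build_image_message(image_url: str, prompt: str) -> list:
    """OpenAI-style multimodal message: the image, then the text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def run_multimodal(model_path: str, clip_path: str,
                   image_url: str, prompt: str) -> str:
    """Load the language model plus image encoder and describe an image."""
    # Imported here so the message helper above works even without
    # llama-cpp-python installed.
    from llama_cpp import Llama
    from llama_cpp.llama_chat_format import Llava15ChatHandler

    chat_handler = Llava15ChatHandler(clip_model_path=clip_path)
    llm = Llama(
        model_path=model_path,
        chat_handler=chat_handler,
        n_ctx=2048,       # images consume context, so keep this generous
        n_gpu_layers=-1,
    )
    out = llm.create_chat_completion(
        messages=build_image_message(image_url, prompt)
    )
    return out["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Placeholder paths: the converted language model and the image
    # encoder extracted with llama.cpp's llava example.
    print(run_multimodal("bunny-llama.gguf", "mmproj.gguf",
                         "https://example.com/cat.png",
                         "What is in this image?"))
```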