Add support for Chameleon
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [x] Medium
- [ ] High
This PR adds support for the Chameleon model. For now, the implementation only supports text->text inference and serves as a base for implementing the (more interesting) image->text, text->image and interleaved pipelines. However, such an implementation will probably require changes to the CLI and internal architecture, so I suggest doing that in a separate PR.
Chameleon is based on the Llama-2 architecture with the following changes:
- different (pre-)tokenizer
- qk-norm
- swin-norm
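For context, qk-norm means the query and key vectors are normalized before the attention scores are computed. A minimal numpy sketch of the idea, using RMS-style normalization (the exact norm type, epsilon, and per-head shapes here are illustrative assumptions, not the model's actual configuration):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # Normalize the last dimension to unit RMS, then apply a learned scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

# Toy per-head query/key vectors (head_dim = 4); norm weights set to ones.
head_dim = 4
q = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, -1.0, 2.0, 0.0])
w = np.ones(head_dim)

q_n, k_n = rms_norm(q, w), rms_norm(k, w)

# Attention logits are then computed from the normalized vectors:
logit = q_n @ k_n / np.sqrt(head_dim)
```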
Note 1: to enable text->text inference, the image token logits are suppressed, similar to the HF implementation. This needs to be removed when image support is added.
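The suppression in Note 1 amounts to masking the image-token logits before sampling, so the sampler can never pick an image token. A hedged numpy sketch of the idea (the vocabulary size and image-token id range are made up for illustration):

```python
import numpy as np

def suppress_image_tokens(logits, image_token_ids):
    # Set image-token logits to -inf so softmax assigns them zero probability.
    masked = logits.copy()
    masked[image_token_ids] = -np.inf
    return masked

# Toy vocabulary of 10 tokens; pretend ids 6-9 are image tokens.
logits = np.linspace(0.0, 1.0, 10)
masked = suppress_image_tokens(logits, np.arange(6, 10))

probs = np.exp(masked - masked[np.isfinite(masked)].max())
probs /= probs.sum()
# Image tokens now have zero probability and can never be sampled.
```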
Note 2: I implemented swin-norm, but I haven't tested it yet, as it is only used by Chameleon-30B.
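Swin-norm changes where the normalization sits in each transformer block: instead of the usual Llama-style pre-norm residual, the norm is applied to the sublayer output, as in Swin Transformer. A toy sketch contrasting the two orderings (the sublayer here is a stand-in function, not the real attention/FFN):

```python
import numpy as np

def norm(x):
    # Plain RMS normalization as a stand-in for the model's norm layer.
    return x / np.sqrt(np.mean(x * x) + 1e-5)

def sublayer(x):
    # Stand-in for attention or FFN: any function of x works for the sketch.
    return 2.0 * x + 1.0

x = np.array([1.0, -2.0, 3.0])

pre_norm_out = x + sublayer(norm(x))   # pre-norm: normalize the input
swin_norm_out = x + norm(sublayer(x))  # swin-norm: normalize the output
```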
To test it:
```shell
git clone https://huggingface.co/facebook/chameleon-7b
./convert-hf-to-gguf.py chameleon-7b
build/bin/llama-cli -m chameleon-7b/ggml-model-f16.gguf --temp 0.8 -s 1000 -n 50 -p "Language modeling is " -ngl 33
```
Output:
```
Language modeling is “the task of predicting the next word in a sequence of text, given the previous words.”
To implement a language model, we can use a neural network with a bidirectional LSTM layer and a softmax output layer.
```
Reference (requires `transformers>=4.43.0.dev0`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(1000)
model = AutoModelForCausalLM.from_pretrained("facebook/chameleon-7b", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/chameleon-7b")
prompt = "Language modeling is "
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=40)
print(tokenizer.decode(out[0]))
```
Reference output:
```
Language modeling is “the task of predicting the next word in a sequence of text given the previous words.”
In other words, it's a machine learning model that takes a sequence of text as input
```
Partially addresses #7995.
I have uploaded GGUFs to test this PR with here.
will this ever get added :(
I think it would still be a good addition. I've resolved all conflicts with master now, so it should be ready to merge.
Thank you @nopperl, looks like it got merged!
@nopperl any plans to tackle image->text and text->image?
@MasterScrat currently no plans, sorry for the late reply. AFAIK multimodal support would require a refactor of llama.cpp (https://github.com/ggerganov/llama.cpp/issues/8010#issuecomment-2376339571). I'd love to work on it, but don't have the time right now.