llama.cpp
llama : add Falcon LLM support
Falcon LLM 40B and 7B were just open sourced under a license which allows commercial use (~~with royalties for over $1 million revenue per year~~) and are topping the Huggingface Open LLM leaderboard. They seem to be based on a modified GPT-3 architecture. I'm wondering if support in llama.cpp would be considered.
https://huggingface.co/tiiuae/falcon-40b
First we need to implement ggml
Mind elaborating on that? It does not seem to make sense in context.
From what I've read (I've not tested it), the model seems significantly better than LLaMA. While it has a kind of shitty license for commercial growth (free until 1MM/y revenue, then 10%), it's better than illegal.
It's using FlashAttention and multi-query attention. gg already has branches with FlashAttention, so I don't see that "implementation barrier".
I've just invested almost an hour of prompting into Instruct Falcon 40B and it's significantly smarter than OpenAssistant 30B, despite being less well tuned. It is smarter than Turbo on some of the tests I ran, though not as good as Turbo overall. I need to develop new tests now, as Falcon-40B can beat all of those I currently had in the "Legacy/GPT-4 only" section.
There's a guy who provided a q4b version of Falcon7B. Would it be of some use for llama.cpp?
https://github.com/Birch-san/falcon-play
Falcon has the full precision binaries available here:
https://huggingface.co/tiiuae/falcon-40b/tree/main
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b
https://huggingface.co/tiiuae/falcon-7b-instruct
https://huggingface.co/tiiuae/falcon-rw-1b
That is where the work should start; the pre-quantized versions are not useful imho.
I'm not 100% sure yet, but from my tests I believe that we have a superior successor to LLaMA on our hands that covers all our use cases (from small to large). I also tried some bias tests (given its origin): Falcon 40B Instruct is surprisingly unbiased, and it felt like a bit of Turbo or GPT-4 "tuning" went into it ('As an AI model'). It remains to be tested and compared in detail, of course.
It solved riddles that Turbo, Alpaca and OpenAssistant 30B cannot solve.
Cautiously put: it looks like the 40B Falcon might outperform the largest 65B LLaMA (it does so in the benchmarks).
I don't know why I'm not able to convert it to .ggml, like other models.
Loading model file /mnt/m/llama_model/falcon-40b/pytorch_model-00009-of-00009.bin
Traceback (most recent call last):
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1168, in <module>
main()
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1148, in main
model_plus = load_some_model(args.model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1076, in load_some_model
model_plus = merge_multifile_models(models_plus)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 583, in merge_multifile_models
model = merge_sharded([mp.model for mp in models_plus])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in merge_sharded
return {name: convert(name) for name in names}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in <dictcomp>
return {name: convert(name) for name in names}
^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in convert
lazy_tensors: List[LazyTensor] = [model[name] for model in models]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in <listcomp>
lazy_tensors: List[LazyTensor] = [model[name] for model in models]
~~~~~^^^^^^
KeyError: 'transformer.word_embeddings.weight'
@danmaxis
I don't know why I'm not able to convert it to .ggml, like other models.
Because it is a different type of model. LLaMA based models have a certain structure. Falcon is not based on LLaMA, there's a different set of tensors, the tensors have different names, etc.
The conversion app can't handle Falcon models yet.
@KerfuffleV2 can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say GPT-3? Will be super grateful!
How much of all the work done in this repo could easily be transferred to future models and architectures?
It looks like the happy days of the original LLaMA models may soon be over, as they start to get beaten by models with different architectures and more attractive licensing (see the Open LLM Leaderboard).
As the flora of LLM architectures will continue to grow and new ones will replace the old, I think this repo and the LLM examples in the ggml repo should be merged into something like ggml_llm.
The ggml_llm would contain all the common LLM code (main inference / perplexity / file handling / quantization / sampling ..) and the code for each architecture could be like plugins added at compile time. The gpt4all-backend may be a good starting point for how such a structure could be built.
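A rough sketch of that plugin idea, in Python for brevity rather than the C/C++ such a library would actually use. Every name here (`ARCHITECTURES`, `register`, `evaluate`, the placeholder evaluators) is hypothetical and not taken from ggml, llama.cpp, or gpt4all-backend, and registration at import time is only a stand-in for "added at compile time":

```python
from typing import Callable, Dict, List

# Hypothetical registry: architecture name -> evaluation function.
ARCHITECTURES: Dict[str, Callable[[List[int]], List[float]]] = {}


def register(name: str):
    """Decorator that adds an architecture-specific evaluator to the registry."""
    def decorator(fn):
        ARCHITECTURES[name] = fn
        return fn
    return decorator


@register("llama")
def eval_llama(tokens: List[int]) -> List[float]:
    # Placeholder: a real plugin would build the LLaMA compute graph here.
    return [0.0 for _ in tokens]


@register("falcon")
def eval_falcon(tokens: List[int]) -> List[float]:
    # Placeholder: a real plugin would build the Falcon compute graph here.
    return [1.0 for _ in tokens]


def evaluate(arch: str, tokens: List[int]) -> List[float]:
    """Common entry point: shared code dispatches to the per-architecture plugin."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unsupported architecture: {arch}")
    return ARCHITECTURES[arch](tokens)
```

The shared pieces (sampling, quantization, file handling) would only ever call `evaluate`, so adding a new architecture would mean adding one registered function and nothing else.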
https://github.com/ggerganov/ggml/issues/185 https://github.com/ggerganov/ggml/pull/145#issuecomment-1544733902
@jessejohnson
can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say GPT-3?
I don't want to get too offtopic here, so if you want detailed information you'd probably be better off creating a discussion. I also don't really know the specific architecture of GPT-3, etc., so I can't tell you the exact way two specific types of model differ, just provide some general information.
This is a bit simplified, but a model consists of a bunch of tensors (just big arrays of numbers in various dimensions). The tensors generally have names, like transformer.word_embeddings.weight. Models also usually are set up with some main level tensors and then a set of tensors that are repeated in a number of layers. So you might have main_tensor and then layer.0.tensor1, layer.0.tensor2, layer.1.tensor1, etc. How the tensors are named depends on both the model architecture and the file format. GGML might call the same tensor a different thing from the HuggingFace format.
Anyway, to actually run a model one performs a bunch of math operations on those tensors. Some of the operations are simple like addition, multiplication, some are more complex and can have complicated logic internally like rope, alibi, matrix multiplication, etc.
Which tensors exist in a model and what sequence of those math operations are used to evaluate the model depends on the model architecture. While a LLaMA based model might have main_tensor + layer.0.tensor2 * layer.0.tensor1 * 1.321, a Falcon model might have layer.0.first.weight / (main_bias * 0.5) + layer.0.second.bias or whatever. I just made up completely random names there, they don't actually relate to anything.
The code in something like this project which evaluates a type of model it supports (say LLaMA for example) is set up to look for tensors with specific names, grab that data, perform the various operations in the correct order and then it also expects the result from those operations to be in a specific format as well.
Hopefully this makes it more clear why specific support needs to be added to ML tools to support models that actually have a different architecture.
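To make that concrete, here is a toy sketch of a loader that looks tensors up by name. Only transformer.word_embeddings.weight comes from the traceback earlier in this thread; the other names and the `load_expected` / `missing_names` helpers are made up for illustration and are not the real convert.py logic:

```python
# Names a hypothetical LLaMA-only loader expects (illustrative, not exhaustive).
LLAMA_STYLE_EXPECTED = ["tok_embeddings.weight", "layers.0.attention.wq.weight"]

# A toy stand-in for a Falcon checkpoint: same idea, different tensor names.
falcon_like = {
    "transformer.word_embeddings.weight": [0.1, 0.2],
    "transformer.h.0.self_attention.query_key_value.weight": [0.3],
}


def load_expected(checkpoint: dict, expected_names: list) -> dict:
    """Fetch each expected tensor by name; raises KeyError on the first miss."""
    return {name: checkpoint[name] for name in expected_names}


def missing_names(checkpoint: dict, expected_names: list) -> list:
    """Report which expected tensors the checkpoint does not contain."""
    return [name for name in expected_names if name not in checkpoint]


# Checking up front gives a clearer message than a bare KeyError deep in a traceback.
print("missing:", missing_names(falcon_like, LLAMA_STYLE_EXPECTED))
```

Calling `load_expected(falcon_like, LLAMA_STYLE_EXPECTED)` fails with a KeyError, which is the same kind of failure as a lookup for a tensor name that a file does not contain.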
Thanks @KerfuffleV2, this is exactly what I was looking for!
I took a look and Falcon is Bloom based, uses GPT-NeoX rotary embeddings and GELU activation: https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699 It looks like most of it is covered in https://github.com/NouamaneTazi/bloomz.cpp already.
Though it looks like a bit of a nightmare to adapt everything :(
Can bloomz.cpp run this model?
Not without adaptation. I've not looked into the differences (aside from the parameter and layer counts), but there certainly are some. Also, bloomz.cpp is barebones: no GPU support, etc. It would be a nice first step to get it running there, but llama.cpp is the platform with all the features.
while it has a kind of shitty license for commercial growth (free until 1MM/y revenue, then 10%) it's better than illegal.
As of 3 hours ago, they tweeted that they will forgo any royalties for commercial and research uses. I don't know what this means in practice, but Falcon might become the first capable, genuinely open-source model we get.
They've just updated their Huggingface page to confirm that the models are now available under Apache 2.0: https://huggingface.co/tiiuae .
According to their announcement on the official site, it's the Falcon 40B that is now under Apache 2.0. Not sure if they intend to do the same for the smaller models, or if they plan an even larger, license-restricted one.
https://www.tii.ae/news/uaes-falcon-40b-worlds-top-ranked-ai-model-technology-innovation-institute-now-royalty-free
They updated the main page, not the model pages yet. They are just a bit slow to follow up but it looks like we get a full open source model. Best thing ever exported from Abu Dhabi ?
All models and datasets from them are now confirmed to be Apache 2.0. The model repositories still contain the old license.txt, but the models themselves are tagged Apache.
With Falcon-40B being significantly better than LLaMA-65B, and actually being fully open source under Apache 2.0, it's definitely the new king of open source LLMs. It would be great to see support for it in llama.cpp!
I was actually able to convert, quantize and load the model, but there is still some tensor math to debug and modify, and I have no 40GB GPU to debug the tensor values at each layer, so it produces garbage for now.
I can give you the quantized model if you want to continue my work.
https://github.com/nikisalli/falcon.cpp
Great work! Why don't you start with the 7B model instead? It should require less memory.
@klosax it is still too big! To debug the weights, the model needs to be loaded in fp16 on the GPU. This means that a 24GB GPU is needed in the case of the 7B model, and I do not possess one.
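For a back-of-the-envelope feel for those numbers, here is a small sketch of the weight-only footprint at 2 bytes per parameter (fp16 or bf16). The parameter counts are the nominal 7e9 and 40e9, which is an assumption (the exact counts differ a bit), and activations plus framework overhead come on top of this:

```python
def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; 2 bytes per parameter for fp16/bf16."""
    return n_params * bytes_per_param


def to_gib(n_bytes: float) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Nominal parameter counts (assumed, not read from the checkpoints).
for label, n_params in (("7B", 7e9), ("40B", 40e9)):
    print(f"{label}: ~{to_gib(weight_bytes(n_params)):.1f} GiB for 16-bit weights alone")
```

So the 7B weights alone already land in the low teens of GiB, which is why a 16GB card gets tight once anything else is loaded.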
Truthfully though the initial Falcon work should be done on 7B to ease development; I think the architecture is the same regardless of model size. If it gets traction I'm sure someone with a big GPU will hop in and help with the 40B :hugs:
Like it or not, Llama is limited by its legality and truly open models like Falcon are the way forwards for llama.cpp.
@nikisalli : On the model card it says "head_dim 64 Reduced to optimise for FlashAttention" but in the config.json the number is 128. Maybe try reducing it to 64?
@nikisalli what do you need the GPU for? Why not CPU? ggml/llama.cpp is known for its ability to run on CPU, after all...
I find it useful to run the PyTorch model with many print statements here and there to check that ggml is giving me the same numbers, so that I know which operations to touch.
Oh, you are running the Python one. My bad. But still, you should be able to force CPU mode.
nope :( some layers are not implemented for CPU and half precision!
It's bf16 and I can't run it on my device either.
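A minimal sketch of that comparison workflow, assuming you have already dumped per-layer values from both implementations into plain Python lists. The layer names, the dump format, and the tolerance are all hypothetical; neither PyTorch nor ggml provides these helpers:

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("shape mismatch between reference and candidate")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergence(reference, candidate, tol=1e-3):
    """Return the name of the first layer whose values differ by more than tol.

    Both arguments are lists of (layer_name, values) in evaluation order.
    Returns None when every compared layer matches within tol.
    """
    for (ref_name, ref_vals), (cand_name, cand_vals) in zip(reference, candidate):
        if ref_name != cand_name:
            raise ValueError(f"layer order mismatch: {ref_name} vs {cand_name}")
        if max_abs_diff(ref_vals, cand_vals) > tol:
            return ref_name
    return None


# Made-up dumps: the second layer disagrees, so that is where to start looking.
reference = [("layer0.attn", [0.10, 0.20]), ("layer0.mlp", [0.30, 0.40])]
candidate = [("layer0.attn", [0.10, 0.20]), ("layer0.mlp", [0.30, 0.90])]
print(first_divergence(reference, candidate))  # prints layer0.mlp
```

With bf16 or fp16 on one side and quantized weights on the other, the tolerance would need to be much looser than the default used here.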
I also struggled, didn't get it to run yet. There are significant differences in the attention/kqv handling between 7B and 40B:
Without multi_query (40B):
self.query_key_value = Linear(
    self.hidden_size,
    (config.n_head_kv * 2 + config.n_head) * self.head_dim,
    bias=config.bias,
)
self.dense = Linear(self.hidden_size, self.hidden_size, bias=config.bias)
self.attention_dropout = nn.Dropout(config.attention_dropout)
self.num_kv = config.n_head_kv
With multi_query (7B):
self.query_key_value = Linear(
    self.hidden_size,
    3 * self.hidden_size if not config.multi_query else (self.hidden_size + 2 * self.head_dim),
    bias=config.bias,
)
self.multi_query = config.multi_query
self.dense = Linear(self.hidden_size, self.hidden_size, bias=config.bias)
self.attention_dropout = nn.Dropout(config.attention_dropout)
self.num_kv = 1
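To see what those two expressions work out to, here is a small sketch that plugs in the values from the two model configs (hidden_size 8192, n_head 128, n_head_kv 8 for 40B; hidden_size 4544, n_head 71, multi_query true for 7B). It assumes head_dim = hidden_size // n_head, which is not shown in the excerpts above, and the function names are mine:

```python
def head_dim(hidden_size: int, n_head: int) -> int:
    """Per-head width, assuming head_dim = hidden_size // n_head."""
    if hidden_size % n_head != 0:
        raise ValueError("hidden_size must be divisible by n_head")
    return hidden_size // n_head


def qkv_out_40b(hidden_size: int, n_head: int, n_head_kv: int) -> int:
    """Output width of the fused QKV projection in the 40B-style code."""
    return (n_head_kv * 2 + n_head) * head_dim(hidden_size, n_head)


def qkv_out_7b(hidden_size: int, n_head: int, multi_query: bool) -> int:
    """Output width of the fused QKV projection in the 7B-style code."""
    if not multi_query:
        return 3 * hidden_size
    return hidden_size + 2 * head_dim(hidden_size, n_head)


print("40B head_dim:", head_dim(8192, 128))      # 64
print("40B qkv out :", qkv_out_40b(8192, 128, 8))  # (16 + 128) * 64 = 9216
print("7B head_dim :", head_dim(4544, 71))       # 64
print("7B qkv out  :", qkv_out_7b(4544, 71, True))  # 4544 + 128 = 4672
```

Under that assumption both models end up with a head width of 64, and the fused projection holds all query heads plus 8 key/value pairs for 40B versus a single key/value pair for 7B, so the split into q, k and v has to be done differently for each.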
The relevant config for both:
Config without multiquery (40B):
"hidden_size": 8192,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "RefinedWeb",
"n_head": 128,
"n_head_kv": 8,
"n_layer": 60,
"parallel_attn": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.27.4",
"use_cache": true,
"vocab_size": 65024
Config with multiquery (7B):
"hidden_size": 4544,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "RefinedWebModel",
"multi_query": true,
"n_head": 71,
"n_layer": 32,
"parallel_attn": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.27.4",
"use_cache": true,
"vocab_size": 65024
In the conversion Python module for 7B we'll also need the conv_map changed:
'input_layernorm' : 'attention_norm', # 7B
The handling of the k,q,v re-shape is also different for both.
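As an illustration of what such a conv_map does during conversion, here is a minimal sketch that renames dotted tensor names component by component. Only the 'input_layernorm' to 'attention_norm' entry comes from the comment above; the `rename_tensor` helper and the example tensor name are hypothetical and not the actual conversion script:

```python
# Only this entry is taken from the thread; a real map would have more.
conv_map = {"input_layernorm": "attention_norm"}


def rename_tensor(name: str, mapping: dict) -> str:
    """Replace each dotted component of a tensor name that appears in the mapping."""
    return ".".join(mapping.get(part, part) for part in name.split("."))


# Hypothetical source tensor name, used for illustration only.
print(rename_tensor("transformer.h.0.input_layernorm.weight", conv_map))
```

Matching whole components rather than substrings avoids accidentally rewriting a longer name that merely contains a mapped key.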