llama.cpp llama : add Falcon LLM support

Falcon LLM 40b and 7b were just open sourced under a license which allows commercial use (~~with royalties for over $1 million revenue per year~~) and are topping the Huggingface Open LLM leaderboard. They seem to be based on a modified gpt3 architecture. I’m wondering if support in llama.cpp would be considered.

https://huggingface.co/tiiuae/falcon-40b

May 26 '23 17:05 someone13574

First we need to implement ggml

Mind elaborating on that? It does not seem to make sense in context.

From what I read, I've not tested it, the model seems significantly better than llama, while it has a kind of shitty license for commercial growth (free until 1MM/y revenue, then 10%) it's better than illegal.

It's using flash attention and multiquery. gg already has branches with flashattention. I don't see that "implementation barrier" ?

May 28 '23 13:05 cmp-nct

I've just invested almost an hour of prompting into Instruct Falcon 40B and it's significantly smarter than OpenAssistant 30B, despite being less well tuned. It is smarter than Turbo on some of the tests I ran, though not as good as Turbo overall. I need to develop new tests now, as Falcon-40B can beat all of those I currently had in the "Legacy/GPT-4 only" section.

May 28 '23 17:05 cmp-nct

there's a guy who provided a q4b version of Falcon7B, would it be of some use for llama.cpp ?

https://github.com/Birch-san/falcon-play

May 28 '23 20:05 dseddah

there's a guy who provided a q4b version of Falcon7B, would it be of some use for llama.cpp ?

https://github.com/Birch-san/falcon-play

Falcon has the full precision binaries available here:

https://huggingface.co/tiiuae/falcon-40b/tree/main
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b
https://huggingface.co/tiiuae/falcon-7b-instruct
https://huggingface.co/tiiuae/falcon-rw-1b

From there it should start, the pre-quantized versions are not useful imho.

I'm not 100% sure yet but from my tests I believe that we have a superior successor to llama on our hands that covers all our use cases (from small to large). I also tried some bias tests (given its origin): the Falcon 40B instruct model is surprisingly unbiased. It felt like a bit of Turbo or GPT-4 "tuning" went into it ('As an AI model'). It remains to be tested and compared in detail, of course.

It solved riddles that Turbo, Alpaca and OpenAssistant 30B cannot solve.

Carefully said: It looks like the 40B Falcon might outperform the largest 65B llama (it does so in the benchmarks).

May 28 '23 21:05 cmp-nct

I don't know why I'm not able to convert it to .ggml, like other models.

Loading model file /mnt/m/llama_model/falcon-40b/pytorch_model-00009-of-00009.bin
Traceback (most recent call last):
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1168, in <module>
    main()
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1148, in main
    model_plus = load_some_model(args.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 1076, in load_some_model
    model_plus = merge_multifile_models(models_plus)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 583, in merge_multifile_models
    model = merge_sharded([mp.model for mp in models_plus])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in merge_sharded
    return {name: convert(name) for name in names}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 562, in <dictcomp>
    return {name: convert(name) for name in names}
                  ^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in convert
    lazy_tensors: List[LazyTensor] = [model[name] for model in models]
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danmaxis/llama_local/llamacpp_src/llama.cpp/./convert.py", line 537, in <listcomp>
    lazy_tensors: List[LazyTensor] = [model[name] for model in models]
                                      ~~~~~^^^^^^
KeyError: 'transformer.word_embeddings.weight'

May 29 '23 15:05 danmaxis

@danmaxis

I don't know why I'm not able to convert it to .ggml, like other models.

Because it is a different type of model. LLaMA based models have a certain structure. Falcon is not based on LLaMA, there's a different set of tensors, the tensors have different names, etc.

The conversion app can't handle Falcon models yet.

May 29 '23 15:05 KerfuffleV2

@danmaxis

I don't know why I'm not able to convert it to .ggml, like other models.

Because it is a different type of model. LLaMA based models have a certain structure. Falcon is not based on LLaMA, there's a different set of tensors, the tensors have different names, etc.

The conversion app can't handle Falcon models yet.

@KerfuffleV2 can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say GPT-3? Will be super grateful!

May 30 '23 10:05 jessejohnson

How much of all the work done in this repo could easily be transferred to future models and architectures?

It looks like the happy days of the original LLaMA models may soon be over, as they start to get beaten by models with different architectures and more attractive licensing (see the Open LLM Leaderboard).

As the flora of LLM architectures will continue to grow and new ones will replace the old, I think this repo and the LLM examples in the ggml repo should be merged into something like ggml_llm.

The ggml_llm would contain all the common LLM code (main inference / perplexity / filehandling / quantization / sampling ..) and the code for each architecture could be like plugins added at compile time. The gpt4all-backend may be a good starting point for how such structure could be built.

https://github.com/ggerganov/ggml/issues/185 https://github.com/ggerganov/ggml/pull/145#issuecomment-1544733902
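To make the plugin idea a bit more concrete, here is a minimal sketch of a shared driver with per-architecture code registered separately. It is in Python only for brevity (the real thing would be C/C++ and resolved at compile time), and every name in it (`ARCHITECTURES`, `register`, `run`, the `eval_*` functions) is hypothetical and does not exist in ggml or llama.cpp.

```python
from typing import Callable, Dict, List

# Hypothetical registry: architecture name -> architecture-specific evaluator.
ARCHITECTURES: Dict[str, Callable[[List[int]], str]] = {}


def register(name: str):
    """Decorator that adds an architecture-specific function to the registry."""
    def wrap(fn: Callable[[List[int]], str]) -> Callable[[List[int]], str]:
        ARCHITECTURES[name] = fn
        return fn
    return wrap


@register("llama")
def eval_llama(tokens: List[int]) -> str:
    # Placeholder for the LLaMA-specific compute graph.
    return f"llama graph over {len(tokens)} tokens"


@register("falcon")
def eval_falcon(tokens: List[int]) -> str:
    # Placeholder for the Falcon-specific compute graph.
    return f"falcon graph over {len(tokens)} tokens"


def run(arch: str, tokens: List[int]) -> str:
    """Shared driver: file handling, sampling etc. would live around this call."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unsupported architecture: {arch}")
    return ARCHITECTURES[arch](tokens)
```

With this shape, adding a new model family means adding one registered function, while the common code stays untouched.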

May 30 '23 10:05 klosax

@jessejohnson

can you give me (us, really) an ELI5 of the LLaMA architecture and how it differs from, say GPT-3?

I don't want to get too offtopic here so if you want detailed information you'd probably be better off creating a discussion. I also don't really know the specific architecture of GPT-3, etc, so I can't tell you the exact way two specific types of model differ, just provide some general information.

This is a bit simplified, but a model consists of a bunch of tensors (just big arrays of numbers in various dimensions). The tensors generally have names, like transformer.word_embeddings.weight. Models also usually are set up with some main level tensors and then a set of tensors that are repeated in a number of layers. So you might have main_tensor and then layer.0.tensor1, layer.0.tensor2, layer.1.tensor1 etc. How the tensors are named depends on both the model architecture and the file format. GGML might call the same tensor a different thing from the HuggingFace format.

Anyway, to actually run a model one performs a bunch of math operations on those tensors. Some of the operations are simple like addition, multiplication, some are more complex and can have complicated logic internally like rope, alibi, matrix multiplication, etc.

Which tensors exist in a model and what sequence of those math operations are used to evaluate the model depends on the model architecture. While a LLaMA based model might have main_tensor + layer.0.tensor2 * layer.0.tensor1 * 1.321 a FALCON model might have layer.0.first.weight / (main_bias * 0.5) + layer.0.second.bias or whatever. I just made up completely random names there, they don't actually relate to anything.

The code in something like this project which evaluates a type of model it supports (say LLaMA for example) is set up to look for tensors with specific names, grab that data, perform the various operations in the correct order and then it also expects the result from those operations to be in a specific format as well.

Hopefully this makes it more clear why specific support needs to be added to ML tools to support models that actually have a different architecture.
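As a rough illustration of the name-lookup part only, here is a small sketch. The tensor names are made up in the same spirit as above (they are not the real LLaMA or Falcon names), and `missing_tensors` is a hypothetical helper, not something from convert.py.

```python
from typing import Dict, List

# Names a loader written for one architecture expects to find (made up).
EXPECTED_BY_LOADER: List[str] = ["main_tensor", "layer.0.tensor1", "layer.0.tensor2"]

# A checkpoint from a different architecture: same kind of data, other names (made up).
other_checkpoint: Dict[str, List[float]] = {
    "main_bias": [0.1, 0.2],
    "layer.0.first.weight": [0.3, 0.4],
    "layer.0.second.bias": [0.5, 0.6],
}


def missing_tensors(checkpoint: Dict[str, List[float]], expected: List[str]) -> List[str]:
    """Return the expected tensor names that the checkpoint does not contain."""
    return [name for name in expected if name not in checkpoint]


def load(checkpoint: Dict[str, List[float]], expected: List[str]) -> Dict[str, List[float]]:
    """Fetch tensors by name; a plain lookup fails with KeyError on a foreign model."""
    return {name: checkpoint[name] for name in expected}
```

Calling `load(other_checkpoint, EXPECTED_BY_LOADER)` raises a `KeyError` on the first name it cannot find, which is the general kind of failure you get when a tool meets a model layout it was not written for.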

May 30 '23 13:05 KerfuffleV2

Thanks @KerfuffleV2, this is exactly what I was looking for!

May 30 '23 13:05 jessejohnson

I took a look and Falcon is Bloom based, uses GPT-NeoX rotary embeddings and gelu activation: https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699 It looks like most of it is covered in https://github.com/NouamaneTazi/bloomz.cpp already.

Though looks like a bit of a nightmare to adapt everything :(

May 30 '23 13:05 cmp-nct

I took a look and Falcon is Bloom based, uses GPT-NeoX rot embeddings, gelu activation https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699 It looks like most of it is covered in https://github.com/NouamaneTazi/bloomz.cpp already.

Though looks like a bit of a nightmare to adapt everything :(

Can bloomz.cpp run this model?

May 30 '23 16:05 iHaagcom

Not without adaptation. I've not looked into the differences (aside from the parameter and layer counts) but there certainly are some. Also bloomz is barebones, no GPU support, etc. It would be a nice first step to get it running there but llama.cpp is the platform with all the features.

May 30 '23 16:05 cmp-nct

while it has a kind of shitty license for commercial growth (free until 1MM/y revenue, then 10%) it's better than illegal.

As of 3 hours ago, they tweeted that they will forgo any royalties for commercial and research uses. I don't know what this means in practice but Falcon might become the first capable, genuinely open-source model we get.

May 31 '23 16:05 real-andrew

They've just updated their Huggingface to confirm that the models are now available under Apache 2.0: https://huggingface.co/tiiuae .

May 31 '23 17:05 logicchains

According to their announcement on the official site, it's the Falcon 40B that is now under Apache 2.0. Not sure if they intend to do the same for the smaller models, or if they plan an even larger, license-restricted one.

https://www.tii.ae/news/uaes-falcon-40b-worlds-top-ranked-ai-model-technology-innovation-institute-now-royalty-free

May 31 '23 17:05 jessejohnson

They updated the main page, not the model pages yet. They are just a bit slow to follow up but it looks like we get a full open source model. Best thing ever exported from Abu Dhabi ?

May 31 '23 18:05 cmp-nct

All models and datasets from them are now confirmed to be Apache 2.0. The model repositories still contain the old license.txt, but the models themselves are tagged Apache.

May 31 '23 18:05 Googulator

With Falcon-40B being significantly better than LLaMA-65B, and actually being fully open source under Apache 2.0, it's definitely the new king of open source LLMs. It would be great to see support for it in llama.cpp!

May 31 '23 19:05 JohnAlcatraz

I was actually able to convert, quantize and load the model, but there is still some tensor math to debug and modify, and I have no 40GB gpu to debug the tensor values at each layer! So it produces garbage for now.

I can give you the quantized model if you want to continue my work.

https://github.com/nikisalli/falcon.cpp

May 31 '23 19:05 nikisalli

I was actually able to convert, quantize and load the model, but there is some tensor math to debug and modify but I have no 40GB gpu to debug the tensor values at each layer! so it produces garbage for now

Great work! Why don't you start with the 7B model instead? It should require less memory.

May 31 '23 19:05 klosax

@klosax it is still too big! To debug the weights the model needs to be loaded in fp16 on the gpu. This means that a 24GB gpu is needed in the case of the 7B model and I do not possess one.
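For a rough sense of scale, this is the back-of-the-envelope arithmetic for the weights alone. It assumes nominal parameter counts of 7 and 40 billion and 2 bytes per parameter for fp16/bf16; activations and framework overhead come on top, so treat the numbers as a lower bound. `weight_gib` is just a helper name for this sketch.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(n_params: int, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / GIB


# Nominal parameter counts (assumption: exactly 7e9 and 40e9).
for name, n_params in [("7B", 7_000_000_000), ("40B", 40_000_000_000)]:
    print(f"{name}: ~{weight_gib(n_params, 2):.1f} GiB in 16-bit")
```

So the 7B weights already take roughly 13 GiB in 16-bit before anything else is allocated.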

May 31 '23 20:05 nikisalli

Truthfully though the initial Falcon work should be done on 7B to ease development; I think the architecture is the same regardless of model size. If it gets traction I'm sure someone with a big GPU will hop in and help with the 40B :hugs:

Like it or not, Llama is limited by its legality and truly open models like Falcon are the way forwards for llama.cpp.

May 31 '23 20:05 ghost

@nikisalli : On the model card it says "head_dim 64 Reduced to optimise for FlashAttention" but in the config.json the number is 128. Maybe try reducing it to 64?

May 31 '23 20:05 klosax

@nikisalli what do you need the gpu for? why not cpu?, ggml/llama.cpp is known for its ability to run on cpu after all...

May 31 '23 21:05 Green-Sky

I find it useful to run the pytorch model with many print statements here and there to check that ggml is giving me the same numbers so that I know what operations to touch
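A minimal sketch of that comparison step, assuming the values printed from both sides have already been collected into dicts keyed by a checkpoint name. The function name, the checkpoint names and the tolerance are all made up for illustration; nothing here is ggml or PyTorch API.

```python
import math
from typing import Dict, List, Optional


def first_divergence(
    reference: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    tol: float = 1e-3,
) -> Optional[str]:
    """Return the first checkpoint (in reference order) whose values differ, else None."""
    for name, ref_vals in reference.items():  # dicts keep insertion order
        cand_vals = candidate.get(name)
        # A missing checkpoint or a shape mismatch counts as a divergence.
        if cand_vals is None or len(cand_vals) != len(ref_vals):
            return name
        for r, c in zip(ref_vals, cand_vals):
            if not math.isclose(r, c, rel_tol=tol, abs_tol=tol):
                return name
    return None
```

The first name it returns points at the operation to look at, since everything before it matched within the tolerance.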

May 31 '23 21:05 nikisalli

Oh, you are running the python one. My bad. But still, you should be able to force cpu mode.

May 31 '23 21:05 Green-Sky

nope :( some layers are not implemented for cpu and half precision!

May 31 '23 22:05 nikisalli

It's bf16 and I can't run it on my device either.

May 31 '23 23:05 FNsi

I also struggled, didn't get it to run yet. There are significant differences in the attention/kqv handling between 7B and 40B:

Without multi_query (40B):

        self.query_key_value = Linear(
            self.hidden_size,
            (config.n_head_kv * 2 + config.n_head) * self.head_dim,
            bias=config.bias,
        )
        self.dense = Linear(self.hidden_size, self.hidden_size, bias=config.bias)
        self.attention_dropout = nn.Dropout(config.attention_dropout)
        self.num_kv = config.n_head_kv

With multi_query (7B):

        self.query_key_value = Linear(
            self.hidden_size,
            3 * self.hidden_size if not config.multi_query else (self.hidden_size + 2 * self.head_dim),
            bias=config.bias,
        )
        self.multi_query = config.multi_query
        self.dense = Linear(self.hidden_size, self.hidden_size, bias=config.bias)
        self.attention_dropout = nn.Dropout(config.attention_dropout)
        self.num_kv = 1

The relevant config for both:

Config without multiquery (40B):

  "hidden_size": 8192,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "RefinedWeb",
  "n_head": 128,
  "n_head_kv": 8,
  "n_layer": 60,
  "parallel_attn": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.27.4",
  "use_cache": true,
  "vocab_size": 65024

Config with multiquery (7B):

  "hidden_size": 4544,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "RefinedWebModel",
  "multi_query": true,
  "n_head": 71,
  "n_layer": 32,
  "parallel_attn": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.27.4",
  "use_cache": true,
  "vocab_size": 65024

In the conversion Python module for 7B we'll also need the conv_map changed:

  'input_layernorm' : 'attention_norm', # 7B

The handling of the k, q, v reshape is also different for both.
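To see what the two snippets mean for the fused query_key_value tensor, here is a small sketch that plugs the config values above into the same formulas. It assumes head_dim = hidden_size // n_head (this is not shown in the snippets above), and `fused_qkv_out_features` is a helper name made up for this example.

```python
from typing import Optional


def fused_qkv_out_features(
    hidden_size: int,
    n_head: int,
    n_head_kv: Optional[int] = None,
    multi_query: bool = False,
) -> int:
    """Output width of the fused query_key_value Linear layer."""
    head_dim = hidden_size // n_head  # assumption, see the note above
    if n_head_kv is not None:
        # 40B variant: n_head query heads plus n_head_kv key and value heads.
        return (n_head_kv * 2 + n_head) * head_dim
    if multi_query:
        # 7B variant: all query heads share a single key head and value head.
        return hidden_size + 2 * head_dim
    # Plain multi-head attention: separate q, k, v of full width.
    return 3 * hidden_size


qkv_40b = fused_qkv_out_features(8192, 128, n_head_kv=8)      # (8*2 + 128) * 64
qkv_7b = fused_qkv_out_features(4544, 71, multi_query=True)   # 4544 + 2 * 64
print(qkv_40b, qkv_7b)
```

Under that assumption head_dim comes out as 64 for both models (8192 / 128 and 4544 / 71), and the fused widths are 9216 for 40B and 4672 for 7B, so the conversion has to split the tensor differently in each case.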

Jun 01 '23 02:06 cmp-nct
