Support Falcon

Open zcourts opened this issue 2 years ago • 10 comments

Similar to MPT, Falcon is Apache licensed, weights and all!

  1. https://huggingface.co/tiiuae/falcon-40b
  2. https://huggingface.co/tiiuae/falcon-40b-instruct

And according to the HuggingFace leaderboard, it outperforms all current open-source models, including MPT.

It seems a GGML conversion of the model is a necessary precursor to having it included.

I don't think I have the expertise to do this myself, but we may be able to help (e.g. we can give access to a V100S to do the conversion).

zcourts avatar Jun 02 '23 12:06 zcourts

Already on it: I've got it converted and quantized, but it produced gibberish. I'm waiting on https://github.com/ggerganov/llama.cpp/issues/1602 to see how they will handle the Q, K, V weights. I don't want to create two separate falcon-ggml ecosystems, so I'm waiting for the upstream ggml implementation.
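
For context, the crux is Falcon's fused `query_key_value` tensor: with multi-query attention, each KV group packs several query heads together with a single K and V head, and a converter (or loader) has to slice those row blocks apart. A rough Rust sketch of the row arithmetic, assuming Falcon-40B's 128 query heads and 8 KV groups (this mirrors no real ggml or `llm` API):

```rust
use std::ops::Range;

/// Row offsets (in units of `head_dim`-sized row blocks) of the query heads
/// and the single K and V head for one KV group inside Falcon's fused
/// `query_key_value` weight. Layout assumed: [q_0 .. q_{n-1}, k, v] per group.
fn qkv_rows_for_group(
    group: usize,
    n_head: usize,
    n_head_kv: usize,
) -> (Range<usize>, usize, usize) {
    let q_per_group = n_head / n_head_kv; // query heads sharing one K/V pair
    let stride = q_per_group + 2;         // q heads, then one k block, then one v block
    let base = group * stride;
    (base..base + q_per_group, base + q_per_group, base + q_per_group + 1)
}

fn main() {
    // Falcon-40B: 128 query heads spread over 8 KV groups.
    let (q, k, v) = qkv_rows_for_group(0, 128, 8);
    println!("group 0: q blocks {q:?}, k block {k}, v block {v}");
}
```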

LLukas22 avatar Jun 02 '23 13:06 LLukas22

There's an ongoing discussion worth tracking to get a GGML conversion: https://github.com/ggerganov/llama.cpp/issues/1602

Found that after posting this here. An attempt at a conversion has already been made: https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1570827592

zcourts avatar Jun 02 '23 13:06 zcourts

Looks like our posts overlapped! Great to hear. I've offered to provide GPU access to further the work being done in https://github.com/ggerganov/llama.cpp/issues/1602 and will follow up as that progresses.

zcourts avatar Jun 02 '23 13:06 zcourts

There is now a working GGML example for 40B: https://github.com/ggerganov/ggml/pull/231

KerfuffleV2 avatar Jun 15 '23 02:06 KerfuffleV2

That's great! Maybe I'll create a draft, but I'd like to wait until it gets merged into ggml.

LLukas22 avatar Jun 15 '23 07:06 LLukas22

There's a working one here: https://github.com/jploski/ggml/tree/falcon40b

iHaagcom avatar Jun 16 '23 12:06 iHaagcom

Yeah, I noticed that. It would be great if someone could try porting it to Rust. I'm currently quite busy implementing GPU acceleration for all architectures.😬

LLukas22 avatar Jun 16 '23 12:06 LLukas22

Damn, I was hoping that editing the description would cancel out the issue-closing.

Anyhow: I've merged in the Falcon 7B implementation, but it doesn't handle 40B, and it requires 32-bit memory tensors because the repeat operation it uses doesn't work with 16-bit tensors. Given these caveats, and the continuing work on (one of) the original implementations in https://github.com/cmp-nct/ggllm.cpp, it's merged but disabled by default.
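
A minimal sketch of how a loader might enforce that caveat, with `ElementType` and `memory_type_for` as illustrative stand-ins rather than the actual `llm` crate API:

```rust
/// Illustrative stand-in for a tensor element type; not the real `llm` API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ElementType {
    F32,
    F16,
}

/// Falcon's repeat operation can't consume F16 memory tensors yet, so the
/// requested KV-memory type is overridden to F32 for that architecture.
fn memory_type_for(architecture: &str, requested: ElementType) -> ElementType {
    match architecture {
        "falcon" => ElementType::F32,
        _ => requested,
    }
}

fn main() {
    assert_eq!(
        memory_type_for("falcon", ElementType::F16),
        ElementType::F32
    );
}
```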

I'll keep this issue open until Falcon is truly ready to fly.

philpax avatar Jun 28 '23 23:06 philpax

@LLukas22 should we close this or wait until the model format has stabilised?

philpax avatar Jul 27 '23 09:07 philpax

We should wait until GGUF is implemented and we have all the necessary fields in the model file.
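
For reference, a sketch of the per-architecture metadata Falcon would likely need, following GGUF's draft `<architecture>.<field>` key convention (the exact key set is an assumption until the spec settles):

```rust
// Candidate GGUF metadata keys for Falcon (assumed from the draft spec's
// naming convention; subject to change until GGUF actually lands).
const FALCON_GGUF_KEYS: &[&str] = &[
    "general.architecture",           // expected to be "falcon"
    "falcon.context_length",
    "falcon.embedding_length",
    "falcon.block_count",
    "falcon.attention.head_count",
    "falcon.attention.head_count_kv", // distinguishes 7B (MQA) from 40B-style grouping
    "falcon.attention.layer_norm_epsilon",
];

fn main() {
    for key in FALCON_GGUF_KEYS {
        println!("{key}");
    }
}
```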

LLukas22 avatar Jul 27 '23 09:07 LLukas22