klosax

37 comments by klosax

How much of the work done in this repo could easily be transferred to future models and architectures? It looks like the happy days of the original LLaMA models...

> I was actually able to convert, quantize, and load the model, but there is some tensor math to debug and modify, and I have no 40GB GPU to debug...

@nikisalli: On the [model card](https://huggingface.co/tiiuae/falcon-40b#model-architecture-and-objective) it says "head_dim 64 Reduced to optimise for FlashAttention", but in the config.json the number is 128. Maybe try reducing it to 64?
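As a sanity check (my own arithmetic, using the d_model and head count published for Falcon-40B; treat both as assumptions, not values taken from this thread): head_dim is conventionally derived as hidden_size / n_head, which already comes out to the model card's 64, so the 128 in config.json may be the head count rather than the head dimension:

```cpp
#include <cstdio>

int main() {
    // assumed Falcon-40B values (d_model 8192 from the model card,
    // 128 heads implied by head_dim 64); not confirmed in this thread
    const int hidden_size = 8192;
    const int n_head      = 128;

    // head_dim is conventionally hidden_size / n_head
    const int head_dim = hidden_size / n_head;

    printf("head_dim = %d\n", head_dim);  // prints 64, matching the model card
    return 0;
}
```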

Generation speed for the StoryWriter model:

- at token 1000: about 300 ms per token
- at token 8000: about 2500 ms per token

So if the number of tokens generated is increased 8 times, the...
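A quick check on that scaling (my own arithmetic, not part of the original comment): if per-token attention cost grows linearly with the context position, the per-token latency ratio should roughly track the position ratio, and it does:

$$\frac{2500\,\text{ms}}{300\,\text{ms}} \approx 8.3 \approx \frac{8000}{1000}$$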

We may need the new [Sophia Optimizer](https://arxiv.org/abs/2305.14342) for a 2X increase in training speed compared to Adam.

The architecture that Falcon uses is different than those currently supported. More discussion here: https://github.com/ggerganov/llama.cpp/issues/1602

Falcon LLM ggml framework with CPU and GPU support: https://github.com/cmp-nct/ggllm.cpp

> This actually changes a bit more than that PR; feel free to close though

I will close #206.

> gpt_neox_model_load: ggml ctx size = 17592186043162.29 MB

It seems to be a calculation error caused by mixing signed and unsigned integers. Change `int` to `size_t` in [these](https://github.com/ggerganov/ggml/blob/758471b22630cc037244dbe1961a87097988aa75/examples/gpt-neox/main.cpp#LL159C1-L162C45) lines:

```
const int...
```
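To illustrate the failure mode (a minimal sketch with made-up hyperparameters, not the actual `main.cpp` code): with 32-bit `int` intermediates, the size computation overflows once the product exceeds `INT_MAX`, producing the kind of absurd MB figure quoted above; promoting the operands to `size_t` keeps the whole chain in 64-bit arithmetic:

```cpp
#include <cstdio>
#include <cstddef>

int main() {
    // hypothetical hyperparameters, chosen so the product exceeds INT_MAX
    const int n_embd  = 6144;
    const int n_layer = 44;
    const int n_ctx   = 8192;

    // buggy: all operands are int, so the product overflows
    // (technically UB; wraps around in practice)
    const int bad = n_ctx * n_layer * n_embd * (int)sizeof(float);

    // fixed: the leading size_t cast promotes the chain to 64 bits
    const size_t good = (size_t)n_ctx * n_layer * n_embd * sizeof(float);

    printf("int:    %d bytes (overflowed)\n", bad);
    printf("size_t: %zu bytes\n", good);
    return 0;
}
```

The safe pattern is to cast the first operand to `size_t` (or declare the variables as `size_t` throughout) before any multiplication takes place.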