klosax
Great! :)
How much of the work done in this repo could easily be transferred to future models and architectures? It feels like the happy days of the original LLaMA models...
> I was actually able to convert, quantize and load the model, but there is some tensor math to debug and modify but I have no 40GB gpu to debug...
@nikisalli: On the [model card](https://huggingface.co/tiiuae/falcon-40b#model-architecture-and-objective) it says `head_dim` is 64 ("Reduced to optimise for FlashAttention"), but in the config.json the number is 128. Maybe try reducing it to 64?
Generation speed for the StoryWriter model: at token 1000, about 300 ms per token; at token 8000, about 2500 ms per token. So if the number of tokens generated is increased 8 times, the...
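For what it's worth, the ratio between those two measurements is close to the growth in context length, which would be consistent with the per-token cost growing roughly linearly with the number of tokens already in context (a rough reading of just these two data points, not a profile):

$$
\frac{2500\ \text{ms}}{300\ \text{ms}} \approx 8.3 \approx \frac{8000}{1000}
$$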
We may need the new [Sophia Optimizer](https://arxiv.org/abs/2305.14342), which reports roughly a 2x increase in training speed compared to Adam.
The architecture that Falcon uses is different from those currently supported. More discussion here: https://github.com/ggerganov/llama.cpp/issues/1602
Falcon LLM ggml framework with CPU and GPU support: https://github.com/cmp-nct/ggllm.cpp
> This actually changes a bit more than that PR; feel free to close though

I will close #206.
> gpt_neox_model_load: ggml ctx size = 17592186043162.29 MB

It seems to be a calculation error with signed and unsigned integers. Change `int` to `size_t` in [these](https://github.com/ggerganov/ggml/blob/758471b22630cc037244dbe1961a87097988aa75/examples/gpt-neox/main.cpp#LL159C1-L162C45) lines:

```
const int...
```
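For illustration, here is a minimal standalone sketch of that kind of signed/unsigned mix-up, using made-up hyperparameter values rather than the actual ones from main.cpp: a product computed in 32-bit `int` overflows to a negative value, and adding it to a `size_t` accumulator wraps it around to something close to 2^64, which is the same ballpark as the absurd MB figure above. Doing the arithmetic in `size_t` from the start avoids it.

```cpp
// Sketch of the suspected overflow (hypothetical sizes, not the actual
// hyperparameters or the exact lines from main.cpp).
#include <cstdio>
#include <cstddef>

int main() {
    // made-up hyperparameters, chosen so the int product overflows
    const int n_ctx   = 2048;
    const int n_layer = 80;
    const int n_embd  = 16384;

    size_t ctx_size = 0;

    // 2048 * 80 * 16384 = 2684354560 > INT_MAX: the int product overflows
    // (technically undefined behaviour, in practice it wraps to a negative
    // value), and the negative int then converts to a huge size_t.
    const int n_elements = n_ctx * n_layer * n_embd;
    ctx_size += n_elements * sizeof(float);

    printf("int arithmetic:    %.2f MB\n", ctx_size / (1024.0 * 1024.0));

    // the proposed fix: keep the whole computation in size_t
    ctx_size = (size_t) n_ctx * n_layer * n_embd * sizeof(float);
    printf("size_t arithmetic: %.2f MB\n", ctx_size / (1024.0 * 1024.0));

    return 0;
}
```

With these made-up numbers the first line prints something on the order of 10^13 MB (same order of magnitude as the reported value), while the second prints the intended 10240 MB.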