llama.cpp
Implement MosaicML's MPT-7B model.
Comparable to LLaMA in results, I believe, and also available for commercial use!
https://huggingface.co/mosaicml/mpt-7b
https://www.mosaicml.com/blog/mpt-7b
Let's start with a basic inference example in the ggml repo.
If it lives up to the hype, we can think about also integrating it into llama.cpp
so we get all the infrastructure benefits, or maybe something better depending on the results.
Licensed as Apache 2.0, and with a context length of 65k! Yes, it would be great to have this supported in llama.cpp.
MPT-7B-StoryWriter-65k+ is a model designed to read and write stories with super long context lengths. It was built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the books3 dataset. At inference time, thanks to ALiBi, MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens, and we have demonstrated generations as long as 84k tokens on a single node of A100-80GB GPUs.
that's a looooong context.
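For reference, here is a minimal NumPy sketch of the ALiBi bias mentioned in the quote above, assuming the standard formulation from the ALiBi paper (one fixed slope per head, bias proportional to the key-query distance); this is an illustration, not the llm-foundry code. Because the bias depends only on relative distance, the same slopes apply at positions never seen during training, which is what allows extrapolation past the 65k training context.

```python
# Sketch of ALiBi (Attention with Linear Biases). Slope schedule 2^(-8*h/n_heads)
# follows the ALiBi paper; head count is MPT-7B's 32 heads (assumption from the
# model card). Not the actual llm-foundry implementation.
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Per-head additive attention bias: slope * (key_pos - query_pos)."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]   # entry (i, j) = j - i, <= 0 for past keys
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(n_heads=32, seq_len=8)
print(bias[0])   # head 0: zeros on the diagonal, increasingly negative further back
```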
As someone who likes to occasionally use LLMs as an aid in writing, the idea of a 65k token context is exciting. Only 7B parameters is a little disappointing, but one thing we're learning is not to judge a model by its parameter count.
One of the first things I did when I found this project was to hack my own custom context reset to restart on a sentence boundary and leave only ~10% free space for context generation instead of 50%, just to keep the context more relevant. It was terribly inefficient, but that's how badly I wanted a longer context length. There's really no substitute for having more (relevant) text in the context.
I also opened an issue for it here: https://github.com/mosaicml/llm-foundry/issues/60
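For illustration, a rough Python sketch of the kind of context-reset hack described above: when the window fills up, keep only the most recent slice and trim forward to a sentence boundary. Everything here (the trim_context name, the character-based window) is hypothetical and simplified; a real version would operate on tokens and the llama.cpp state, not plain strings.

```python
# Hypothetical sketch: keep ~90% of the window and restart on a sentence boundary.
import re

def trim_context(text: str, n_ctx_chars: int, keep_fraction: float = 0.9) -> str:
    """Keep roughly keep_fraction of the window, restarting at a sentence boundary."""
    if len(text) <= n_ctx_chars:
        return text
    keep = int(n_ctx_chars * keep_fraction)
    tail = text[-keep:]
    # Drop any partial sentence at the front of the kept tail.
    match = re.search(r"[.!?]\s+", tail)
    return tail[match.end():] if match else tail

history = "First sentence. Second sentence! Third sentence? " * 200
print(len(trim_context(history, n_ctx_chars=2048)))
```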
@jploski thanks for starting this conversation in ggml and llm-foundry! I agree that adding Mosaic 7B is a great idea! I happened to see you mentioned that you started some work but ran into a tensor-formatting issue.
Would you be open to sharing that branch of ggml? Mostly because I'm eager to learn more about the quantization process, and even if it is not the full implementation, it may be helpful to see others' starting points. Thanks!
FWIW:
https://github.com/jploski/ggml/tree/mpt-experiment/examples/mpt
See the commit comments, the "TODO"s in the source code, and the README.md in examples/mpt for things that I do not understand. The main challenge seems to be that MPT uses a transformer model with customized code (found in their llm-foundry repository), so it is probably silly to expect the stablelm example code to just work. All I did was some (rather uninformed) guessing.
Also note that inference will not even start for mpt-7b-storywriter with the default context length of 65535 - it will just complain about "not enough space in the context's memory pool" and segfault. This can be worked around by specifying a smaller n_ctx (rather than letting it be loaded from the GGML/model config).
Please do not let this first attempt raise false hopes or stand in the way of an actual implementation. I do not feel qualified enough (with respect to understanding transformers) to implement it. At best my branch can save someone some typing for the boilerplate (and at worst it can mislead, too).
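For a sense of scale, a back-of-the-envelope estimate of why the default 65k context overflows the memory pool: the K/V cache alone grows linearly with n_ctx and is already tens of gigabytes at 65536. The figures below assume MPT-7B's 32 layers and d_model of 4096 (from the model card) and an fp16 cache, and ignore ggml's per-tensor overhead and scratch buffers.

```python
# Rough KV-cache size estimate for MPT-7B at various context lengths.
n_layers, d_model = 32, 4096       # MPT-7B per its model card (assumption)
bytes_per_elem = 2                 # assuming an fp16 K/V cache

def kv_cache_bytes(n_ctx: int) -> int:
    # One d_model-sized K vector and one V vector per layer, per position
    return 2 * n_layers * n_ctx * d_model * bytes_per_elem

for n_ctx in (2048, 8192, 65536):
    print(f"n_ctx={n_ctx:6d}: ~{kv_cache_bytes(n_ctx) / 2**30:.1f} GiB")
# prints roughly 1.0 GiB, 4.0 GiB and 32.0 GiB respectively
```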
As someone who likes to occasionally use LLMs as an aid in writing, the idea of a 65k token context is exciting
Just be aware that your RAM may run out and even if you evict to disk, it will be extremely slow due to quadratic scaling.
@jon-chuang I think with ALiBi it will not be quadratic scaling. Correct me if I am wrong.
Generation speed for StoryWriter model:
- at token 1000: about 300 ms per token
- at token 8000: about 2500 ms per token
So when the number of generated tokens is increased 8 times, the generation time per token increases about 8.3 times.
@s-kostyaev ALiBi is a positional encoding method, and has nothing to do with the cost of attention.
https://paperswithcode.com/method/alibi
@klosax exactly, that is quadratic scaling.
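To make the point concrete, here is a plain single-head NumPy attention sketch (not the llm-foundry implementation): ALiBi enters only as an additive bias on the QK^T score matrix, which is still seq_len x seq_len, so per-token cost still grows with the number of cached tokens and total cost stays quadratic.

```python
# Single-head causal attention, with and without an ALiBi bias.
import numpy as np

def attention(q, k, v, bias=None):
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (n, n) score matrix: the quadratic part
    if bias is not None:
        scores = scores + bias              # ALiBi is just an additive (n, n) bias
    n = scores.shape[0]
    scores = np.where(np.tri(n, dtype=bool), scores, -np.inf)   # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

n, d = 16, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
slope = 0.5                                  # example slope for one head
bias = slope * (np.arange(n)[None, :] - np.arange(n)[:, None])
out = attention(q, k, v, bias)               # same O(n^2) work as attention(q, k, v)
```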
Note that StoryWriter (and similarly Claude's 100K context length) is largely impractical at the claimed lengths. I am betting on the next-gen models with log-linear context scaling based on long convolutions to gain prominence. See https://github.com/HazyResearch/safari
Any plans/updates?
Maybe the integration will become easier after Falcon #1602 - because that could be the first non-LLaMA model to obtain llama.cpp support and pave the way for others.
working mpt inference can be found here ggml/examples/mpt
How close is this to building main.exe to work with mpt models?
Just checking in as well: with the ggml example, would we be able to get an implementation? @ggerganov
I think the next big steps that need to happen are:
- Finalize https://github.com/ggerganov/ggml/issues/220 - this will give us a unified model format that will be more future-proof and would make sense to support long term
- Refactor model loading in llama.cpp - currently it is doing too much unnecessary stuff, like supporting old models that no longer exist. The code needs a big refactor and simplification so that we can more easily start loading non-LLaMA models
We should put a big focus on these soon and throughout July and try to bring support for most new models (MPT, Falcon, etc.) into llama.cpp
Alternatively, a quick'n'dirty implementation of MPT in llama.cpp with tons of ifdefs and hacks can be done on a branch relatively quickly. But it is not something we want on master as it will bring further technical debt to the codebase.
Are there any llama.cpp branches working on MPT implementations currently?
As for the ggml unified file format, that's really interesting and I'm trying to understand it better. Could a standard "descriptive file" be developed alongside it to support unknown formats, by describing the hyperparameters of whatever ggml file it is supplied with? I'm just wondering if that even makes sense, as a way to allow non-unified files to work with readers that accept a second "descriptive file".
Could this be easier with the new GGUF format?
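For context on the "descriptive file" idea above: GGUF carries the model hyperparameters as key/value metadata in the file header, so no side file is needed. Below is a minimal Python sketch that reads only the fixed header fields; the layout (magic, u32 version, u64 tensor count, u64 metadata KV count) follows my understanding of the GGUF v2 format, so double-check it against the spec in the ggml repo before relying on it.

```python
# Read the fixed GGUF header fields (assumes GGUF v2+ layout with u64 counts).
import struct
import sys

def read_gguf_header(path: str):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        (version,) = struct.unpack("<I", f.read(4))
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
    return version, n_tensors, n_kv

if __name__ == "__main__":
    print(read_gguf_header(sys.argv[1]))
```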
I just wanted to see if there were any updates on this? It would be great to have MPT Storywriter in Ollama.
I'm also very interested on progress here 😊
We now kind of have a process for adding new models to llama.cpp (see Falcon, StarCoder and Baichuan). Looking for contributions to do something similar for Mosaic.
Some progress, see https://github.com/ggerganov/llama.cpp/pull/3417
(You can help testing by checking out https://github.com/jploski/llama.cpp/tree/mpt)
Maybe this is a good place to note that there is a pretty complete set of GGUF-quantized MPT models available on my Hugging Face account, with a handy mpt-collection.
Implemented in #3417