
Implement MosaicML's MPT-7B model.

Open rjb7731 opened this issue 1 year ago • 17 comments

Comparable to LLaMA in results, I believe, and also commercially available for use!

https://huggingface.co/mosaicml/mpt-7b

https://www.mosaicml.com/blog/mpt-7b

rjb7731 avatar May 05 '23 17:05 rjb7731

Let's start with a basic inference example in the ggml repo.

If it lives up to the hype, we can think about also integrating it into llama.cpp so we get all the infrastructure benefits, or maybe something better depending on the results.

ggerganov avatar May 05 '23 20:05 ggerganov

Licensed as Apache 2.0, and with a context length of 65k! Yes, it would be great to have this supported in llama.cpp.

sbsce avatar May 05 '23 22:05 sbsce

MPT-7B-StoryWriter-65k+ is a model designed to read and write stories with super long context lengths. It was built by finetuning MPT-7B with a context length of 65k tokens on a filtered fiction subset of the books3 dataset. At inference time, thanks to ALiBi, MPT-7B-StoryWriter-65k+ can extrapolate even beyond 65k tokens, and we have demonstrated generations as long as 84k tokens on a single node of A100-80GB GPUs.

that's a looooong context.

Green-Sky avatar May 06 '23 09:05 Green-Sky

As someone who likes to occasionally use LLMs as an aid in writing, the idea of a 65k token context is exciting. Only 7B parameters is a little disappointing, but one thing we're learning is not to judge a model by its parameter count.

One of the first things I did when I found this project was to hack my own custom context reset to restart on a sentence boundary and leave only ~10% free space for context generation instead of 50%, just to keep the context more relevant. It was terribly inefficient, but that's how badly I wanted a longer context length. There's really no substitute for having more (relevant) text in the context.
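
For a sense of what that amounts to, here is a rough standalone sketch of such a reset (the function name, the character-based handling, and the ~90% threshold are all illustrative, not the actual hack):

```cpp
// Illustrative sketch of a "reset on a sentence boundary" strategy: when the
// accumulated text outgrows the window, keep only the most recent ~90% and
// drop everything up to the first sentence boundary so generation restarts
// on a clean sentence. Not the actual patch described above.
#include <cstddef>
#include <string>

std::string reset_context(const std::string & history, std::size_t window_chars) {
    const std::size_t keep = window_chars * 9 / 10;    // leave ~10% free for new generation
    if (history.size() <= keep) {
        return history;                                // still fits, nothing to do
    }
    std::string tail = history.substr(history.size() - keep);
    const std::size_t cut = tail.find_first_of(".!?"); // restart on a sentence boundary
    if (cut != std::string::npos && cut + 1 < tail.size()) {
        tail.erase(0, cut + 1);
    }
    return tail;
}
```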

DannyDaemonic avatar May 06 '23 10:05 DannyDaemonic

I also opened an issue for it here: https://github.com/mosaicml/llm-foundry/issues/60

jploski avatar May 06 '23 22:05 jploski

@jploski thanks for starting this conversation in ggml and llm-foundry! I agree that adding Mosaic 7B is a great idea! I happened to see you mentioned that you started some work but ran into a tensor formatting issue.

Would you be open to sharing that branch of ggml? Mostly because I'm eager to learn more about the quantization process, and even if it is not the full implementation, it may be helpful to see others' starting points. Thanks!

drbh avatar May 07 '23 01:05 drbh

FWIW:

https://github.com/jploski/ggml/tree/mpt-experiment/examples/mpt

See the commit comments and the "TODO" notes in the source code and README.md in examples/mpt for the things that I do not understand. The main challenge seems to be that MPT uses a transformer model with customized code (found in their llm-foundry repository), so it is probably silly to expect the stablelm code to just work. All I did was some (rather uninformed) guessing.

Also note that inference will not even start for mpt-7b-storywriter with the default context length of 65535: it will just complain about "not enough space in the context's memory pool" and segfault. But this can be worked around by specifying a smaller n_ctx (rather than letting it be loaded from the GGML/model config).
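
For illustration, the kind of override meant here might look roughly like the following (the hparams field name follows the other ggml examples and is an assumption, not necessarily what the mpt branch uses):

```cpp
// Hypothetical sketch: clamp the context length read from the model file
// before the memory pool / KV cache is sized from it, so that loading
// mpt-7b-storywriter does not immediately exhaust memory.
#include <cstdint>

struct mpt_hparams {
    int32_t n_ctx = 65535;   // as stored in the converted model file
    // ...other hyperparameters...
};

void clamp_n_ctx(mpt_hparams & hparams, int32_t n_ctx_user) {
    if (n_ctx_user > 0 && n_ctx_user < hparams.n_ctx) {
        hparams.n_ctx = n_ctx_user;   // e.g. 2048, keeping the allocation manageable
    }
}
```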

Please do not let this first attempt raise false hopes or stand in the way of an actual implementation. I do not feel qualified enough (with regard to understanding transformers) to implement it. At best my branch can save someone some typing for the boilerplate (and at worst it can mislead, too).

jploski avatar May 07 '23 11:05 jploski

As someone who likes to occasionally use LLMs as an aid in writing, the idea of a 65k token context is exciting

Just be aware that your RAM may run out, and even if you evict to disk, it will be extremely slow due to quadratic scaling.
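
As a rough illustration of the memory pressure (assuming MPT-7B's published shape of 32 layers and a model width of 4096; the exact cache layout in ggml may differ):

```cpp
// Back-of-the-envelope estimate of the fp16 KV cache alone at a 65k context:
// K and V, one vector of width n_embd per layer per position.
#include <cstdio>

int main() {
    const long long n_layer = 32;       // MPT-7B
    const long long n_embd  = 4096;     // MPT-7B
    const long long n_ctx   = 65536;    // StoryWriter-class context
    const long long bytes   = 2 /*K+V*/ * n_layer * n_ctx * n_embd * 2 /*fp16*/;
    std::printf("KV cache: %.1f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0)); // ~32 GiB
    return 0;
}
```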

jon-chuang avatar May 20 '23 05:05 jon-chuang

@jon-chuang I think with ALiBi it will not be quadratic scaling. Correct me if I am wrong.

s-kostyaev avatar May 20 '23 07:05 s-kostyaev

Generation speed for the StoryWriter model:

  • at token 1000: about 300 ms per token
  • at token 8000: about 2500 ms per token

So when the number of tokens generated increases 8 times, the generation time per token increases about 8.3 times.

klosax avatar May 21 '23 23:05 klosax

@s-kostyaev ALiBi is a positional encoding method and has nothing to do with the cost of attention.

https://paperswithcode.com/method/alibi

@klosax exactly, that is quadratic scaling: if the time per token grows linearly with the number of tokens already in the context, the total generation time grows quadratically.
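
A minimal sketch of the point (not llama.cpp code): each newly generated token still scores against every cached key; ALiBi only adds a linear, distance-proportional bias to each score.

```cpp
// Illustrative only: scoring one new token against all cached keys, with an
// ALiBi bias. The loop is O(n_past) per token; summed over a generation it
// is O(n^2) total, exactly as without ALiBi.
#include <cstddef>
#include <vector>

std::vector<float> attn_scores_with_alibi(
        const std::vector<std::vector<float>> & keys,   // one cached key per past token
        const std::vector<float>              & query,  // query of the token being generated
        float slope) {                                   // per-head ALiBi slope, e.g. 2^(-8*(h+1)/n_head)
    const std::size_t n_past = keys.size();
    std::vector<float> scores(n_past);
    for (std::size_t i = 0; i < n_past; ++i) {
        float dot = 0.0f;
        for (std::size_t d = 0; d < query.size(); ++d) {
            dot += query[d] * keys[i][d];
        }
        scores[i] = dot - slope * float(n_past - i);     // bias grows with distance
    }
    return scores;
}
```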

Note that StoryWriter (and similarly Claude's 100K context length) is largely impractical at the claimed lengths. I am betting on next-gen models with log-linear context scaling based on long convolutions to gain prominence. See https://github.com/HazyResearch/safari

jon-chuang avatar May 22 '23 09:05 jon-chuang

Any plans/updates?

acheong08 avatar Jun 21 '23 11:06 acheong08

Any plans/updates?

Maybe the integration will become easier after Falcon #1602 - because that could be the first non-LLaMA model to obtain llama.cpp support and pave the way for others.

jploski avatar Jun 21 '23 12:06 jploski

Working MPT inference can be found here: ggml/examples/mpt

Green-Sky avatar Jun 21 '23 14:06 Green-Sky

Working MPT inference can be found here: ggml/examples/mpt

How close is this to building main.exe to work with mpt models?

tcnevin avatar Jun 22 '23 21:06 tcnevin

Just checking in as well: with the ggml example, would we be able to get an implementation? @ggerganov

Jchang4 avatar Jun 23 '23 20:06 Jchang4

I think the next big steps that need to happen are:

  • Finalize https://github.com/ggerganov/ggml/issues/220 - this will give us a unified model format that will be more future-proof and would make sense to support long term
  • Refactor model loading in llama.cpp - currently it is doing too much extra unnecessary stuff, like supporting old models that no longer exist. The code needs a big refactor and simplification so that we can more easily start loading non-LLaMA models

We should put a big focus on these soon and throughout July, and try to bring support for most new models (MPT, Falcon, etc.) into llama.cpp.

Alternatively, a quick'n'dirty implementation of MPT in llama.cpp with tons of ifdefs and hacks can be done on a branch relatively quickly. But it is not something we want on master, as it will bring further technical debt to the codebase.

ggerganov avatar Jun 24 '23 10:06 ggerganov

Are there any llama.cpp branches working on MPT implementations currently?

As far as the ggml unified file format goes, that's really interesting and I'm trying to understand it better. But could a standard "descriptive file" be developed in conjunction with it, to support unknown formats by describing the hyperparameters of whatever ggml file it is supplied with? I'm just wondering if that even makes sense: allowing non-unified files to work with readers that accept a second "descriptive file", something like the sketch below.
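
To make that concrete, the kind of information such a hypothetical sidecar file would need to carry might look roughly like this (names invented for illustration; nothing like this exists in ggml or llama.cpp):

```cpp
// Purely illustrative: the hyperparameters a hypothetical "descriptive file"
// would have to spell out so a generic reader could interpret a raw ggml file.
#include <cstdint>
#include <string>

struct model_description {
    std::string architecture;       // e.g. "mpt", "llama", "falcon"
    int32_t     n_vocab   = 0;
    int32_t     n_ctx     = 0;      // maximum context length
    int32_t     n_embd    = 0;      // embedding width
    int32_t     n_head    = 0;
    int32_t     n_layer   = 0;
    bool        use_alibi = false;  // positional scheme (ALiBi vs. rotary, etc.)
    std::string tokenizer;          // how to interpret the vocabulary section
};
```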

tcnevin avatar Jul 01 '23 07:07 tcnevin

Could this be easier with the new GGUF format?

mvsoom avatar Sep 13 '23 19:09 mvsoom

I just wanted to see if there were any updates on this. It would be great to have MPT StoryWriter in Ollama.

tony352 avatar Sep 16 '23 17:09 tony352

I'm also very interested in progress here 😊

maddes8cht avatar Sep 26 '23 10:09 maddes8cht

We now kind of have a process for adding new models to llama.cpp (see Falcon, StarCoder and Baichuan). Looking for contributions to do something similar for Mosaic.

ggerganov avatar Sep 27 '23 15:09 ggerganov

Some progress, see https://github.com/ggerganov/llama.cpp/pull/3417

(You can help testing by checking out https://github.com/jploski/llama.cpp/tree/mpt)

jploski avatar Sep 30 '23 17:09 jploski

Maybe it's the place to note that there is a pretty complete set of GGUF-quantized MPT models available at my Hugging Face account, with a handy mpt-collection.

maddes8cht avatar Nov 01 '23 21:11 maddes8cht

Implemented in #3417

Galunid avatar Nov 02 '23 00:11 Galunid