Support StableLM From StabilityAI
Blog Post Announcement (It may be using the same architecture as GPT-NeoX)
In case these links 404 due to being posted early by accident: https://archive.is/ZQszO https://archive.ph/U0Pr8
(Checkpoint links are Hugging Face repos with model weights)
Size | StableLM-Base-Alpha | StableLM-Tuned-Alpha | Training Tokens [in progress] | Context Window | Web Demo |
---|---|---|---|---|---|
3B | checkpoint | checkpoint | 800B [1.5T]* | 4096 | |
7B | checkpoint | checkpoint | 800B [1.5T]* | 4096 | HuggingFace |
15B | (in progress) | (pending) | 1.5T* | | |
30B | (in progress) | (pending) | 1.5T* | | |
65B | (in progress) | (pending) | 1.5T* | | |
175B | (planned) | | | | |
*3T Planned
are they just new GPT-NeoX models? or did they forget to update the model cards on hf? :smile:
related https://github.com/ggerganov/ggml/issues/10
This was quick! 😅
They've included a note in the README indicating that compatibility with llama.cpp is actively desired. :)
EDIT: related HN thread https://news.ycombinator.com/item?id=35629127
Will these models be compatible with llama.cpp?
Definitely interested in this. Interesting that they specifically highlight wanting llama.cpp/ggml support.
If it really is GPT-NeoX, this repo has conversion, quantization, and support for basic inference for GPT-NeoX and other model formats: https://github.com/NolanoOrg/cformers/blob/master/cformers/cpp/converters/convert_gptneox_to_ggml.py https://github.com/NolanoOrg/cformers/blob/master/cformers/cpp/quantize_gptneox.cpp
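For context, such a converter essentially just re-serializes the Hugging Face checkpoint into a flat binary of tensors. A minimal sketch of the idea is below; this is not the linked script, and the magic value, header layout, and model id are illustrative assumptions only:

```python
# Rough sketch of what such a converter does: load the HF checkpoint and dump
# every tensor (name, shape, fp16 data) into one binary file. The magic value
# and header layout here are illustrative only; the real format is defined by
# the scripts linked above. The model id is an assumption.
import struct
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-3b")

with open("stablelm-3b-f16.bin", "wb") as fout:
    fout.write(struct.pack("i", 0x67676D6C))  # placeholder magic ("ggml")
    for name, tensor in model.state_dict().items():
        data = tensor.to(torch.float16).numpy()
        name_bytes = name.encode("utf-8")
        fout.write(struct.pack("ii", data.ndim, len(name_bytes)))
        for dim in reversed(data.shape):
            fout.write(struct.pack("i", dim))
        fout.write(name_bytes)
        data.tofile(fout)
```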
Here is a very quick and dirty implementation using ggml: https://github.com/ggerganov/ggml/pull/96
Also, found a bug in multi-threaded ggml_cpy(): https://github.com/ggerganov/ggml/pull/96/files#diff-b4a500ab2765c31526c5541f3e51e21e46990b87d9774cac6f3089db315bdc5bR5655-R5660
> are they just new GPT-NeoX models? or did they forget to update the model cards on hf? :smile:
Is it?
Yes, it's using the GPT-NeoX architecture. The model details can be seen here: https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-base-alpha-7b.yaml
# model settings
"num-layers": 16,
"hidden-size": 6144,
"num-attention-heads": 48,
"seq-length": 4096,
"max-position-embeddings": 4096,
# architecture design
"norm": "layernorm",
"pos-emb": "rotary",
"rotary_pct": 0.25,
"activation": "gelu",
"no-weight-tying": true,
"gpt_j_residual": true,
"output_layer_parallelism": "column",
Merged in ggml: https://github.com/ggerganov/ggml/tree/master/examples/stablelm
The q4_x files output from ggml are not compatible with llama.cpp?
> The q4_x files output from ggml are not compatible with llama.cpp?
It seems so currently.
I've converted/quantized stablelm-tuned-alpha-7b to Q4_3 and it works great with ggml, but llama.cpp throws `error loading model: missing tok_embeddings.weight`; it seems like some support is missing.
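For what it's worth, that error looks like a tensor-naming mismatch rather than a quantization problem: llama.cpp's loader expects LLaMA-style tensor names such as tok_embeddings.weight, while a GPT-NeoX/StableLM checkpoint names its tensors differently. The rough correspondence below is my own guess for illustration, not an official mapping:

```python
# Illustrative only: a guess at how LLaMA-style names (expected by llama.cpp)
# might line up with GPT-NeoX-style names (found in StableLM checkpoints).
llama_names = ["tok_embeddings.weight", "norm.weight", "output.weight"]
neox_names = ["gpt_neox.embed_in.weight",
              "gpt_neox.final_layer_norm.weight",
              "embed_out.weight"]
for llama_name, neox_name in zip(llama_names, neox_names):
    print(f"{llama_name:24} <-> {neox_name}")
```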
I am getting the same error.
Are you using the dedicated stablelm binary? From the looks of it, it's separate from llama.cpp: https://github.com/ggerganov/ggml/tree/master/examples/stablelm
Are there plans to integrate ggml/examples/stablelm into llama.cpp? It would also be great if a single llama.cpp binary could run gpt-2 and gpt-j as well.
There seems to be a bug in the existing StableLM implementation in ggml. See the updated README for more details: https://github.com/ggerganov/ggml/tree/master/examples/stablelm#warning
The best way to fix this is to compare outputs with the reference implementation. Any help will be appreciated.
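One cheap source of reference numbers is to run the same prompt through the HF transformers implementation and print the top next-token logits, then compare against what the ggml example produces. A sketch of that is below; the model id and prompt are my own choices, not anything prescribed by the example:

```python
# Print the top next-token logits from the HF reference implementation so
# they can be compared against the ggml example's output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-base-alpha-7b"  # assumption: base alpha 7B
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]       # logits for the next token

top = torch.topk(logits, 5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tok.decode([idx])!r}: {score:.4f}")
```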
So, I ran the HF transformers implementation and I observed the same "increasing magnitude" behaviour as in the ggml implementation.
To do this, I changed the following line:
https://github.com/huggingface/transformers/blob/c2c99dc7ef5edab8f7674a1eb00cf6ac6996fd0f/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L234
to:
print(attn_scores);
attn_weights = nn.functional.softmax(attn_scores, dim=-1)
Here is the output log from a sample run:
For comparison, here is running GPT-2 using HF transformers with the same change:
Notice how the GPT-2 values are all well below 1e1 for each layer, while the StableLM numbers jump all the way up to 1e3.
The GPT-2 behaviour is also observed for GPT-J and LLaMA models (these are the models that I currently play with the most). To me, it kind of makes sense to be this way and it seems to be correct, while the StableLM numbers are weird.
So is my understanding incorrect or is there something wrong with the StableLM model?
In any case, I no longer think there is a bug in the ggml implementation.
I believe this behavior is correct and is a result of how the models were trained. The text output seems to be coherent and the values only rarely converge to -inf. I may be out of line, but is it possible this is normal? I will continue to look into this, but I doubt softmax would work at all if this were a major issue. If you have any further insight, I would love to dive deeper.
> is it possible this is normal?
Absolutely. It's just my intuitive understanding that the scaling before the softmax has the purpose of preventing exactly this kind of magnitude increase. But I could be wrong and this is fine.
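That intuition can be checked numerically with a toy example (my own, unrelated to either codebase): unscaled dot products of random d-dimensional vectors grow like sqrt(d), which is exactly what dividing by sqrt(d_k) before the softmax is meant to cancel out.

```python
# Toy check: the dot product of two random d-dim vectors with unit-variance
# entries has standard deviation ~sqrt(d); dividing by sqrt(d_k) brings the
# scores back to order one before the softmax.
import numpy as np

rng = np.random.default_rng(0)
d = 128                                   # head dim implied by the config
q = rng.standard_normal((10000, d))
k = rng.standard_normal((10000, d))

scores = (q * k).sum(axis=1)              # unscaled dot products
print(scores.std())                       # ~sqrt(128) ~ 11.3
print((scores / np.sqrt(d)).std())        # ~1.0 after 1/sqrt(d_k) scaling
```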