Support StableLM From StabilityAI
Blog Post Announcement (It may be using the same architecture as GPT-NeoX)
In case these links 404 due to being posted early by accident: https://archive.is/ZQszO https://archive.ph/U0Pr8
(Checkpoint links are Hugging Face repos with model weights)
Size | StableLM-Base-Alpha | StableLM-Tuned-Alpha | Training Tokens [in progress] | Context Window | Web Demo |
---|---|---|---|---|---|
3B | checkpoint | checkpoint | 800B [1.5T]* | 4096 | |
7B | checkpoint | checkpoint | 800B [1.5T]* | 4096 | HuggingFace |
15B | (in progress) | (pending) | 1.5T* | | |
30B | (in progress) | (pending) | 1.5T* | | |
65B | (in progress) | (pending) | 1.5T* | | |
175B | (planned) | | | | |
*3T Planned
are they just new GPT-NeoX models? or did they forget to update the model cards on hf? :smile:
related https://github.com/ggerganov/ggml/issues/10
This was quick! 😅
They've included a note in the README indicating that compatibility with llama.cpp is actively desired. :)
EDIT: related HN thread https://news.ycombinator.com/item?id=35629127
Will these models be compatible with llama.cpp?
Definitely interested in this. Interesting that they specifically highlight wanting llama.cpp/ggml support.
If it really is GPT-NeoX, this repo has conversion, quantization, and support for basic inference for GPT-NeoX and other model formats: https://github.com/NolanoOrg/cformers/blob/master/cformers/cpp/converters/convert_gptneox_to_ggml.py https://github.com/NolanoOrg/cformers/blob/master/cformers/cpp/quantize_gptneox.cpp
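For context, such a converter essentially just re-serializes the Hugging Face checkpoint into a flat binary of tensors. A minimal sketch of the idea is below; this is not the linked script, and the magic value, header layout, and model id are illustrative assumptions only:

```python
# Rough sketch of what such a converter does: load the HF checkpoint and dump
# every tensor (name, shape, fp16 data) into one binary file. The magic value
# and header layout here are illustrative only; the real format is defined by
# the scripts linked above. The model id is an assumption.
import struct
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-3b")

with open("stablelm-3b-f16.bin", "wb") as fout:
    fout.write(struct.pack("i", 0x67676D6C))  # placeholder magic ("ggml")
    for name, tensor in model.state_dict().items():
        data = tensor.to(torch.float16).numpy()
        name_bytes = name.encode("utf-8")
        fout.write(struct.pack("ii", data.ndim, len(name_bytes)))
        for dim in reversed(data.shape):
            fout.write(struct.pack("i", dim))
        fout.write(name_bytes)
        data.tofile(fout)
```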
Here is a very quick and dirty implementation using ggml: https://github.com/ggerganov/ggml/pull/96
Also, found a bug in multi-threaded ggml_cpy(): https://github.com/ggerganov/ggml/pull/96/files#diff-b4a500ab2765c31526c5541f3e51e21e46990b87d9774cac6f3089db315bdc5bR5655-R5660
> are they just new GPT-NeoX models? or did they forget to update the model cards on hf? :smile:
Is it?
Yes, it's using the GPT-NeoX architecture. The model details can be seen here: https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-base-alpha-7b.yaml
# model settings
"num-layers": 16,
"hidden-size": 6144,
"num-attention-heads": 48,
"seq-length": 4096,
"max-position-embeddings": 4096,
# architecture design
"norm": "layernorm",
"pos-emb": "rotary",
"rotary_pct": 0.25,
"activation": "gelu",
"no-weight-tying": true,
"gpt_j_residual": true,
"output_layer_parallelism": "column",
Merged in ggml: https://github.com/ggerganov/ggml/tree/master/examples/stablelm
The q4_x files output from ggml are not compatible with llama.cpp?
> The q4_x files output from ggml are not compatible with llama.cpp?
It seems so currently.
I've converted/quantized stablelm-tuned-alpha-7b to Q4_3 and it works great with ggml, but llama.cpp throws `error loading model: missing tok_embeddings.weight`; it seems like some support is missing.
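For what it's worth, that error looks like a tensor-naming mismatch rather than a quantization problem: llama.cpp's loader expects LLaMA-style tensor names such as tok_embeddings.weight, while a GPT-NeoX/StableLM checkpoint names its tensors differently. The rough correspondence below is my own guess for illustration, not an official mapping:

```python
# Illustrative only: a guess at how LLaMA-style names (expected by llama.cpp)
# might line up with GPT-NeoX-style names (found in StableLM checkpoints).
llama_names = ["tok_embeddings.weight", "norm.weight", "output.weight"]
neox_names = ["gpt_neox.embed_in.weight",
              "gpt_neox.final_layer_norm.weight",
              "embed_out.weight"]
for llama_name, neox_name in zip(llama_names, neox_names):
    print(f"{llama_name:24} <-> {neox_name}")
```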
I am getting the same error.
Are you using the dedicated stablelm binary? From the looks of it, it's separate from llama.cpp: https://github.com/ggerganov/ggml/tree/master/examples/stablelm
Are there plans to integrate ggml/examples/stablelm into llama.cpp? It would also be great if a single llama.cpp binary could run gpt-2 and gpt-j as well.
There seems to be a bug in the existing StableLM implementation in ggml. See the updated README for more details: https://github.com/ggerganov/ggml/tree/master/examples/stablelm#warning
The best way to fix this is to compare outputs with the reference implementation. Any help will be appreciated.
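One cheap source of reference numbers is to run the same prompt through the HF transformers implementation and print the top next-token logits, then compare against what the ggml example produces. A sketch of that is below; the model id and prompt are my own choices, not anything prescribed by the example:

```python
# Print the top next-token logits from the HF reference implementation so
# they can be compared against the ggml example's output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-base-alpha-7b"  # assumption: base alpha 7B
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]       # logits for the next token

top = torch.topk(logits, 5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tok.decode([idx])!r}: {score:.4f}")
```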
So, I ran the HF transformers implementation and I observed the same "increasing magnitude" behaviour as in the ggml implementation.
To do this, I changed the following line:
https://github.com/huggingface/transformers/blob/c2c99dc7ef5edab8f7674a1eb00cf6ac6996fd0f/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L234
to:
print(attn_scores);
attn_weights = nn.functional.softmax(attn_scores, dim=-1)
Here is the output log from a sample run:
For comparison, here is running GPT-2 using HF transformers with the same change:
Notice how the GPT-2 values are all well below 1e1 for each layer, while the StableLM numbers jump all the way up to 1e3.
The GPT-2 behaviour is also observed for GPT-J and LLaMA models (these are the models that I currently play with the most). To me, it kind of makes sense to be this way and it seems to be correct, while the StableLM numbers are weird.
So is my understanding incorrect or is there something wrong with the StableLM model?
In any case, I no longer think there is a bug in the ggml implementation.
I believe this behavior is correct and is a result of how the models were trained. The text output seems to be coherent and the values only rarely converge to -inf. I may be out of line, but is it possible this is normal? I will continue to look into this, but I doubt softmax would work at all if this were a major issue. If you have any further insight, I would love to dive deeper.
> is it possible this is normal?
Absolutely. It's just my intuitive understanding that the scaling before the softmax has the purpose of preventing exactly this kind of magnitude increase. But I could be wrong and this is fine.
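That intuition can be checked numerically with a toy example (my own, unrelated to either codebase): unscaled dot products of random d-dimensional vectors grow like sqrt(d), which is exactly what dividing by sqrt(d_k) before the softmax is meant to cancel out.

```python
# Toy check: the dot product of two random d-dim vectors with unit-variance
# entries has standard deviation ~sqrt(d); dividing by sqrt(d_k) brings the
# scores back to order one before the softmax.
import numpy as np

rng = np.random.default_rng(0)
d = 128                                   # head dim implied by the config
q = rng.standard_normal((10000, d))
k = rng.standard_normal((10000, d))

scores = (q * k).sum(axis=1)              # unscaled dot products
print(scores.std())                       # ~sqrt(128) ~ 11.3
print((scores / np.sqrt(d)).std())        # ~1.0 after 1/sqrt(d_k) scaling
```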