Feature Request: Support Codestral Mamba
Feature Description
New 7B coding model just released by Mistral.
- Blog Post: https://mistral.ai/news/codestral-mamba/
- HF: https://huggingface.co/mistralai/mamba-codestral-7B-v0.1
Motivation
Seems to perform very well, especially for a 7B model (see the benchmarks in the blog post linked above).
Possible Implementation
An extension to https://github.com/ggerganov/llama.cpp/issues/7727?
I love the shout-out in the linked blog post!
> You can deploy Codestral Mamba using the mistral-inference SDK, which relies on the reference implementations from Mamba's GitHub repository. The model can also be deployed through TensorRT-LLM. For local inference, keep an eye out for support in llama.cpp. You may download the raw weights from HuggingFace.
That's a really nice nod -- love to see it!
#7727 should cover this model, but with untied embeddings, unlike the other Mamba-2 models.
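For illustration, here is a minimal sketch of what "untied" means in practice, assuming the usual setup where the output head either shares the token-embedding matrix (tied) or keeps its own weights (untied); the class and parameter names are hypothetical, not llama.cpp code:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    # Hypothetical toy model; only the tied/untied distinction matters here.
    def __init__(self, vocab_size: int, d_model: int, tie_embeddings: bool):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        if tie_embeddings:
            # Tied: the output projection reuses the embedding matrix.
            self.lm_head.weight = self.embed.weight
        # Untied (as for this model): lm_head keeps its own weights,
        # so a converter has to export a separate output tensor.
```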
FYI, there is an "ngroups" param that changes how the layer norm is done: https://github.com/state-spaces/mamba/blob/c0a00bd1808881831ddf43206c69362d4df90cf7/mamba_ssm/modules/mamba2.py#L47
We use ngroups=8. If you forget it, or try with ngroups=1, you'll have a bad time.
Good luck!
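To make the ngroups effect concrete, here is a rough sketch of a grouped RMSNorm, assuming the inner dimension is split into ngroups groups that are each normalized independently; the function name and shapes are illustrative, not the actual mamba_ssm or llama.cpp code:

```python
import torch

def grouped_rmsnorm(x, weight, ngroups=8, eps=1e-5):
    # x: [batch, d_inner], weight: [d_inner]
    # Split the inner dimension into `ngroups` groups and normalize each
    # group by its own RMS. With ngroups=1 this degenerates to a plain
    # RMSNorm over the whole dimension, which gives different results
    # than the ngroups=8 the model was trained with.
    batch, d_inner = x.shape
    xg = x.view(batch, ngroups, d_inner // ngroups)
    rms = torch.rsqrt(xg.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (xg * rms).view(batch, d_inner) * weight
```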
After we merge https://github.com/ggerganov/llama.cpp/pull/8526 we should try to add full support for this model. cc @compilade
I'd love this.
thanks!
Hey, any progress or ETA on this?
For anyone else: it seems this is waiting on https://github.com/ggerganov/llama.cpp/pull/8526, which is waiting on https://github.com/ggerganov/llama.cpp/pull/8980, which is waiting on review(?).
Some progress report: I have a local branch (not yet public) on top of #8526 in which I've started implementing the graph for Mamba-2. The conv step is very similar to Mamba-1, and I've started to implement the SSM step and will continue in the next days. It's not in a usable state yet.
I'm starting by implementing the fully recurrent mode of Mamba-2 (which is very similar to Mamba-1, and which is described in Section 3.4.1 of the Mamba-2 paper).
But I'm still evaluating how the block decomposition would fit within how src/llama.cpp manages batches, and whether the chunk size should be dynamic. It seems that to fully benefit from the block decomposition (Section 6), the chunks should be smaller than the batch size, but not too small; below a certain size, directly doing the recurrence costs about the same. Since the ggml compute graph nodes should keep the same structure between batches, and since the block decomposition will likely have too much overhead for small batches, it's easier to simply go with the linear recurrence with something like ggml_ssm_scan at first.
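For readers unfamiliar with the fully recurrent mode, here is a rough Python sketch of the math it computes, assuming the per-head scalar decay of Section 3.4.1; the shapes are illustrative and this is not the actual ggml_ssm_scan kernel:

```python
import torch

def ssm_scan_recurrent(x, dt, A, B, C):
    # x:  [seq, n_head, head_dim]   inputs
    # dt: [seq, n_head]             per-head time steps (already softplus'd)
    # A:  [n_head]                  per-head scalar decay (negative)
    # B:  [seq, n_group, d_state]   input projection
    # C:  [seq, n_group, d_state]   output projection
    seq, n_head, head_dim = x.shape
    n_group, d_state = B.shape[1], B.shape[2]
    heads_per_group = n_head // n_group
    h = torch.zeros(n_head, head_dim, d_state)  # recurrent state
    ys = []
    for t in range(seq):
        dA = torch.exp(dt[t] * A)                        # state decay, [n_head]
        Bt = B[t].repeat_interleave(heads_per_group, 0)  # [n_head, d_state]
        Ct = C[t].repeat_interleave(heads_per_group, 0)  # [n_head, d_state]
        dBx = torch.einsum('hn,hd->hdn', dt[t, :, None] * Bt, x[t])
        h = dA[:, None, None] * h + dBx                  # state update
        ys.append(torch.einsum('hdn,hn->hd', h, Ct))     # readout
    return torch.stack(ys)                               # [seq, n_head, head_dim]
```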
For the ETA, I'll try to get it working before the end of August, but no promises.
(and BTW @rmusser01, #8980 is waiting on #8526, not the other way around, at least I think?)
Okay, the fully recurrent mode works for Mamba-2! (for the curious, see this branch: https://github.com/compilade/llama.cpp/tree/compilade/mamba2) I'll open a PR soon (in the next days; still need to clean up some things).
Note that Mamba-Codestral-7B-v0.1 cannot be converted as-is; either use https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/discussions/9, or rename consolidated.safetensors to model.safetensors, tokenizer.model.v3 to tokenizer.model, and params.json to config.json. Then, in config.json, the line "architectures": ["Mamba2ForCausalLM"], needs to be added (if missing).
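If it helps, here is a small helper sketch mirroring the manual steps above; the local directory name is an assumption, and the file names are the ones listed in the comment:

```python
import json
from pathlib import Path

model_dir = Path("Mamba-Codestral-7B-v0.1")  # assumed local download directory

# Rename the files so the converter recognizes them.
(model_dir / "consolidated.safetensors").rename(model_dir / "model.safetensors")
(model_dir / "tokenizer.model.v3").rename(model_dir / "tokenizer.model")
(model_dir / "params.json").rename(model_dir / "config.json")

# Add the "architectures" entry to config.json if it is missing.
config_path = model_dir / "config.json"
config = json.loads(config_path.read_text())
config.setdefault("architectures", ["Mamba2ForCausalLM"])
config_path.write_text(json.dumps(config, indent=2))
```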
The state in Mamba-2 is bigger than I thought; Mamba-Codestral-7B-v0.1 takes 263.5 MiB (in F32) per sequence (e.g. with -np 1), compared to 38 MiB for Falcon-Mamba-7B (which is based on Mamba-1). But that stays constant regardless of the context size.
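As a back-of-the-envelope check of that number, assuming the published Codestral Mamba hyperparameters (n_layer=64, d_model=4096, expand=2, d_state=128, ngroups=8, d_conv=4) and that the convolution state covers the concatenated x, B and C channels as in the reference implementation:

```python
n_layer, d_model, expand, d_state, ngroups, d_conv = 64, 4096, 2, 128, 8, 4
d_inner = expand * d_model  # 8192

# Per-layer recurrent state, in F32 elements:
ssm_state = d_inner * d_state                                   # SSM state
conv_state = (d_conv - 1) * (d_inner + 2 * ngroups * d_state)   # conv window

total_bytes = n_layer * (ssm_state + conv_state) * 4            # F32 = 4 bytes
print(total_bytes / 2**20)  # ~263.5 MiB per sequence
```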
A big downside right now with recurrent models in llama.cpp is the lack of state rollback (which is implemented through state checkpoints in #7531, but needs to be re-adapted to #8526), so the prompt will be reprocessed a lot when using llama-server. I think llama-cli in conversation mode does not have this problem, however (or maybe only the bare interactive mode with --in-prefix and --in-suffix, not sure).
The implementation is CPU-only, but uses SIMD for the SSM scan, so even though the state is bigger than for Mamba-1 models, in my tests the speed of Mamba-2-130M is similar to or better than Mamba-130M (but still not that fast compared to transformer-based models with an empty context).
The speed of Mamba-2 models seems comparable to Transformer-based models when the latter have 2k to 4k tokens in their context.
Just making sure expectations are not too far from reality.
This issue was closed because it has been inactive for 14 days since being marked as stale.
We need this. Any news on when this will be available?
Would love to get an update on this.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Heads up that #15625 fixes a problem in the implementation of SSM_SCAN, which makes this model (Mamba-Codestral-7B-v0.1) perform better than when it was initially implemented here.
So if you formed initial impressions of this model back when support first landed here, it might be worth revisiting it after the recent changes.